facebookresearch / DPR

Dense Passage Retriever is a set of tools and models for the open-domain Q&A task.

Error when running train_reader -- ValueError: a must be greater than 0 unless no samples are taken #25

Closed: aarzchan closed this issue 4 years ago

aarzchan commented 4 years ago

Hi! I get the following error when running train_reader.py:

Total iterations per epoch=1237
 Total updates=24720
  Eval step = 2000
***** Training *****
***** Epoch 0 *****
Traceback (most recent call last):
  File "train_reader.py", line 507, in <module>
    main()
  File "train_reader.py", line 498, in main
    trainer.run_train()
  File "train_reader.py", line 126, in run_train
    global_step = self._train_epoch(scheduler, epoch, eval_step, train_iterator, global_step)
  File "train_reader.py", line 225, in _train_epoch
    is_train=True, shuffle=True)
  File "/home/aarchan/dpr_aug/DPR-Aug/dpr/models/reader.py", line 134, in create_reader_input
    is_random=shuffle)
  File "/home/aarchan/dpr_aug/DPR-Aug/dpr/models/reader.py", line 193, in _create_question_passages_tensors
    positive_idx = _get_positive_idx(positives, max_len, is_random)
  File "/home/aarchan/dpr_aug/DPR-Aug/dpr/models/reader.py", line 175, in _get_positive_idx
    positive_idx = np.random.choice(len(positives)) if is_random else 0
  File "mtrand.pyx", line 894, in numpy.random.mtrand.RandomState.choice
ValueError: a must be greater than 0 unless no samples are taken

This error occurs right after train_reader.py successfully loads all of the preprocessed .pkl reader data files. Could you please help me resolve this issue?

vlad-karpukhin commented 4 years ago

Hi Aaron, it looks like one of the samples has 0 positive passages, and that breaks our reader batch logic. Are you using your own dataset, or have you somehow modified our provided ones?
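If it helps, here is a quick way to check a retriever-results json for such samples (a minimal sketch: the file name is a placeholder, and "positive" here only approximates the reader's matching logic via "has_answer"):

import json

# count samples with no positive contexts; the file name is a placeholder
with open("trivia-train.json") as f:
    samples = json.load(f)

no_positives = [s["question"] for s in samples
                if not any(ctx["has_answer"] for ctx in s["ctxs"])]
print(f"{len(no_positives)} of {len(samples)} samples have no positive passage")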

aarzchan commented 4 years ago

No, I'm using the NQ dataset without any modifications to the code or dataset. As far as I can tell, I've used the recommended args in all of the previous steps, except for changing batch_size and gradient_accumulation_steps.

Also, I tried running train_reader.py on TriviaQA without preprocessing (since no gold passage info is provided for TriviaQA), but I instead get the following error:

***** Training *****
***** Epoch 0 *****
Traceback (most recent call last):
  File "train_reader.py", line 507, in <module>
    main()
  File "train_reader.py", line 498, in main
    trainer.run_train()
  File "train_reader.py", line 126, in run_train
    global_step = self._train_epoch(scheduler, epoch, eval_step, train_iterator, global_step)
  File "train_reader.py", line 225, in _train_epoch
    is_train=True, shuffle=True)
  File "/home/aarchan/qa-aug/qa-aug/dpr/models/reader.py", line 134, in create_reader_input
    is_random=shuffle)
  File "/home/aarchan/qa-aug/qa-aug/dpr/models/reader.py", line 205, in _create_question_passages_tensors
    positive_input_ids = _pad_to_len(positives[positive_idx].sequence_ids, pad_token_id, max_len)
  File "/home/aarchan/qa-aug/qa-aug/dpr/models/reader.py", line 162, in _pad_to_len
    s_len = seq.size(0)
TypeError: 'int' object is not callable
vlad-karpukhin commented 4 years ago

The second error says that the preprocessed question+passage tokens tensor is an int instead of a tensor. There seems to be something wrong with the pickled files. Which resource do you use - the already preprocessed .pkl files from our downloader, or json files which were pickled when you started the training (i.e., what is your --train_file pointing to)?

Both should be fine, just want to narrow the scope to reproduce your issue.

aarzchan commented 4 years ago

For --train_file in the TriviaQA run, I used the json file output by dense_retriever.py.

vlad-karpukhin commented 4 years ago

Thanks Aaron, will try to reproduce it today

aarzchan commented 4 years ago

Thanks for looking into this!

vlad-karpukhin commented 4 years ago

We actually only provide already preprocessed Trivia data for the reader training, since our best scheme was to use hybrid retriever results for that dataset. Hybrid means bm25+dense. I downloaded our data.reader.trivia.multi-hybrid.train resource and was able to start the reader training without any issues.

How did you generate retriever results for Trivia? Specifically, I need to know which bi-encoder checkpoint you used and how you created the wikipedia encodings (did you use our provided ones or generate them yourself?).

It would also be great if you could share (maybe only a part of) your json file with the retriever results.

aarzchan commented 4 years ago

To generate retriever results for TriviaQA, I ran dense_retriever.py as in the previous steps.

Here is a snippet of the retriever results json file for the train set:

[ { "question": "Who was President when the first Peanuts cartoon was published?", "answers": [ "Presidency of Harry S. Truman", "Hary truman", "Harry Shipp Truman", "Harry Truman's", "Harry S. Truman", "Harry S.Truman", "Harry S Truman", "H. S. Truman", "President Harry Truman", "Truman administration", "Presidency of Harry Truman", "Mr. Citizen", "HST (president)", "H.S. Truman", "Mary Jane Truman", "Harry Shippe Truman", "S truman", "Harry Truman", "President Truman", "33rd President of the United States", "Truman Administration", "Harry Solomon Truman", "Harold Truman", "Harry truman", "H. Truman" ],
"ctxs": [ {
"id": "13205656", "title": "Daffy Duck for President", "text": "and Daffy. It was also planned for a worldwide theatrical release, but these plans were aborted after \"\" resulted in a financial disaster. However, it saw eventual release as a bonus feature on disc 3 of the \"\". The short is also available as a bonus feature on \"The Essential Daffy Duck\" DVD set. Daffy Duck for President Daffy Duck for President is a children's book, published by Warner Bros. and the United States Postal Service in 1997 to coincide with the release of the first Bugs Bunny U.S. postage stamp. The book was written and illustrated by Chuck Jones,", "score": "77.905495", "has_answer": false },
{
"id": "13205655", "title": "Daffy Duck for President", "text": "the explosive fun of the author's art, \"Daffy Duck for President\" is presented in its original sketch form. It's pure Chuck Jones - brash, witty, and reflective - in a bold flash of pencil, paper, and impulse.\" In 2004, Warner Bros. released a four-minute animated short of the same name based on the book, coinciding with the Presidential election that year. The film was produced by Spike Brandt, Tony Cervone, and Linda M. Steiner, and was dedicated to Chuck Jones (who died in 2002). It was considered for a 2005 Academy Award for Best Animated Short. Joe Alaskey voiced Bugs", "score": "75.750336", "has_answer": false },
{
"id": "13205653", "title": "Daffy Duck for President", "text": "Daffy Duck for President Daffy Duck for President is a children's book, published by Warner Bros. and the United States Postal Service in 1997 to coincide with the release of the first Bugs Bunny U.S. postage stamp. The book was written and illustrated by Chuck Jones, edited by Charles Carney, and art directed by Allen Helbig. Echoing the popular \"Rabbit Season/Duck Season\" scenes from the Bugs Bunny/Daffy Duck/Elmer Fudd shorts \"Rabbit Fire\", \"Rabbit Seasoning\" and \"Duck! Rabbit, Duck!\", the book tells how Daffy Duck, in an effort to outlaw Duck Season in favor of a perpetual Rabbit Season, attempts to", "score": "75.57904", "has_answer": false },
{
"id": "8099128", "title": "Samuel Adams Wiggin", "text": "Samuel Adams Wiggin Samuel Adams Wiggin (1832\u20131899) was an American poet. Samuel Wiggin pursued a military career until Vice President Andrew Johnson became president after the assassination of President Lincoln in 1865. He was then appointed Executive Clerk to President Johnson. A position which he held for eight years and three months. Many of his poems were published in various newspapers of the day and were written in Poets' Corner in the White House. His published work has over 200 poems, including a satirical piece against Darwin's \"On the Origin of Species\". Other subjects include patriotism, the Rebellion (civil war),", "score": "74.72014", "has_answer": false },

Here is a snippet of the retriever results json file for the dev set:

[ { "question": "The VS-300 was a type of what?", "answers": [ "\ud83d\ude81", "Helicopters", "Civilian helicopter", "Pescara (helicopter)", "Cargo helicopter", "Copter", "Helecopter", "List of deadliest helicopter crashes", "Helichopper", "Helocopter", "Cargo Helicopter", "Helicopter", "Helicoptor", "Anatomy of a helicopter" ],
"ctxs": [ {
"id": "17508178", "title": "AVIC VSTOL UAVs", "text": "AVIC VSTOL UAVs AVIC VSTOL UAVs are Chinese VSTOL experimental UAVs developed by Aviation Industry Corporation of China (AVIC). Although all of these proof-of-concept designs are currently unmanned, the developer has claimed that if proved to be successful, some of them might be developed into manned versions in the future. Blue Whale (Lan-Jing or Lanjing, \u84dd\u9cb8) is a projected Chinese quad tilt-rotor proposed by China Helicopter Research and Development Institute (CHRDI, \u4e2d\u822a\u5de5\u4e1a\u76f4\u5347\u673a\u8bbe\u8ba1\u7814\u7a76\u6240) of AVIC. The eventual goal is to have an VTOL aircraft with takeoff weight of 60 tons, and payload of 20 to 30 tons, speed of 538 km/hr,", "score": "82.161354", "has_answer": true },
{
"id": "8292890", "title": "VS-N fuzed mines", "text": "VS-N fuzed mines The VS-2.2, VS-3.6 and SH-55 are Italian circular plastic cased anti-tank blast mines that use the VS-N series fuze. They have very few metal components and are resistant to overpressure and shock. The VS-2.2 and VS-3.6 can also be deployed from helicopters. It was produced by Valsella Meccanotecnica and Singapore, but production has ceased. The VS-2.2 and VS-3.6 are essentially the same, the VS-3.6 being slightly larger, the SH-55 is larger still and has a more rounded appearance. A smaller mine, the VS-1.6 also uses the same fuze. The mines are normally olive green or sand coloured", "score": "81.56846", "has_answer": true },
{
"id": "7108855", "title": "Vought-Sikorsky VS-300", "text": "was found that the VS-300 could not fly forward easily and Sikorsky joked about turning the pilot's seat around. Sikorsky fitted utility floats (also called pontoons) to the VS-300 and performed a water landing and takeoff on 17 April 1941, making it the first practical amphibious helicopter. On 6 May 1941, the VS-300 beat the world endurance record held by the Focke-Wulf Fw 61, by staying aloft for 1 hour 32 minutes and 26.1 seconds. The final variant of the VS-300 was powered by a 150 hp Franklin engine. The VS-300 was one of the first helicopters capable of carrying", "score": "81.09697", "has_answer": true },
{
"id": "8435447", "title": "VF-301", "text": "VF-301 VF-301 Fighter Squadron 301 was an aviation unit of the United States Naval Reserve in service from 1970 to 1994. The squadron's nickname was \"Devil's Disciples\". VF-301 was activated on 1 October 1970 and assigned to Carrier Air Wing Reserve 30 (CVWR-30) (tail code \"ND\") at Naval Air Station Miramar, California (USA). The squadron first flew the F-8L Crusader and later transitioned to the F-4B Phantom in 1974. Their time with the B-models was short and VF-301 soon received the F-4N in February 1975. In 1980, VF-301 received the most advanced F-4 variant in the U.S. Navy, the F-4S.", "score": "80.76257", "has_answer": false },
{
"id": "2858179", "title": "S-300 missile system", "text": "\u0421-300\u0412 \u0410\u043d\u0442\u0435\u0439-300\" \u2013 named after \"Antaeus\", NATO reporting name \"SA-12 Gladiator/Giant\") varies from the other designs in the series. This complex is not part of the C-300, including is designed by another developer. It was built by Antey rather than Almaz, and its 9M82 and 9M83 missiles were designed by NPO Novator. The \"V\" suffix stands for \"Voyska\" (ground forces). It was designed to form the top tier army air defence system, providing a defence against ballistic missiles, cruise missiles and aircraft, replacing the \"SA-4 Ganef\". The \"GLADIATOR\" missiles have a maximum engagement range of around while the \"GIANT\" missiles", "score": "80.692894", "has_answer": false },

vlad-karpukhin commented 4 years ago

I merged the 2 samples above into a single json file and then ran train_reader.py. It correctly skipped the first sample since it has no positives, serialized and deserialized the data correctly, and the training started as expected without any issues. My only current guess is that there is some strange pickle behavior in your case.

Can you share your pickled results file somehow?

You may just pick several samples from the json to create another json file and use it as the --train_file (a minimal sketch of this is below). Start training and check that it preprocesses and serializes that file. You should see the following messages in the train output log:


Data are not preprocessed for reader training. Start pre-processing ...
...
Preprocessed data stored in {path to pkl files}

And then upload one of those .pkl files in a branch here.
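For the first step, slicing a few samples out of the retriever results could look like this (a minimal sketch; both file names are placeholders):

import json

# take the first few samples from the retriever-results json
# to build a small --train_file; file names are placeholders
with open("trivia-dev.json") as f:
    samples = json.load(f)

with open("trivia-dev.small.json", "w") as f:
    json.dump(samples[:5], f, indent=4)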

aarzchan commented 4 years ago

I followed your instructions and made a smaller json file out of the first five samples of the trivia-dev json file generated by dense_retriever.py. When running train_reader.py without first doing preprocessing, I got the messages you quoted above, and the json file seemed to be serialized successfully. However, this time I get a different error:

Reading file ../save/baseline_dpr/retriever_results/trivia/ret_results__trivia-dev_tmp.3.pkl
Aggregated data size: 2
Traceback (most recent call last):
  File "train_reader.py", line 507, in <module>
    main()
  File "train_reader.py", line 498, in main
    trainer.run_train()
  File "train_reader.py", line 126, in run_train
    global_step = self._train_epoch(scheduler, epoch, eval_step, train_iterator, global_step)
  File "train_reader.py", line 225, in _train_epoch
    is_train=True, shuffle=True)
  File "/home/aarchan/qa-aug/DPR/dpr/models/reader.py", line 144, in create_reader_input
    input_ids = torch.cat([ids.unsqueeze(0) for ids in input_ids], dim=0)
RuntimeError: There were no tensor arguments to this function (e.g., you passed an empty list of Tensors), but no fallback function is registered for schema aten::_cat.  This usually means that this function requires a non-empty list of Tensors.  Available functions are [CPUTensorId, CUDATensorId, QuantizedCPUTensorId, VariableTensorId]

I created a new branch containing the json and .pkl files, but I don't have permission to push my branch:

remote: Permission to facebookresearch/DPR.git denied to aarzchan.
fatal: unable to access 'https://github.com/facebookresearch/DPR.git/': The requested URL returned error: 403 

Is there another way to send you my files?

aarzchan commented 4 years ago

Actually, I'll fork the DPR repo and upload my files there.

aarzchan commented 4 years ago

Here's my DPR fork, with the uploaded files in tmp_data: https://github.com/aarzchan/DPR

aarzchan commented 4 years ago

Hi @vlad-karpukhin, just wanted to follow up with an update. I tried running train_reader.py using the provided reader data for both NQ and TriviaQA. This was done on a fresh clone of the DPR repo (installed with the minimum requirements given in setup.py), using the same args provided in the README. For both datasets, I get the following error:

Traceback (most recent call last):
  File "train_reader.py", line 507, in <module>
    main()
  File "train_reader.py", line 498, in main
    trainer.run_train()
  File "train_reader.py", line 126, in run_train
    global_step = self._train_epoch(scheduler, epoch, eval_step, train_iterator, global_step)
  File "train_reader.py", line 225, in _train_epoch
    is_train=True, shuffle=True)
  File "/home/aarchan/qa-aug/DPR/dpr/models/reader.py", line 134, in create_reader_input
    is_random=shuffle)
  File "/home/aarchan/qa-aug/DPR/dpr/models/reader.py", line 205, in _create_question_passages_tensors
    positive_input_ids = _pad_to_len(positives[positive_idx].sequence_ids, pad_token_id, max_len)
  File "/home/aarchan/qa-aug/DPR/dpr/models/reader.py", line 162, in _pad_to_len
    s_len = seq.size(0)
TypeError: 'int' object is not callable

This is the exact command I used for NQ:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 train_reader.py \
  --seed 42 \
  --learning_rate 1e-5 \
  --eval_step 2000 \
  --do_lower_case \
  --eval_top_docs 50 \
  --encoder_model_type hf_bert \
  --pretrained_model_cfg bert-base-uncased \
  --train_file "../data/reader/nq/single/train*" \
  --dev_file ../data/reader/nq/single/dev.pkl \
  --warmup_steps 0 \
  --sequence_length 350 \
  --batch_size 16 \
  --passages_per_question 24 \
  --num_train_epochs 100000 \
  --dev_batch_size 72 \
  --passages_per_question_predict 50 \
  --output_dir ../save/baseline_dpr/nq/read_ckpts

This is the exact command I used for TriviaQA:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 train_reader.py \
  --seed 42 \
  --learning_rate 1e-5 \
  --eval_step 2000 \
  --do_lower_case \
  --eval_top_docs 50 \
  --encoder_model_type hf_bert \
  --pretrained_model_cfg bert-base-uncased \
  --train_file "../data/reader/trivia/multi-hybrid/train*" \
  --dev_file ../data/reader/trivia/multi-hybrid/dev.pkl \
  --warmup_steps 0 \
  --sequence_length 350 \
  --batch_size 16 \
  --passages_per_question 24 \
  --num_train_epochs 100000 \
  --dev_batch_size 72 \
  --passages_per_question_predict 50 \
  --output_dir ../save/baseline_dpr/trivia/read_ckpts

Please let me know if you spot something incorrect, or if there are any other details I should know. Thanks!

vlad-karpukhin commented 4 years ago

Hi Aaron,

That is the same error as before - positives[positive_idx].sequence_ids is supposed to be a tensor of tokenized BPE ids of (question + [SEP] + passage_title + [SEP] + passage_body), but somehow it is an int object. There may be some bug/feature in your pickle version or in how it interacts with numpy (sequence_ids are numpy arrays before serialization).
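If you want to inspect the stored types on your side, something like the following should show them (a sketch: the path is a placeholder, the attribute names follow the traceback above, and it needs to run from the repo root so pickle can resolve the reader classes):

import pickle

# placeholder path to one of the preprocessed reader .pkl files
with open("train.0.pkl", "rb") as f:
    samples = pickle.load(f)

for sample in samples[:5]:
    # sequence_ids should be a tensor (or a numpy array), never a plain int
    for passage in sample.positive_passages:
        ids = passage.sequence_ids
        print(type(ids), getattr(ids, "shape", ids))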

I will try to reproduce using your pkl file, but I'm on vacation this week and can only do that next week, sorry.

In the meantime, in order to narrow the scope, could you please try the provided data for the reader training (the resources starting with data.reader...)?

aarzchan commented 4 years ago

Hi Vlad,

No problem, enjoy your vacation!

Just to clarify, the results in my last message were obtained by using the provided reader data (i.e., data.reader.nq.single.* and data.reader.trivia.multi-hybrid.*). Thus, it seems like it's not an issue with my pkl file specifically. I can try checking my pickle version, etc., as you suggested.

By the way, which python and numpy versions are you using? My current env is python 3.6.10 and numpy 1.18.5.

aarzchan commented 4 years ago

Hi Vlad,

I did eventually figure out the issue (to some extent), but I just remembered to update you on this. For some reason, in reader._pad_to_len, the input seq is a tensor if I'm training with a single gpu, but it's a numpy array if I use multi-gpu distributed training (leading to the error). As I mentioned in my previous message, this error occurs even when I use the provided reader data.

I haven't had time to dive deeper into this issue, so, to get the code running, I just implemented the quick and dirty workaround below:

import torch
from torch import Tensor as T

def _pad_to_len(seq: T, pad_id: int, max_len: int):
    try:
        s_len = seq.size(0)  # seq is a torch.Tensor
    except TypeError:
        # under DDP, seq arrives as a numpy array, whose .size is an int
        # attribute rather than a method, so seq.size(0) raises TypeError
        s_len = len(seq)
        seq = torch.from_numpy(seq)
    if s_len > max_len:
        return seq[0: max_len]
    return torch.cat([seq, torch.Tensor().new_full((max_len - s_len,), pad_id, dtype=torch.long)], dim=0)

It'd be great if you could investigate why this is happening!

vlad-karpukhin commented 4 years ago

Hi Aaron, thank you for all your investigations.

I was able to reproduce your error. It turns out we never used the reader in distributed DDP mode, only in DataParallel mode. That was more than enough for our experiments, and we recommended using it that way in the instructions.

The reason for the error is a bit subtle, and I fixed it in this commit: https://github.com/facebookresearch/DPR/commit/d663f68b25b180243eeabda40b74f5d55c34b0a3

Please see the details there. You seem to be using 8 gpus, which should be enough to run the reader training in non-distributed mode.

aarzchan commented 4 years ago

Hi Vlad,

Thanks for your follow-up! Just wondering, besides debugging being easier for DataParallel, why is it recommended to use DataParallel instead of DDP? This article says:

DataParallel is usually slower than DistributedDataParallel even on a single machine due to GIL contention across threads, per-iteration replicated model, and additional overhead introduced by scattering inputs and gathering outputs.

vlad-karpukhin commented 4 years ago

There is not a lot of multi-thread syncing in this model, so the performance difference is negligible. Otherwise, there is no strong reason to recommend DP over DDP; we just used DP for the reader initially, and then there was no real reason to push for DDP. I think I tried it a couple of times and noticed that DDP somehow uses more gpu memory for our model and doesn't show a notable performance difference. You can use it, sure. I believe it should work properly now. Let me know if you face any other troubles with it.
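For reference, the two modes differ mainly in setup; a minimal sketch (the stand-in module and the commented-out local_rank handling are illustrative, not the actual reader code):

import torch.nn as nn

# a stand-in module; the actual reader model would be wrapped the same way
model = nn.Linear(768, 2)

# DataParallel: a single process drives all visible GPUs, replicating the
# model and scattering inputs on every iteration
dp_model = nn.DataParallel(model)

# DistributedDataParallel: one process per GPU; needs an initialized process
# group, e.g. when launched via `python -m torch.distributed.launch`:
# import torch.distributed as dist
# dist.init_process_group(backend="nccl")
# ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])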

aarzchan commented 4 years ago

Okay, I see. I can confirm that DDP uses slightly more gpu memory than DP, although the difference is not enough for me to increase the batch size when using DP. I'm currently using eight 12GB gpus, and, for both DP and DDP in reader training, I can only fit one instance per gpu - i.e., bsz=8 for DP, and bsz=1 per process for DDP.

I'll let you know if I experience any further issues. Thanks!

vlad-karpukhin commented 4 years ago

The batch size for the reader should not affect its final quality, contrary to the bi-encoder, where batch size is critical. What matters for the reader is the --passages_per_question parameter. As long as you set it to >=24, your results should be similar to ours. Learning rate adjustments will probably be needed if you change the batch size (a smaller batch size usually means one should decrease the learning rate as well).
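For instance, under a simple linear-scaling heuristic (an assumption for illustration, not an official DPR recommendation):

# linear learning-rate scaling heuristic; the reference values come from the
# README, the scaling rule itself is an assumption
base_lr = 1e-5     # recommended learning rate
base_batch = 16    # recommended batch size
my_batch = 8
scaled_lr = base_lr * my_batch / base_batch
print(scaled_lr)   # 5e-06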

aarzchan commented 4 years ago

Thanks for the tips!

For the bi-encoder, the recommended DDP batch size is 16 on eight gpus with gradient_accumulation_steps=1, which is an effective batch size of 128. However, I'm using batch size 8 with gradient_accumulation_steps=2 on eight gpus, since I can't fit batch size 16 on my 12GB gpus. Although the effective batch size here is still technically 128, I'm wondering if doing this would lead to different results, since the number of in-batch negatives in this case is 63 instead of 127.
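For concreteness, here is the arithmetic behind those numbers (assuming negatives are gathered across all GPUs within a single forward pass, and that gradient accumulation does not contribute extra negatives):

gpus = 8

# recommended setting: batch_size=16 per gpu in one step
negatives_recommended = gpus * 16 - 1   # 127

# my setting: batch_size=8 per gpu with gradient_accumulation_steps=2
negatives_mine = gpus * 8 - 1           # 63
effective_batch = gpus * 8 * 2          # still 128
print(negatives_recommended, negatives_mine, effective_batch)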

For the reader, I was previously using passages_per_question=20 with DDP batch size 1 (or DP batch size 8), and training was already somewhat slow using this setting (~1 hr per epoch). However, after increasing passages_per_question to 24, I'm no longer able to train at all, since I can't even fit a single data instance on my 12GB gpu. That is, I get out-of-memory errors for both DDP batch size >= 1 and DP batch size >= 1. The highest passages_per_question I can set without getting out-of-memory errors is 22, for which I use DP batch size 8 and gradient_accumulation_steps=2, but now the training is slower than before (~2 hrs per epoch).

Would appreciate any advice regarding these issues.

vlad-karpukhin commented 4 years ago

1. It would be interesting to know how this affects the performance in the end; we didn't experiment with gradient_accumulation_steps != 1.
2. Yes, reader training is slow compared to the retriever and usually takes about 2x more time. It looks like your only option is to use passages_per_question=22 and batch_size=8 and just wait for 2h x 18 epochs = 36 hours. I think we also waited about 2 days for each reader training. Another option is to use more than one gpu server, but then you will need to figure out how to run pytorch correctly so that it knows how to sync the 2 physical servers.

aarzchan commented 4 years ago

Got it, thanks for your advice!

vlad-karpukhin commented 4 years ago

I believe the issue that started this topic is fixed, so this can be closed now.