Open ChiaraMC opened 3 years ago
Hi @ChiaraMC, thank you for spotting the bugs. Can you send a pull request?
Hi,
Any update on this issue? I'm also having issue 2 but not sure what code to remove. Thanks!
@yangky11 I think you can just go ahead and remove these lines (that's what I did and have had no issues)
import pdb; pdb.set_trace()
# debug TODO: remove for final release
for idx in range(len(self.data)):
    self.data[idx]["neg_paras"] = self.data[idx]["tfidf_neg"]
@ChiaraMC Hi Chiara,
Thanks for your suggestion! I tried removing these 3 lines but got the following error. It looks like neg_paras should be set somewhere.
08/31/2021 19:09:29 - INFO - __main__ - Namespace(accumulate_gradients=1, adam_epsilon=1e-08, do_predict=False, do_train=True, eval_period=-1, fp16=True, fp16_opt_level='O1', gradient_accumulation_steps=1, init_checkpoint='', init_retriever='', iterations_per_loop=1000, k=38400, learning_rate=2e-05, local_rank=-1, m=0.999, max_c_len=300, max_grad_norm=2.0, max_q_len=70, max_q_sp_len=350, model_name='roberta-base', momentum=False, multi_vector=1, no_cuda=False, nq_multi=False, num_train_epochs=50, num_workers=30, output_dir='./logs/08-31-2021/test_run-seed16-bsz100-fp16True-lr2e-05-decay0.0-warm0.1-valbsz3000-sharedTrue-multi1-schemenone', predict_batch_size=3000, predict_file='/u/kaiyuy/pvl-mathqa/multihop_dense_retrieval/data/hotpot/hotpot_dev_with_neg_v0.json', prefix='test_run', rnn_retriever=False, save_checkpoints_steps=20000, scheme='none', seed=16, sent_level=False, shared_encoder=True, stop_drop=0, temperature=1, train_batch_size=100, train_file='/u/kaiyuy/pvl-mathqa/multihop_dense_retrieval/data/hotpot/hotpot_train_with_neg_v0.json', use_adam=False, warmup_ratio=0.1, weight_decay=0.0)
08/31/2021 19:09:30 - INFO - __main__ - device cuda n_gpu 8 distributed training False
08/31/2021 19:09:30 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json from cache at /u/kaiyuy/.cache/torch/transformers/e1a2a406b5a05063c31f4dfdee7608986ba7c6393f7f79db5e69dcd197208534.117c81977c5979de8c088352e74ec6e70f5c66096c28b61d3c50101609b39690
08/31/2021 19:09:30 - INFO - transformers.configuration_utils - Model config RobertaConfig {
"architectures": [
"RobertaForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "roberta",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 1,
"type_vocab_size": 1,
"vocab_size": 50265
}
08/31/2021 19:09:30 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json from cache at /u/kaiyuy/.cache/torch/transformers/e1a2a406b5a05063c31f4dfdee7608986ba7c6393f7f79db5e69dcd197208534.117c81977c5979de8c088352e74ec6e70f5c66096c28b61d3c50101609b39690
08/31/2021 19:09:30 - INFO - transformers.configuration_utils - Model config RobertaConfig {
"architectures": [
"RobertaForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "roberta",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 1,
"type_vocab_size": 1,
"vocab_size": 50265
}
08/31/2021 19:09:30 - INFO - transformers.modeling_utils - loading weights file https://cdn.huggingface.co/roberta-base-pytorch_model.bin from cache at /u/kaiyuy/.cache/torch/transformers/80b4a484eddeb259bec2f06a6f2f05d90934111628e0e1c09a33bd4a121358e1.49b88ba7ec2c26a7558dda98ca3884c3b80fa31cf43a1b1f23aef3ff81ba344e
08/31/2021 19:09:33 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json from cache at /u/kaiyuy/.cache/torch/transformers/e1a2a406b5a05063c31f4dfdee7608986ba7c6393f7f79db5e69dcd197208534.117c81977c5979de8c088352e74ec6e70f5c66096c28b61d3c50101609b39690
08/31/2021 19:09:33 - INFO - transformers.configuration_utils - Model config RobertaConfig {
"architectures": [
"RobertaForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "roberta",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 1,
"type_vocab_size": 1,
"vocab_size": 50265
}
08/31/2021 19:09:33 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json from cache at /u/kaiyuy/.cache/torch/transformers/d0c5776499adc1ded22493fae699da0971c1ee4c2587111707a4d177d20257a2.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b
08/31/2021 19:09:33 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt from cache at /u/kaiyuy/.cache/torch/transformers/b35e7cd126cd4229a746b5d5c29a749e8e84438b14bcdb575950584fe33207e8.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 30 worker processes in total. Our suggested max number of worker in current system is 16, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
08/31/2021 19:09:34 - INFO - __main__ - Num of dev batches: 3
08/31/2021 19:09:44 - INFO - __main__ - Start training....
output directory ./logs/08-31-2021/test_run-seed16-bsz100-fp16True-lr2e-05-decay0.0-warm0.1-valbsz3000-sharedTrue-multi1-schemenone already exists and is not empty.
Loading data from /u/kaiyuy/pvl-mathqa/multihop_dense_retrieval/data/hotpot/hotpot_dev_with_neg_v0.json
Total sample count 7405
number of trainable parameters: 125237760
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
Loading data from /u/kaiyuy/pvl-mathqa/multihop_dense_retrieval/data/hotpot/hotpot_train_with_neg_v0.json
Total sample count 90447
0%| | 0/905 [00:00<?, ?it/s]scripts/train_mhop.py:184: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
amp.master_params(optimizer), args.max_grad_norm)
/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:134: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
7%|▋ | 60/905 [03:36<50:42, 3.60s/it]
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 128.0
Traceback (most recent call last):
File "scripts/train_mhop.py", line 254, in <module>
main()
File "scripts/train_mhop.py", line 167, in main
for batch in tqdm(train_dataloader):
File "/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/tqdm/std.py", line 1185, in __iter__
for obj in iterable:
File "/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1183, in _next_data
return self._process_data(data)
File "/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/u/kaiyuy/pvl-mathqa/multihop_dense_retrieval/mdr/retrieval/data/mhop_dataset.py", line 64, in __getitem__
neg_codes_2 = self.encode_para(sample["neg_paras"][1], self.max_c_len)
IndexError: list index out of range
Hm, from the error it looks like some samples have fewer than two negative paragraphs. Did you maybe also accidentally delete these two lines when you deleted the pdb lines?
if train:
    self.data = [_ for _ in self.data if len(_["neg_paras"]) >= 2]
If you check the hotpot_train_with_neg_v0.json file, there are two samples that have only 1 negative paragraph, so these have to be removed through the snippet above.
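A quick way to confirm this is to count the samples with fewer than two negatives before training. The sketch below is only an illustration: it assumes the training file holds one JSON object per line and that each sample carries a neg_paras list (as the traceback suggests). The helper names load_samples and split_by_negatives are made up for this example and are not part of the repository.

```python
import json


def load_samples(path):
    """Read one JSON object per line (assumed file layout, not confirmed in this thread)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def split_by_negatives(samples, min_neg=2):
    """Separate samples with at least min_neg negatives from those with fewer."""
    kept, dropped = [], []
    for sample in samples:
        # A missing "neg_paras" key is treated as zero negatives.
        if len(sample.get("neg_paras", [])) >= min_neg:
            kept.append(sample)
        else:
            dropped.append(sample)
    return kept, dropped


if __name__ == "__main__":
    # Toy data standing in for the real file; swap in load_samples(your_path).
    toy = [
        {"question": "q1", "neg_paras": [{"title": "a"}, {"title": "b"}]},
        {"question": "q2", "neg_paras": [{"title": "c"}]},
    ]
    kept, dropped = split_by_negatives(toy)
    print(f"kept {len(kept)}, dropped {len(dropped)}")
```

Anything that lands in dropped would hit the sample["neg_paras"][1] lookup in __getitem__ and raise the IndexError shown in the log above.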
Yes, exactly! I removed the entire if train: block. Looks like it's working now. Thank you!
Great, no worries!
Hello!
Thanks for sharing your work. While running the model I found a few issues:
1. [...] eval_mhop_retrieval.py, then process the output file it produces using add_sp_label.sh, and finally run that through train_qa.py (with the do_predict flag). However, add_sp_label.sh calls mhop_utils.py, which requires the file title2sents.txt, which seems to be missing. I was able to write my own script to process the retriever output, but figured I'd flag this with you.
2. In mdr/retrieval/data/mhop_dataset.py on line 32 it looks like pdb was accidentally left activated; I think it should be removed. On line 34 there's also a "TODO: remove for final release" comment, so there might be some other code that needs to be removed :)
3. [...] train_qa.py but I was getting score values that were close to 0. I realised that these lines (330-331) could be wrong: [...] It looks like id2gold[qid][0] is taking only the first character of the answer. What worked for me is replacing these with: [...]
4. [...] eval_mhop_retrieval.py with only 1 GPU, as the line [...] requires at least 7 I believe. I had to change it to: [...]
It might be nice if the script automatically used the right gpu depending on how many you have.
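The first-character problem reported for train_qa.py can be reproduced in isolation. The lines in question are not shown in this thread, so the following is only a sketch under an assumption: that id2gold maps a question id either to a single answer string or to a list of answer strings. The helper first_gold and the toy data are hypothetical.

```python
def first_gold(gold):
    """Return one gold answer whether gold is a plain string or a list of strings."""
    if isinstance(gold, str):
        return gold
    return gold[0]


# Hypothetical mapping from question id to gold answer(s).
id2gold = {"q1": "Paris", "q2": ["Berlin", "Berlin, Germany"]}

print(id2gold["q1"][0])           # indexing a string yields its first character: P
print(first_gold(id2gold["q1"]))  # Paris
print(first_gold(id2gold["q2"]))  # Berlin
```

If the gold answer is stored as a plain string, comparing predictions against its first character would explain scores close to 0.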
Thanks!
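For the single-GPU problem with eval_mhop_retrieval.py, one possible approach is to clamp the requested GPU index to the number of devices that are actually visible. This is a minimal sketch, not the script's real code: pick_gpu_index and device_string are hypothetical helpers, and the preferred index of 6 is only a guess based on the "requires at least 7" remark. torch.cuda.device_count() is used solely to query how many GPUs exist.

```python
def pick_gpu_index(preferred, available):
    """Return a valid GPU index, or None when no GPU is available.

    preferred: index the script would like to use (hypothetical setting).
    available: number of visible GPUs, e.g. from torch.cuda.device_count().
    """
    if available <= 0:
        return None
    # Clamp into the valid range [0, available - 1].
    return max(0, min(preferred, available - 1))


def device_string(preferred, available):
    """Build a device name such as "cuda:0", falling back to "cpu"."""
    index = pick_gpu_index(preferred, available)
    return "cpu" if index is None else f"cuda:{index}"


if __name__ == "__main__":
    try:
        import torch  # optional; only used to query the GPU count

        available = torch.cuda.device_count()
    except ImportError:
        available = 0
    print(device_string(6, available))
```

On a machine with one GPU this selects cuda:0, while a machine with eight GPUs keeps the preferred index.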