facebookresearch / multihop_dense_retrieval

Multi-hop dense retrieval for question answering

A few issues #13

Open ChiaraMC opened 3 years ago

ChiaraMC commented 3 years ago

Hello!

Thanks for sharing your work. While running the model, I found a few issues:

  1. If I understand correctly, I should first run the retriever with eval_mhop_retrieval.py, then process the output file it produces with add_sp_label.sh, and finally run that through train_qa.py (with the do_predict flag). However, add_sp_label.sh calls mhop_utils.py, which requires the file title2sents.txt, and that file seems to be missing. I was able to write my own script to process the retriever output, but figured I'd flag this with you.
  2. In mdr/retrieval/data/mhop_dataset.py, on line 32 it looks like a pdb breakpoint was accidentally left in; I think it should be removed. On line 34 there's also a "TODO: remove for final release" comment, so there might be some other code that needs to be removed :)

        if train:
            import pdb; pdb.set_trace()

            # debug TODO: remove for final release
            for idx in range(len(self.data)):
                self.data[idx]["neg_paras"] = self.data[idx]["tfidf_neg"]

            self.data = [_ for _ in self.data if len(_["neg_paras"]) >= 2]
        print(f"Total sample count {len(self.data)}")
  3. I was trying to run the QA evaluation through train_qa.py but was getting scores close to 0. I realised that these lines (330-331) could be wrong:
    ems.append(exact_match_score(top_pred, id2gold[qid][0]))
    f1, prec, recall = f1_score(top_pred, id2gold[qid][0])

It looks like id2gold[qid][0] is taking only the first character of the answer. What worked for me was replacing these with:

    ems.append(exact_match_score(top_pred, id2gold[qid]))
    f1, prec, recall = f1_score(top_pred, id2gold[qid])
  4. There's an issue with running eval_mhop_retrieval.py with only 1 GPU, as the line
    index = faiss.index_cpu_to_gpu(res, 6, index)

    requires at least 7 GPUs, I believe. I had to change it to:

    index = faiss.index_cpu_to_gpu(res, 0, index)

    It might be nice if the script automatically picked a valid GPU index based on how many are available; see the sketch after this list.
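
For what it's worth, a minimal sketch of such a fallback, assuming the GPU build of faiss is installed (get_num_gpus, StandardGpuResources and index_cpu_to_gpu are standard faiss calls; the res/index names just mirror the line above):

    import faiss

    # Use GPU 0 if any GPU is visible, otherwise leave the index on CPU.
    # Device 0 always exists when get_num_gpus() > 0, unlike the hard-coded 6.
    if faiss.get_num_gpus() > 0:
        res = faiss.StandardGpuResources()
        index = faiss.index_cpu_to_gpu(res, 0, index)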

Thanks!

xwhan commented 2 years ago

Hi @ChiaraMC, thank you for spotting the bugs. Can you send a pull request?

yangky11 commented 2 years ago

Hi,

Any update on this issue? I'm also running into issue 2, but I'm not sure what code to remove. Thanks!

ChiaraMC commented 2 years ago

@yangky11 I think you can just go ahead and remove these lines (that's what I did and I've had no issues):

            import pdb; pdb.set_trace()

            # debug TODO: remove for final release
            for idx in range(len(self.data)):
                self.data[idx]["neg_paras"] = self.data[idx]["tfidf_neg"]
yangky11 commented 2 years ago

@ChiaraMC Hi Chiara,

Thanks for your suggestion! I tried removing these 3 lines but got the following error. It looks like neg_paras still needs to be set somewhere.

08/31/2021 19:09:29 - INFO - __main__ - Namespace(accumulate_gradients=1, adam_epsilon=1e-08, do_predict=False, do_train=True, eval_period=-1, fp16=True, fp16_opt_level='O1', gradient_accumulation_steps=1, init_checkpoint='', init_retriever='', iterations_per_loop=1000, k=38400, learning_rate=2e-05, local_rank=-1, m=0.999, max_c_len=300, max_grad_norm=2.0, max_q_len=70, max_q_sp_len=350, model_name='roberta-base', momentum=False, multi_vector=1, no_cuda=False, nq_multi=False, num_train_epochs=50, num_workers=30, output_dir='./logs/08-31-2021/test_run-seed16-bsz100-fp16True-lr2e-05-decay0.0-warm0.1-valbsz3000-sharedTrue-multi1-schemenone', predict_batch_size=3000, predict_file='/u/kaiyuy/pvl-mathqa/multihop_dense_retrieval/data/hotpot/hotpot_dev_with_neg_v0.json', prefix='test_run', rnn_retriever=False, save_checkpoints_steps=20000, scheme='none', seed=16, sent_level=False, shared_encoder=True, stop_drop=0, temperature=1, train_batch_size=100, train_file='/u/kaiyuy/pvl-mathqa/multihop_dense_retrieval/data/hotpot/hotpot_train_with_neg_v0.json', use_adam=False, warmup_ratio=0.1, weight_decay=0.0)
08/31/2021 19:09:30 - INFO - __main__ - device cuda n_gpu 8 distributed training False
08/31/2021 19:09:30 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json from cache at /u/kaiyuy/.cache/torch/transformers/e1a2a406b5a05063c31f4dfdee7608986ba7c6393f7f79db5e69dcd197208534.117c81977c5979de8c088352e74ec6e70f5c66096c28b61d3c50101609b39690
08/31/2021 19:09:30 - INFO - transformers.configuration_utils - Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 50265
}

08/31/2021 19:09:30 - INFO - transformers.modeling_utils - loading weights file https://cdn.huggingface.co/roberta-base-pytorch_model.bin from cache at /u/kaiyuy/.cache/torch/transformers/80b4a484eddeb259bec2f06a6f2f05d90934111628e0e1c09a33bd4a121358e1.49b88ba7ec2c26a7558dda98ca3884c3b80fa31cf43a1b1f23aef3ff81ba344e
08/31/2021 19:09:33 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json from cache at /u/kaiyuy/.cache/torch/transformers/d0c5776499adc1ded22493fae699da0971c1ee4c2587111707a4d177d20257a2.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b
08/31/2021 19:09:33 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt from cache at /u/kaiyuy/.cache/torch/transformers/b35e7cd126cd4229a746b5d5c29a749e8e84438b14bcdb575950584fe33207e8.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 30 worker processes in total. Our suggested max number of worker in current system is 16, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
08/31/2021 19:09:34 - INFO - __main__ - Num of dev batches: 3
08/31/2021 19:09:44 - INFO - __main__ - Start training....
output directory ./logs/08-31-2021/test_run-seed16-bsz100-fp16True-lr2e-05-decay0.0-warm0.1-valbsz3000-sharedTrue-multi1-schemenone already exists and is not empty.
Loading data from /u/kaiyuy/pvl-mathqa/multihop_dense_retrieval/data/hotpot/hotpot_dev_with_neg_v0.json
Total sample count 7405
number of trainable parameters: 125237760
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
Loading data from /u/kaiyuy/pvl-mathqa/multihop_dense_retrieval/data/hotpot/hotpot_train_with_neg_v0.json
Total sample count 90447
  0%|          | 0/905 [00:00<?, ?it/s]scripts/train_mhop.py:184: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
  amp.master_params(optimizer), args.max_grad_norm)
/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:134: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
  7%|▋         | 60/905 [03:36<50:42,  3.60s/it]  
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 128.0
Traceback (most recent call last):
  File "scripts/train_mhop.py", line 254, in <module>
    main()
  File "scripts/train_mhop.py", line 167, in main
    for batch in tqdm(train_dataloader):
  File "/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/tqdm/std.py", line 1185, in __iter__
    for obj in iterable:
  File "/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1183, in _next_data
    return self._process_data(data)
  File "/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/u/kaiyuy/projects/miniconda3/envs/MDR/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/u/kaiyuy/pvl-mathqa/multihop_dense_retrieval/mdr/retrieval/data/mhop_dataset.py", line 64, in __getitem__
    neg_codes_2 = self.encode_para(sample["neg_paras"][1], self.max_c_len)
IndexError: list index out of range
ChiaraMC commented 2 years ago

Hm, from the error it looks like there are some samples that have fewer than two negative paragraphs. Did you maybe also accidentally delete these two lines when you deleted the pdb lines?

        if train:
            self.data = [_ for _ in self.data if len(_["neg_paras"]) >= 2]

If you check the hotpot_train_with_neg_v0.json file, there are two samples with only one negative paragraph, so these have to be filtered out by the snippet above.
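
In case it helps anyone else, this is roughly what the cleaned-up block in mhop_dataset.py could look like once only the debug lines are removed (my reading of the fix, not an official patch): drop the pdb call and the tfidf_neg override, and keep the filter and the print.

        if train:
            # Keep this filter: a couple of training samples have fewer than
            # two negative paragraphs, and __getitem__ indexes neg_paras[1].
            self.data = [_ for _ in self.data if len(_["neg_paras"]) >= 2]
        print(f"Total sample count {len(self.data)}")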

yangky11 commented 2 years ago

Yes, exactly! I had removed the entire if train: block. It looks like it's working now. Thank you!

ChiaraMC commented 2 years ago

Great, no worries!