MSMARCO python index error when reranking

nimasadri11 commented 3 years ago

I am trying to replicate the MSMARCO results using the feature/msmarco_psg branch. However, I get the below error. It seems to be related to #52. Has anyone else encountered this?

validation: 22110it [5:06:29,  1.21it/s]
validation: 22110it [5:06:31,  1.20it/s]
Traceback (most recent call last):
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.7.9/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/python/3.7.9/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/scratch/nsadri/capreolus/capreolus/run.py", line 108, in <module>
    task_entry_function()
  File "/scratch/nsadri/capreolus/capreolus/task/rerank.py", line 48, in train
    return self.rerank_run(best_search_run, self.get_results_path())
  File "/scratch/nsadri/capreolus/capreolus/task/rerank.py", line 95, in rerank_run
    self.benchmark.relevance_level,
  File "/scratch/nsadri/capreolus/capreolus/trainer/tensorflow.py", line 294, in train
    trec_preds = self.get_preds_in_trec_format(dev_predictions, dev_data)
  File "/scratch/nsadri/capreolus/capreolus/trainer/tensorflow.py", line 517, in get_preds_in_trec_format
    pred_dict[qid][docid] = predictions[i].numpy().astype(np.float16).item()
IndexError: list index out of range
0%|          | 0/1875 [5:09:52<?, ?it/s]

crystina-z commented 3 years ago

@nimasadri11 This error seems to indicate the number of generated predictions entries (qid, docid pairs) is less than expected. I wonder if the .tfrecord preparation on dev set was somehow interrupted?

nimasadri11 commented 3 years ago

@crystina-z gotcha! Thank you! I'm testing again, and hopefully close this issue in the next 36 hours if I don't face the same issue again : )

nimasadri11 commented 3 years ago

@crystina-z I was training this and it got stuck on .tfrecord number 42. The process was still running but wasn't generating any logs. Also, isn't MSMARCO supposed to take max 20 hours with these resources and parameters? Mine took over 36 hours and didn't finish.

#SBATCH --nodes=1
#SBATCH --gres=gpu:v100l:4
#SBATCH --ntasks-per-node=1
#SBATCH --mem=48GB
#SBATCH --cpus-per-task=8
export CUDA_AVAILABLE_DEVICES=0,1,2,3

configs from config_msmarco.txt and

lr=1e-3
bertlr=2e-5
itersize=30000
#itersize=30
warmupsteps=3000
#warmupsteps=3
decaystep=$itersize  # either same with $itersize or 0
decaytype=linear

crystina-z commented 3 years ago

@nimasadri11 if the .tfrecord is named using number 1 to 300+, it should be .tfrecord for dev set and training was not really started. As for the reason it get stuck, any possibly that the disk if full and no more writing could be added to the log file? this would possibly happen if CAPREOLUS_CACHE and CAPREOLUS_RESULTS r unset so all files r stored under ~/.capreolus.

Tho I found it on CC the disk access speed is somewhat slow-ish. If u wanna quickly check if any settings goes wrong, then we can rerank only the top100 rather than top1k documents, by adding threshold=100. This would reduced a lot data preparation and inference time. And for additional speed up, could try to turn mix precision on by capreolus.reranker.trainer.amp=True.

nimasadri11 commented 3 years ago

Awesome! I will give this a try and close this issue. Thank you @crystina-z

capreolus-ir / capreolus

MSMARCO python index error when reranking #170