NCCL libary runtime error

Hi,

I tried to run your code by following the readme instructions.

When I tried the main experiment [ColBERT Retrieval], and run the script:

python src/main.py configs/nq_tables/colbert.jsonnet --accelerator gpu --devices 2 --strategy ddp --num_sanity_val_steps 2 --experiment_name ColBERT_NQTables_bz4_negative4_fix_doclen_full_search_NewcrossGPU --mode train --override --opts train.batch_size=6 train.scheduler=None train.epochs=1000 train.lr=0.00001 train.additional.gradient_accumulation_steps=4 train.additional.warmup_steps=0 train.additional.early_stop_patience=10 train.additional.save_top_k=3 valid.batch_size=32 test.batch_size=32 valid.step_size=200 data_loader.dummy_dataloader=0 reset=1 model_config.num_negative_samples=4 model_config.bm25_top_k=5 model_config.bm25_ratio=0 model_config.nbits=2

The only difference with your script is that I change the device number to 2. I have two A6000 GPUs.

However, I got an runtime error:

Traceback (most recent call last): File "/home/wan167/work/table_NLP/robust-tableqa/env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/home/wan167/work/table_NLP/robust-tableqa/env/lib/python3.8/multiprocessing/process.py", line 108, in run self._target(*self._args, *self._kwargs) File "/home/wan167/work/table_NLP/robust-tableqa/src/ColBERT/colbert/infra/launcher.py", line 115, in setup_new_process return_val = callee(config, args) File "/home/wan167/work/table_NLP/robust-tableqa/src/ColBERT/colbert/indexing/collection_indexer.py", line 33, in encode encoder.run(shared_lists) File "/home/wan167/work/table_NLP/robust-tableqa/src/ColBERT/colbert/indexing/collection_indexer.py", line 58, in run self.setup() File "/home/wan167/work/table_NLP/robust-tableqa/src/ColBERT/colbert/indexing/collection_indexer.py", line 88, in setup avg_doclen_est = self._sample_embeddings(sampled_pids) File "/home/wan167/work/table_NLP/robust-tableqa/src/ColBERT/colbert/indexing/collection_indexer.py", line 126, in _sample_embeddings torch.distributed.all_reduce(self.num_sample_embs) File "/home/wan167/work/table_NLP/robust-tableqa/env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1285, in all_reduce work = default_pg.allreduce([tensor], opts) RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180588308/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3 ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

amazon-science / robust-tableqa

NCCL libary runtime error #4