DevSinghSachan / art

Code and models for the paper "Questions Are All You Need to Train a Dense Passage Retriever" (TACL 2023)

Impact of Updating Evidence Embeddings #10

Open jinzhuoran opened 1 year ago

jinzhuoran commented 1 year ago

Hi, thanks for your excellent work. I'm running into a problem when rebuilding the evidence embeddings during the training stage:

    Batch 39000 | Total 19968000
    Batch 40000 | Total 20480000
    Batch 41000 | Total 20992000
    [E ProcessGroupNCCL.cpp:719] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1007, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801009 milliseconds before timing out.
    [E ProcessGroupNCCL.cpp:719] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1007, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801846 milliseconds before timing out.
    [E ProcessGroupNCCL.cpp:719] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1007, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800915 milliseconds before timing out.
    [E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
    [E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
    [E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
    terminate called after throwing an instance of 'std::runtime_error'
    terminate called after throwing an instance of 'std::runtime_error'
      what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1007, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801009 milliseconds before timing out.
    terminate called after throwing an instance of 'std::runtime_error'

This looks like a timeout caused by building the index. How can I solve this problem? And what is the performance impact if I don't rebuild the evidence embeddings during training?

DevSinghSachan commented 1 year ago

Hi! Thanks for reaching out.

Can you share how frequently the timeout occurs? Does it happen every time the document embeddings are computed, or only once in a while? If the timeout is a persistent issue, see if you can re-install the CUDA package along with the drivers and try again.

You can also increase the number of steps between embedding refreshes, say from 500 to 1000 or 1500; it won't affect training much. In general, fresh embeddings mainly matter when you evaluate the top-k accuracy at a given interval.
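If the index rebuild genuinely needs more than 30 minutes, another option is to raise the NCCL watchdog timeout when the process group is created. This is a generic PyTorch-level workaround rather than anything specific to this repository, and where exactly `init_process_group` is called in the ART code is an assumption here:

```python
from datetime import timedelta

import torch.distributed as dist

# Minimal sketch, assuming the crash comes from the 30-minute NCCL watchdog
# default (Timeout(ms)=1800000 in the log above): pass a larger timeout when
# the process group is initialized. Where init_process_group is called in
# this codebase is an assumption on my part.
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(hours=2),  # default is timedelta(minutes=30)
)
```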

jinzhuoran commented 1 year ago

Thanks for your reply! The following problem seems to be caused by insufficient memory. How much memory does your machine have?

Traceback (most recent call last):
  File "tasks/run.py", line 45, in <module>

  File "/home/zhuoran/code/art/tasks/dense_retriever/zero_shot_training/run.py", line 60, in main
    zero_shot_retriever(dataset_cls)
  File "/home/zhuoran/code/art/tasks/dense_retriever/zero_shot_training/run.py", line 45, in zero_shot_retriever
    train(train_dataset_provider, model_provider)
  File "/home/zhuoran/code/art/tasks/dense_retriever/zero_shot_training/train.py", line 336, in train
    train_dataloader)
  File "/home/zhuoran/code/art/tasks/dense_retriever/zero_shot_training/train.py", line 250, in _train
    call_evidence_index_builder()
  File "/home/zhuoran/code/art/tasks/dense_retriever/zero_shot_training/train.py", line 199, in call_evidence_index_builder
    index_builder.build_and_save_index()
  File "/home/zhuoran/code/art/megatron/indexer.py", line 162, in build_and_save_index
    self.evidence_embedder_obj.clear()
  File "/home/zhuoran/code/art/megatron/data/art_index.py", line 96, in merge_shards_and_save
    pickle.dump(self.state(), final_file)
MemoryError

This step seems to just save new Wikipedia embeddings. Can I remove it?

DevSinghSachan commented 1 year ago

My compute instances actually had fairly large resources: 2-5 TB of disk storage, 500 GB to 1.3 TB of CPU memory, and GPU memory of either 40/80 GB (A100) or 48 GB (A6000).

If disk storage is the issue (which this error suggests), you can try cleaning up past checkpoints; I think the code's default setting saves the retriever state every 500 steps.
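If it helps, here is a small sketch of that cleanup. The Megatron-style `iter_*` directory layout is my assumption, not something taken from this repo, so adjust the glob to whatever your checkpoint folder actually contains:

```python
import shutil
from pathlib import Path

def prune_checkpoints(ckpt_dir: str, keep_last: int = 2) -> None:
    """Delete all but the newest `keep_last` checkpoint directories.

    Assumes Megatron-style subdirectories named iter_0000500/, iter_0001000/,
    ... under `ckpt_dir` (an assumption about the layout, not verified).
    """
    ckpts = sorted(Path(ckpt_dir).glob("iter_*"))
    for old in ckpts[:-keep_last]:
        print(f"removing old checkpoint: {old}")
        shutil.rmtree(old)

# e.g. prune_checkpoints("./checkpoints/art-retriever", keep_last=2)
```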

This step is important: every worker saves its shard of embeddings to disk, after which the master worker merges all the shards into a single file. The new file is then read back and sharded across all the GPUs.
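Roughly, the cycle looks like the sketch below. This is a simplified stand-in for the logic in `megatron/indexer.py` and `art_index.py`, not the actual code; the file names and pickle format are illustrative, and it assumes `torch.distributed` is already initialized.

```python
import pickle

import torch
import torch.distributed as dist

def save_merge_reload(shard_embeddings: torch.Tensor, prefix: str) -> torch.Tensor:
    """Simplified sketch of the save -> merge -> reload cycle described above."""
    rank, world = dist.get_rank(), dist.get_world_size()

    # 1. Every worker saves its own shard of evidence embeddings to disk.
    with open(f"{prefix}_shard{rank}.pkl", "wb") as f:
        pickle.dump(shard_embeddings.cpu(), f)
    dist.barrier()

    # 2. The master worker merges all shards into a single file. The whole
    #    index has to fit in the master's CPU memory at once, which is where
    #    the MemoryError in the traceback above is raised.
    if rank == 0:
        shards = []
        for r in range(world):
            with open(f"{prefix}_shard{r}.pkl", "rb") as f:
                shards.append(pickle.load(f))
        with open(f"{prefix}_merged.pkl", "wb") as f:
            pickle.dump(torch.cat(shards, dim=0), f)
    dist.barrier()

    # 3. Every worker reads the merged file back and keeps only its slice.
    with open(f"{prefix}_merged.pkl", "rb") as f:
        merged = pickle.load(f)
    return merged.chunk(world, dim=0)[rank]
```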

You can work around this, though: each GPU could just refresh its own shard of document embeddings in memory, skipping the embedding-saving step entirely. This isn't implemented by default, so you would have to add it yourself.
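A rough sketch of that idea, with `evidence_encoder` and `my_shard_batches` as hypothetical placeholders for this repo's evidence encoder and the passage batches assigned to the current rank:

```python
import torch

@torch.no_grad()
def refresh_local_shard(evidence_encoder, my_shard_batches, device) -> torch.Tensor:
    """Refresh only this GPU's shard of evidence embeddings in memory,
    skipping the save / merge / reload cycle entirely.

    `evidence_encoder` and `my_shard_batches` are placeholders (assumptions),
    not names from this repository.
    """
    evidence_encoder.eval()
    # Re-encode only the passages assigned to this rank; the fresh shard
    # simply replaces the old one in GPU memory, so no shard files ever
    # need to be written or merged on a single node.
    shard = [evidence_encoder(batch.to(device)) for batch in my_shard_batches]
    return torch.cat(shard, dim=0)
```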