IBM / multidoc2dial

MultiDoc2Dial: Modeling Dialogues Grounded in Multiple Documents
Apache License 2.0

Question about using multiple gpus #11

Open YunahJang opened 2 years ago

YunahJang commented 2 years ago

Hi! I'm having trouble using multiple GPUs with the run_finetune_rag_dialdoc.sh script.

I set the --gpus parameter to 4, but I keep getting the following error:

ValueError: ProcessGroupGloo::scatter: invalid tensor type at index 0 (expected TensorOptions(dtype=double, device=cpu, layout=Strided, requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)), got TensorOptions(dtype=float, device=cpu, layout=Strided, requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
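As far as I can tell from the message, the tensors handed to the scatter call did not all share one dtype (double in one place, float in another). Below is a minimal sketch of a check that could be run just before the call to see which entries disagree. `find_dtype_mismatches` is a hypothetical helper, not part of the repo or of PyTorch, and it only assumes each object exposes a `.dtype` attribute, as PyTorch tensors do.

```python
from types import SimpleNamespace


def find_dtype_mismatches(target, scatter_list):
    """Return (index, dtype) for each scatter_list entry whose dtype differs from target's.

    Hypothetical helper: relies only on a `.dtype` attribute being present.
    """
    return [
        (i, chunk.dtype)
        for i, chunk in enumerate(scatter_list)
        if chunk.dtype != target.dtype
    ]


# Stand-ins that expose only `.dtype`, so the sketch runs without PyTorch installed.
target = SimpleNamespace(dtype="float64")
chunks = [SimpleNamespace(dtype="float32"), SimpleNamespace(dtype="float64")]

mismatches = find_dtype_mismatches(target, chunks)
print(mismatches)  # [(0, 'float32')]
```

With real tensors, an empty result would mean the target tensor and every chunk already agree on dtype.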

So I modified line 159 of dialdoc/models/rag/distributed_pytorch_retriever.py to drop the target_type argument:

`retrieved_doc_embeds = self._scattered(scatter_vectors, [n_queries, n_docs, combined_hidden_states.shape[1]])`

After this change I get the error below, and I can't figure out why.

File "/home/yunah/multidoc2dial_ours/dialdoc/models/rag/distributed_pytorch_retriever.py", line 157, in retrieve
    doc_ids = self._scattered(scatter_ids, [n_queries, n_docs], target_type=torch.int64)
File "/home/yunah/multidoc2dial_ours/dialdoc/models/rag/distributed_pytorch_retriever.py", line 82, in _scattered
    dist.scatter(target_tensor, src=0, scatter_list=scatter_list, group=self.process_group)
File "/home/yunah/.conda/envs/multidoc2dial/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2191, in scatter
    work = group.scatter(output_tensors, input_tensors, opts)
ValueError: ProcessGroupGloo::scatter: Incorrect input list size 1. Input list size should be 2, same as size of the process group.
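Going by the message, the source rank has to pass one entry per rank in the process group, and here it passed 1 entry to a group of size 2. A small sketch of a check that could go right before the `dist.scatter` call on the source rank is below. `scatter_list_problem` is a hypothetical name, and in real code `world_size` would presumably come from `dist.get_world_size(group=self.process_group)`.

```python
def scatter_list_problem(scatter_list, world_size):
    """On the source rank, describe a length mismatch, or return None if sizes agree.

    Hypothetical helper: it only compares len(scatter_list) with world_size.
    """
    n_entries = len(scatter_list)
    if n_entries != world_size:
        return (
            f"scatter_list has {n_entries} entries "
            f"but the process group has {world_size} ranks"
        )
    return None


# Plain strings stand in for tensor chunks; only the list length matters here.
print(scatter_list_problem(["chunk0"], world_size=2))
print(scatter_list_problem(["chunk0", "chunk1"], world_size=2))
```

The first call reproduces the situation in the traceback (1 entry, 2 ranks) and returns a description of the mismatch; the second returns None.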

Did I miss any other variables or settings that need to change before using multiple GPUs? I'd like to know if there is a fix for this error. Thanks a lot!

Best, Yunah