jack-lac opened this issue 4 months ago
Facing the same issue; any help would be much appreciated. I get the following when I force-stop it:
Traceback (most recent call last):
  File "/home/user/PycharmProjects/neural_doc_ranking/src/cadr/RAGatouille/train.py", line 102, in <module>
    main()
  File "/home/user/PycharmProjects/neural_doc_ranking/src/cadr/RAGatouille/train.py", line 83, in main
    train(
  File "/home/user/PycharmProjects/neural_doc_ranking/src/cadr/RAGatouille/train.py", line 51, in train
    return model.train(data_dir=data_dir, training_config=training_config)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/site-packages/ragatouille/models/colbert.py", line 452, in train
    trainer.train(checkpoint=self.checkpoint)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/site-packages/colbert/trainer.py", line 31, in train
    self._best_checkpoint_path = launcher.launch(self.config, self.triples, self.queries, self.collection)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/site-packages/colbert/infra/launcher.py", line 72, in launch
    return_values = sorted([return_value_queue.get() for _ in all_procs])
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/site-packages/colbert/infra/launcher.py", line 72, in <listcomp>
    return_values = sorted([return_value_queue.get() for _ in all_procs])
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/multiprocessing/queues.py", line 103, in get
    res = self._recv_bytes()
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
Hey! This seems to be a multiprocessing issue. I'll try and figure out what it is, but this can often be finicky with the upstream ColBERT lib 🤔
@shubham526 are you making sure to run your code within `if __name__ == "__main__":`? In both cases, the problem appears to be that the individual processes spun up by colbert-ai's multiprocessing handler never actually start.
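For reference, a minimal sketch of that layout. The `train()` body here is just a placeholder standing in for whatever builds the model and calls its `.train()` (as in the traceback above); only the guard at the bottom is the point:

```python
# train.py -- sketch of the entry-point guard being suggested above.
# train() is a stand-in for the code that builds the RAGatouille/ColBERT
# model and calls its .train(); the guard is what actually matters.

def train(data_dir: str, training_config: dict) -> None:
    ...  # build the model, then call model.train(data_dir=..., training_config=...)


def main() -> None:
    train(data_dir="./data", training_config={})


if __name__ == "__main__":
    # colbert-ai launches worker processes via multiprocessing; with the "spawn"
    # start method each worker re-imports this module, so any code outside this
    # guard would run again in every worker and the launch can hang.
    main()
```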
@bclavie Yes, I am running it inside `if __name__ == "__main__":`.
Could you share some more information about your environment then? That could help point in the right direction! (torch version, CUDA version, Python version, pip freeze / overall environment (Ubuntu?))
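If it helps, something along these lines prints the relevant bits; it only relies on standard `torch` attributes:

```python
# Quick environment report: Python, OS, torch and CUDA versions, GPU visibility.
import platform
import sys

import torch

print("python:", sys.version)
print("platform:", platform.platform())
print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("gpu count:", torch.cuda.device_count())
```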
I have attached the output of pip freeze.
I tried this on another server and it worked. No idea why though.
Faced the same issue.
@jack-lac have you managed to resolve the problem? @shubham526 do you run your source code on Google Colab?
Also facing the same issue on Google Colab. I thought setting something along the lines of

trainer.model.config.nranks = 1
trainer.model.config.gpus = 1

would solve it, but it's not working either (still stuck at "Starting"). It's entirely upstream in ColBERT's training code, and may be due to Colab being somewhat more locked down in terms of permissions. Skimming over the code, I noticed that it needs to open some ports for communication between worker processes. I don't think wrapping the Colab code in an `if __name__ == '__main__':` guard would solve this.
edit: this issue could be interesting for further debugging, maybe it's data-related.
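For what it's worth, one generic thing to try (my assumption as a workaround, not something from the RAGatouille docs, and not verified to fix the Colab hang) is to restrict the visible GPUs before anything CUDA-related is imported, so the launcher has only one device to work with:

```python
# Sketch of a possible workaround, not a confirmed fix: expose only one GPU to
# the process *before* torch/ragatouille are imported. Depending on how the
# colbert-ai launcher derives nranks, this may keep it to a single worker.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch  # imported after the environment variable is set, on purpose

print("visible GPUs:", torch.cuda.device_count())
```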
Hi! I tried to execute the basic_training notebook on Google Colab. The `trainer.train()` phase gets stuck in what looks like an infinite loop for hours after printing only "#> Starting...". Here is what is displayed when I interrupt the execution: