AnswerDotAI / RAGatouille

Easily use and train state-of-the-art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease of use, backed by research.
Apache License 2.0

Trainer stuck #210

Open · jack-lac opened this issue 4 months ago

jack-lac commented 4 months ago

Hi! I tried to execute the basic_training notebook on Google Colab. The trainer.train() phase has been stuck in what looks like an infinite loop for hours after printing only "#> Starting...". Here is the traceback that is displayed when I interrupt the execution:

/usr/local/lib/python3.10/dist-packages/ragatouille/RAGTrainer.py in train(self, batch_size, nbits, maxsteps, use_ib_negatives, learning_rate, dim, doc_maxlen, use_relu, warmup_steps, accumsteps)
    236         )
    237 
--> 238         return self.model.train(data_dir=self.data_dir, training_config=training_config)

/usr/local/lib/python3.10/dist-packages/ragatouille/models/colbert.py in train(self, data_dir, training_config)
    450             )
    451 
--> 452             trainer.train(checkpoint=self.checkpoint)
    453 
    454     def _colbert_score(self, Q, D_padded, D_mask):

/usr/local/lib/python3.10/dist-packages/colbert/trainer.py in train(self, checkpoint)
     29         launcher = Launcher(train)
     30 
---> 31         self._best_checkpoint_path = launcher.launch(self.config, self.triples, self.queries, self.collection)
     32 
     33 

/usr/local/lib/python3.10/dist-packages/colbert/infra/launcher.py in launch(self, custom_config, *args)
     70         # TODO: If the processes crash upon join, raise an exception and don't block on .get() below!
     71 
---> 72         return_values = sorted([return_value_queue.get() for _ in all_procs])
     73         return_values = [val for rank, val in return_values]
     74 

/usr/local/lib/python3.10/dist-packages/colbert/infra/launcher.py in <listcomp>(.0)
     70         # TODO: If the processes crash upon join, raise an exception and don't block on .get() below!
     71 
---> 72         return_values = sorted([return_value_queue.get() for _ in all_procs])
     73         return_values = [val for rank, val in return_values]
     74 

/usr/lib/python3.10/multiprocessing/queues.py in get(self, block, timeout)
    101         if block and timeout is None:
    102             with self._rlock:
--> 103                 res = self._recv_bytes()
    104             self._sem.release()
    105         else:

/usr/lib/python3.10/multiprocessing/connection.py in recv_bytes(self, maxlength)
    214         if maxlength is not None and maxlength < 0:
    215             raise ValueError("negative maxlength")
--> 216         buf = self._recv_bytes(maxlength)
    217         if buf is None:
    218             self._bad_message_length()

/usr/lib/python3.10/multiprocessing/connection.py in _recv_bytes(self, maxsize)
    412 
    413     def _recv_bytes(self, maxsize=None):
--> 414         buf = self._recv(4)
    415         size, = struct.unpack("!i", buf.getvalue())
    416         if size == -1:

/usr/lib/python3.10/multiprocessing/connection.py in _recv(self, size, read)
    377         remaining = size
    378         while remaining > 0:
--> 379             chunk = read(handle, remaining)
    380             n = len(chunk)
    381             if n == 0:
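
For context, the call chain above is reached from a training script roughly like the following (a minimal sketch of the RAGatouille training flow; the model name, pairs, and paths are illustrative placeholders, not the notebook's exact data):

from ragatouille import RAGTrainer

# Placeholder run name plus the standard ColBERTv2 base checkpoint.
trainer = RAGTrainer(
    model_name="MyColBERT",
    pretrained_model_name="colbert-ir/colbertv2.0",
)

# (query, relevant_passage) pairs; prepare_training_data() writes the
# ColBERT-format triples/queries/collection files that train() reads back.
pairs = [
    ("What is ColBERT?", "ColBERT is a late-interaction retrieval model ..."),
]
trainer.prepare_training_data(raw_data=pairs, data_out_path="./data/")

# This is the call that prints "#> Starting..." and then hangs.
trainer.train(batch_size=32)
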
shubham526 commented 4 months ago

Facing the same issue; any help would be much appreciated.

I get the following when I force stop it:

Traceback (most recent call last):
  File "/home/user/PycharmProjects/neural_doc_ranking/src/cadr/RAGatouille/train.py", line 102, in <module>
    main()
  File "/home/user/PycharmProjects/neural_doc_ranking/src/cadr/RAGatouille/train.py", line 83, in main
    train(
  File "/home/user/PycharmProjects/neural_doc_ranking/src/cadr/RAGatouille/train.py", line 51, in train
    return model.train(data_dir=data_dir, training_config=training_config)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/site-packages/ragatouille/models/colbert.py", line 452, in train
    trainer.train(checkpoint=self.checkpoint)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/site-packages/colbert/trainer.py", line 31, in train
    self._best_checkpoint_path = launcher.launch(self.config, self.triples, self.queries, self.collection)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/site-packages/colbert/infra/launcher.py", line 72, in launch
    return_values = sorted([return_value_queue.get() for _ in all_procs])
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/site-packages/colbert/infra/launcher.py", line 72, in <listcomp>
    return_values = sorted([return_value_queue.get() for _ in all_procs])
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/multiprocessing/queues.py", line 103, in get
    res = self._recv_bytes()
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/user/anaconda3/envs/ikat-env/lib/python3.10/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
bclavie commented 4 months ago

Hey! This seems to be a multiprocessing issue. I'll try and figure out what it is, but this can often be finicky with the upstream ColBERT lib 🤔

@shubham526 are you making sure to run your code within if __name__ == "__main__":? In both cases, the problem appears to be that the individual processes spun up by colbert-ai's multiprocessing handler don't actually ever start.
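
(For anyone running this from a plain Python script rather than a notebook, the guard in question looks like the sketch below; the entry point is a placeholder.)

def main():
    # Build the RAGTrainer, prepare data, and call trainer.train(...) here.
    ...

# Without this guard, worker processes spawned by colbert-ai's launcher
# re-import the script, re-run the top-level training call, and the parent
# can end up blocked on the return_value_queue.get() seen in the tracebacks.
if __name__ == "__main__":
    main()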

shubham526 commented 4 months ago

@bclavie Yes, I am running it inside if __name__ == "__main__":

bclavie commented 4 months ago

Could you share some more information about your environment, then? That might help point in the right direction! (torch version, CUDA version, Python version, pip freeze / overall environment (Ubuntu?))

shubham526 commented 4 months ago

requirements.txt

I have attached the output of pip freeze.

shubham526 commented 4 months ago

I tried this on another server and it worked. No idea why though.

localcitizen commented 3 months ago

Faced the same issue.

@jack-lac have you managed to resolve the problem? @shubham526 do you run your source code on Google Colab?

viantirreau commented 2 months ago

Also facing the same issue on Google Colab. I thought setting something along the lines of

trainer.model.config.nranks = 1
trainer.model.config.gpus = 1

would solve it, but it's not working either (still stuck at "Starting"). The hang is entirely upstream, in ColBERT's training code, and may be due to Colab being somewhat more locked down in terms of permissions. Skimming over the code, I noticed that it needs to open some ports for communication between worker processes. I don't think wrapping the Colab code in an if __name__ == '__main__' guard would solve this.

edit: this issue could be interesting for further debugging, maybe it's data-related.
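
A workaround that is sometimes suggested for hangs in ColBERT's multi-process launcher (not verified to fix this particular issue) is to make only one GPU visible before anything touches CUDA, so the launcher has a single worker to spawn and join:

# Hypothetical workaround sketch, not a confirmed fix for this issue: make only
# one GPU visible so colbert-ai's Launcher spawns a single worker process.
# The environment variable must be set before torch/ragatouille are imported.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from ragatouille import RAGTrainer

def main():
    trainer = RAGTrainer(
        model_name="MyColBERT",                          # placeholder run name
        pretrained_model_name="colbert-ir/colbertv2.0",
    )
    pairs = [("What is ColBERT?", "ColBERT is a late-interaction retriever ...")]
    trainer.prepare_training_data(raw_data=pairs, data_out_path="./data/")
    trainer.train(batch_size=32)

if __name__ == "__main__":
    main()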