Closed nikhilno1 closed 4 years ago
Hi! Thanks for your contribution! Great first issue!
Is it the case that training actually completes but the command doesn't return? That is what I am used to seeing.
I can see pytorch_model.bin being created, which means training was successful. Closing the issue.
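For anyone hitting the same symptom, here is a minimal sketch of how one might check for the saved model file from a script instead of by eye. The function name and the `output_dir` argument are hypothetical; only the file name `pytorch_model.bin` comes from this thread. A non-empty file only suggests that a save happened, it does not prove the training run was correct.

```python
from pathlib import Path


def checkpoint_written(output_dir, filename="pytorch_model.bin"):
    """Return True if a non-empty checkpoint file exists in output_dir.

    `output_dir` is whatever directory the training script was told to
    save into (hypothetical here; it is not named in this thread).
    """
    path = Path(output_dir) / filename
    # is_file() guards against a directory of the same name;
    # the size check rules out an empty placeholder file.
    return path.is_file() and path.stat().st_size > 0
```

Calling `checkpoint_written("output")` after the command appears to hang would tell you whether the save step was reached, assuming `output` is the directory you passed to the training script.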
🐛 Bug
I am training a GPT2 model on a TPU, but training gets stuck with the following as the last line:
tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:54541
To Reproduce
I followed all the steps outlined in https://github.com/mgrankin/ru_transformers/tree/master/tpu to train a GPT2 model on a TPU on Google Cloud. As mentioned there, I was able to run the MNIST example without any issue:
python /pytorch/xla/test/test_train_mp_mnist.py
But when I ran the full training on a small dataset (10MB), just to make sure it runs successfully, training got stuck at the line above and did not proceed further. When I press Ctrl-C, I can see it is waiting in socket polling. I have tried restarting the TPU, but the same problem is observed.
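As a possible way to see where a process is waiting without pressing Ctrl-C, the standard-library `faulthandler` module can dump Python tracebacks after a timeout. This is only a sketch: the helper names `watch_for_hang` and `stop_watching` and the log path are hypothetical, and the 300-second default is chosen just to mirror the timeout in the log line. In a multi-process run it would presumably have to be called inside each training process, which is an assumption about how the script is structured.

```python
import faulthandler


def watch_for_hang(log_path, timeout_s=300.0):
    """Dump all thread tracebacks to log_path every timeout_s seconds.

    Returns the open file object; it must stay open while the watchdog
    is active, because faulthandler writes to its file descriptor.
    """
    log_file = open(log_path, "w")
    # repeat=True keeps dumping at each interval until cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    return log_file


def stop_watching(log_file):
    """Cancel the pending dump and close the log file."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()
```

A call such as `log = watch_for_hang("hang_trace.log")` at the start of training, followed by `stop_watching(log)` at the end, would leave a traceback in the file only if the run outlasts the timeout.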
Logs
TPU Hang.log
Expected behavior
Training should complete successfully.
Environment