Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Training on TPU stuck at "Waiting to connect to client mesh master (300 seconds) localhost:54541" #1090

Closed nikhilno1 closed 4 years ago

nikhilno1 commented 4 years ago

🐛 Bug

I am training a GPT2 model on a TPU, but training gets stuck with the following as the last log line:

`tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) localhost:54541`

To Reproduce

I followed all the steps outlined in https://github.com/mgrankin/ru_transformers/tree/master/tpu to train a GPT2 model on a TPU on Google Cloud. As mentioned there, I was able to run the MNIST example without any issue:

`python /pytorch/xla/test/test_train_mp_mnist.py`

However, when I ran the full training on a small dataset (10 MB), just to make sure it runs successfully, training got stuck at the line above and did not proceed further. When I press Ctrl-C, I can see that it is waiting in socket polling. I have tried restarting the TPU, but the same problem occurs.
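For anyone debugging a similar hang, a minimal sketch of capturing where the process is waiting without pressing Ctrl-C, using Python's standard `faulthandler` module. The helper names `enable_hang_diagnostics` and `disable_hang_diagnostics` are hypothetical and not part of the training script in this report; they would have to be called from the training entry point.

```python
import faulthandler
import sys


def enable_hang_diagnostics(timeout_s, stream=sys.stderr):
    """Periodically dump the stack of every thread while the process runs.

    Hypothetical helper: every `timeout_s` seconds, faulthandler writes the
    tracebacks of all threads to `stream`, which must expose a real file
    descriptor (sys.stderr or an open file).
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def disable_hang_diagnostics():
    """Stop the periodic dumps, e.g. once training has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `enable_hang_diagnostics(300)` at the start of training would print a traceback every 300 seconds, which should show whether the process is still blocked in the mesh-service connection or has moved on to training steps.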

Steps to reproduce the behavior:

  1. Run `fit.sh` from the repo at https://github.com/mgrankin/ru_transformers after completing all the necessary configuration.

Logs

TPU Hang.log

Expected behavior

Training should complete successfully.

Environment


```
Collecting environment information...
PyTorch version: 1.5.0a0+65bad41
Is debug build: No
CUDA used to build PyTorch: None

OS: Debian GNU/Linux 9 (stretch)
GCC version: (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
CMake version: version 3.14.0

Python version: 3.6
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] numpydoc==0.9.1
[pip] torch==1.5.0a0+65bad41
[pip] torch-xla==0.8+98a2790
[pip] torchvision==0.6.0a0+b6f28ec
[conda] blas                      1.0                         mkl
[conda] mkl                       2019.4                      243
[conda] mkl-service               2.3.0            py36he904b0f_0
[conda] mkl_fft                   1.0.14           py36ha843d7b_0
[conda] mkl_random                1.1.0            py36hd6b4f25_0
[conda] torch                     1.5.0a0+65bad41           <pip>
[conda] torch-xla                 0.8+98a2790               <pip>
[conda] torchvision               0.6.0a0+b6f28ec           <pip>
```

### Additional context

This is my first time using a TPU for training.

github-actions[bot] commented 4 years ago

Hi! Thanks for your contribution! Great first issue!

nikhilno1 commented 4 years ago

Could it be that training is actually completing but the command doesn't return? I am used to seeing the command return once training finishes.

nikhilno1 commented 4 years ago

I can see `pytorch_model.bin` being created, which means training was successful. Closing the issue.
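The check described above can be sketched in a few lines of Python. This is only an illustration: the helper name `checkpoint_written` and the output directory are assumptions, and a non-empty `pytorch_model.bin` only shows that a save happened, not that the model trained well.

```python
from pathlib import Path


def checkpoint_written(output_dir, name="pytorch_model.bin"):
    """Return True if the weights file exists in output_dir and is non-empty.

    Hypothetical helper; `output_dir` is whatever directory the training
    script was configured to save into.
    """
    path = Path(output_dir) / name
    return path.is_file() and path.stat().st_size > 0
```

For example, `checkpoint_written("output")` could be run from a second shell while the training command appears to hang, with `"output"` replaced by the actual save directory.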