Closed: simonepri closed this issue 4 years ago.
Hi! Thanks for your contribution, great first issue!
Any update? Can I help speed this up somehow?
I was facing the same issue on a Colab TPU instance.
pytorch-lightning==0.7.6
torch==1.6.0a0+246d7bb
torch-xla==1.6+62b4c42
torchvision==0.7.0a0+c2e8a00
Using trainer = pl.Trainer(resume_from_checkpoint=str(best_ckpt), num_tpu_cores=1)
followed by: trainer.test(model)
results in:
training on 1 TPU cores
INIT TPU local core: 0, global rank: 0
Exception in device=TPU:0: tensorflow/compiler/xla/xla_client/mesh_service.cc:259 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
tensorflow::CurrentStackTrace[abi:cxx11]()
xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
xla::service::MeshClient::Get()
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyFunction_FastCallDict
PyObject_Call
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyFunction_FastCallDict
PyObject_Call
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
PyObject_Call
Py_Main
main
__libc_start_main
_start
*** End stack trace ***
Failed to connect to client mesh master: 06e59f028bed:60141
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
fn(gindex, *args)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/distrib_parts.py", line 535, in tpu_train
self.run_pretrain_routine(model)
File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 951, in run_pretrain_routine
torch_xla.core.xla_model.rendezvous("pl.Trainer.run_pretrain_routine")
File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 679, in rendezvous
return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:259 : Check failed: impl_->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
*** Begin stack trace ***
tensorflow::CurrentStackTrace[abi:cxx11]()
xla::service::MeshClient::MeshClient(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
xla::service::MeshClient::Get()
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyFunction_FastCallDict
PyObject_Call
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyFunction_FastCallDict
PyObject_Call
_PyEval_EvalFrameDefault
PyObject_Call
_PyEval_EvalFrameDefault
PyObject_Call
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyFunction_FastCallDict
PyObject_Call
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
_PyEval_EvalFrameDefault
PyObject_Call
Py_Main
main
__libc_start_main
_start
*** End stack trace ***
Failed to connect to client mesh master: 06e59f028bed:60141
An exception has occurred, use %tb to see the full traceback.
Did you get any solution? I am facing the same issue.
Hi @inidhinarayan, I couldn't find a way around it. You might want to try with the latest repo!
@nidhinarayan can you let us know if this is still happening on master?
It should be fixed on master; feel free to reopen if needed 🐰
I spent some time debugging this issue. I suspect the problem occurs when Lightning loads the XLA weights back onto the device. The weights are saved by the master device xla:1 during training, and when they are reloaded they are automatically moved back to xla:1. At that point the current process acquires only one TPU core and treats it as the TPU device, so any subsequent call to xmp.spawn results in the error above.
To fix this we need to save the weights using xm.save() instead of torch.save(): xm.save() transfers the weights to the CPU before saving. This issue is related to https://github.com/PyTorchLightning/pytorch-lightning/pull/2726
I am still facing this issue today on Kaggle:
training on 8 TPU cores
Exception in device=TPU:0: Cannot replicate if number of devices (1) is different from 8
(the same exception is raised on each of TPU:1 through TPU:7)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 322, in _start_fn
    _setup_replication()
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 315, in _setup_replication
    xm.set_replication(device, [device])
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 317, in set_replication
    replication_devices = xla_replication_devices(devices)
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/core/xla_model.py", line 287, in xla_replication_devices
    format(len(local_devices), len(kind_devices)))
RuntimeError: Cannot replicate if number of devices (1) is different from 8
I'm experiencing the same issue with the [Lightning TPU example notebook](https://colab.research.google.com/github/PytorchLightning/pytorch-lightning/blob/master/notebooks/06-mnist-tpu-training.ipynb), run on Colab.
Both the single TPU core examples work, but when trying to run on 8 cores I get the error:
"RuntimeError: Cannot replicate if number of devices (1) is different from 8"
Having the same issue with a Kaggle kernel.
Exception in device=TPU:0: Cannot replicate if number of devices (1) is different from 8
Exception in device=TPU:1: Cannot replicate if number of devices (1) is different from 8
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
@LisburnLad @Sanjay03079 Did you run
trainer = pl.Trainer(max_epochs=3, progress_bar_refresh_rate=20, tpu_cores=[5])
before you ran
trainer = pl.Trainer(max_epochs=3, progress_bar_refresh_rate=20, tpu_cores=8)?
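For context, the sequence being asked about looks roughly like the sketch below (ToyModule is only a stand-in to exercise the Trainer). Per the explanation above, the single-core run claims one TPU core inside the process, so a later 8-core Trainer can no longer replicate across all 8 cores:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class ToyModule(pl.LightningModule):
    """Minimal LightningModule just to exercise the Trainer on TPU."""
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.mse_loss(self(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

    def train_dataloader(self):
        data = TensorDataset(torch.randn(64, 4), torch.randn(64, 2))
        return DataLoader(data, batch_size=8)

# Single-core run: the process claims TPU core 5 only.
pl.Trainer(max_epochs=3, progress_bar_refresh_rate=20, tpu_cores=[5]).fit(ToyModule())

# An 8-core run in the same process is then expected to fail with
# "Cannot replicate if number of devices (1) is different from 8".
pl.Trainer(max_epochs=3, progress_bar_refresh_rate=20, tpu_cores=8).fit(ToyModule())
```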
@lezwon I ran only trainer = pl.Trainer(max_epochs=3, progress_bar_refresh_rate=20, tpu_cores=8) and also tried restarting the kernel, but it did not work.
I am experiencing the same issue in #9712. https://app.circleci.com/pipelines/github/PyTorchLightning/pytorch-lightning/44836/workflows/20f3cc67-3596-4d27-8ecb-c909a3cf6577/jobs/132588/parallel-runs/0/steps/0-119
@Borda any insight here?
The problem feels unsolvable.
I confirm the problem still exists.
@satpalsr Could you share your reproducible script for the bug? I could take a look.
Hello, any news? I am facing the same problem.
🐛 Bug
When I run
trainer.test(model)
on a pre-trained model using a Colab TPU instance, the following exception is thrown.
Stack trace
Code sample
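A minimal sketch of the scenario (the checkpoint path and MyLightningModule are placeholders, not the original script):

```python
import pytorch_lightning as pl

model = MyLightningModule()  # placeholder for the pre-trained model's LightningModule

# Restore the pre-trained weights and run the test loop on a Colab TPU core.
trainer = pl.Trainer(resume_from_checkpoint="path/to/pretrained.ckpt", num_tpu_cores=1)
trainer.test(model)  # raises the mesh rendezvous error shown in the stack trace
```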
Environment
Colab TPU instance with XLA 1.5
Possibly related: https://github.com/PyTorchLightning/pytorch-lightning/pull/1019