NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech).
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Training does not work with Google Colab TPUs - ZeroDivisionError #2188

Closed: piraka9011 closed this issue 3 years ago

piraka9011 commented 3 years ago

Describe the bug

Training any model does not work on TPUs because of the way modelPT.py computes optim_config['sched']['t_num_workers'] here.

When you have 0 GPUs but are using TPUs, t_num_workers ends up as 0.

This causes a division-by-zero error here (see the sketch after the traceback):

Exception in device=TPU:0: division by zero
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 135, in tpu_train_in_process
    self.__setup_tpu_training(model, trainer)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/tpu_accelerator.py", line 222, in __setup_tpu_training
    self.setup_optimizers(model)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 150, in setup_optimizers
    optimizers, lr_schedulers, optimizer_frequencies = self.trainer.init_optimizers(model)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/optimizers.py", line 31, in init_optimizers
    optim_conf = model.configure_optimizers()
  File "/usr/local/lib/python3.7/dist-packages/nemo/core/classes/modelPT.py", line 699, in configure_optimizers
    self.setup_optimization()
  File "/usr/local/lib/python3.7/dist-packages/nemo/core/classes/modelPT.py", line 691, in setup_optimization
    optimizer=self._optimizer, scheduler_config=scheduler_config, train_dataloader=self._train_dl
  File "/usr/local/lib/python3.7/dist-packages/nemo/core/optim/lr_scheduler.py", line 596, in prepare_lr_scheduler
    drop_last=drop_last,
  File "/usr/local/lib/python3.7/dist-packages/nemo/core/optim/lr_scheduler.py", line 646, in compute_max_steps
    sampler_num_samples = math.ceil(num_samples / num_workers)
ZeroDivisionError: division by zero
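
For illustration, a minimal sketch of the failing computation, mirroring the division in compute_max_steps from the traceback; the guarded variant at the end is a hypothetical fix, not NeMo's actual code:

import math

# Mirrors the line in nemo/core/optim/lr_scheduler.py (compute_max_steps):
# with 0 GPUs on a TPU-only run, num_workers arrives here as 0.
def per_worker_samples(num_samples: int, num_workers: int) -> int:
    return math.ceil(num_samples / num_workers)  # ZeroDivisionError when num_workers == 0

# Hypothetical guard: clamp the worker count to at least 1 before dividing.
def per_worker_samples_guarded(num_samples: int, num_workers: int) -> int:
    return math.ceil(num_samples / max(1, num_workers))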

Steps/Code to reproduce bug

Follow any Colab notebook with TPU support and set trainer.tpu_cores=8 in the config (an override-based variant is sketched after the snippet below).

from hydra.experimental import compose, initialize
import pytorch_lightning as pl
from omegaconf import OmegaConf

from nemo.collections.asr.models import EncDecCTCModel
from nemo.utils.exp_manager import exp_manager

# Compose the Hydra config (expects configs/config.yaml with trainer.tpu_cores=8).
with initialize(config_path="configs"):
    cfg = compose(config_name="config")

print(f'Hydra config: {OmegaConf.to_yaml(cfg)}')

# Build the trainer from the config, set up experiment logging, and train.
trainer = pl.Trainer(**cfg.trainer)
exp_manager(trainer, cfg.get("exp_manager", None))
asr_model = EncDecCTCModel(cfg=cfg.model, trainer=trainer)

trainer.fit(asr_model)
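
As a variant of the repro above, the TPU settings can also be forced via Hydra overrides rather than editing the YAML; the override keys here assume a standard NeMo trainer config section:

# Hypothetical override-based compose: request a TPU-only trainer
# without modifying configs/config.yaml.
with initialize(config_path="configs"):
    cfg = compose(
        config_name="config",
        overrides=["trainer.tpu_cores=8", "trainer.gpus=0"],
    )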

Expected behavior

Training works.

Additional context

Installed PyTorch XLA using:

pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.7-cp37-cp37m-linux_x86_64.whl
pip install pytorch-lightning

piraka9011 commented 3 years ago

I'm not too familiar with PTL's internals, but I'm guessing the solution is to check whether GPUs or TPUs are available, not just GPUs.
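
A rough sketch of that idea, with hypothetical attribute names (the actual fix would go where t_num_workers is computed in modelPT.py):

# Hypothetical device-aware worker count: fall back to TPU cores when
# no GPUs are configured, and never return 0.
def infer_num_workers(trainer) -> int:
    gpus = getattr(trainer, "num_gpus", 0) or 0   # 0 on a TPU-only run
    tpus = getattr(trainer, "tpu_cores", 0) or 0
    return max(1, (gpus or tpus) * trainer.num_nodes)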

titu1994 commented 3 years ago

We don't support TPU training, and it's not tested. If you get it working, please let us know.

okuchaiev commented 3 years ago

Looks like you can try disabling the learning rate scheduler. But yes, training on TPUs isn't something we maintain; in Google Colab you should be able to just use GPUs.
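
A sketch of that workaround, assuming the usual model.optim.sched layout of NeMo configs: dropping the scheduler block before building the model means setup_optimization() never reaches the max-steps computation that divides by the worker count.

from omegaconf import open_dict

# Hypothetical workaround: remove the scheduler section so no max_steps
# (and no division by num_workers) is ever computed.
with open_dict(cfg.model.optim):
    cfg.model.optim.pop("sched", None)

asr_model = EncDecCTCModel(cfg=cfg.model, trainer=trainer)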