NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Can't train/finetune a model on two RTX4090 #8993

Closed · maxpain closed this issue 5 months ago

maxpain commented 6 months ago

Describe the bug

Can't train/fine-tune the TitaNet model using two RTX 4090s.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[22], line 1
----> 1 trainer.fit(speaker_model)

File ~/miniconda3/envs/nemo/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py:532, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    530 self.strategy._lightning_module = model
    531 _verify_strategy_supports_compile(model, self.strategy)
--> 532 call._call_and_handle_interrupt(
    533     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    534 )

File ~/miniconda3/envs/nemo/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py:42, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     40 try:
     41     if trainer.strategy.launcher is not None:
---> 42         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
     43     return trainer_fn(*args, **kwargs)
     45 except _TunerExitException:

File ~/miniconda3/envs/nemo/lib/python3.12/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py:101, in _MultiProcessingLauncher.launch(self, function, trainer, *args, **kwargs)
     99 self._check_torchdistx_support()
    100 if self._start_method in ("fork", "forkserver"):
--> 101     _check_bad_cuda_fork()
    103 # The default cluster environment in Lightning chooses a random free port number
    104 # This needs to be done in the main process here before starting processes to ensure each rank will connect
    105 # through the same port
    106 assert self._strategy.cluster_environment is not None

File ~/miniconda3/envs/nemo/lib/python3.12/site-packages/lightning_fabric/strategies/launchers/multiprocessing.py:192, in _check_bad_cuda_fork()
    190 if _IS_INTERACTIVE:
    191     message += " You will have to restart the Python kernel."
--> 192 raise RuntimeError(message)

RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call `torch.cuda.*` functions, have moved the model to the device, or allocated memory on the GPU any other way? Please remove any such calls, or change the selected strategy. You will have to restart the Python kernel.
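
The check that raises here is Lightning's guard against forking worker processes after CUDA has already been initialized in the parent process. A minimal sketch of how to confirm that condition from the notebook before calling `trainer.fit()` (this check is not part of the tutorial):

```python
# Sketch: confirm whether CUDA is already live in the notebook kernel.
# If this prints True, a fork-based multi-GPU launcher will refuse to start
# worker processes, which is exactly what the RuntimeError above reports.
import torch

print(torch.cuda.is_initialized())
```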

Steps/Code to reproduce bug

  1. Open https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb
  2. Set config.trainer.devices = 2
  3. Run it

Environment overview

Environment details

Additional context

2x RTX4090

nithinraok commented 5 months ago

This looks to me like an environment issue.

Please run it with Python 3.10.

FYI: @athitten

maxpain commented 5 months ago

This is because I ran it in a Jupyter notebook, so CUDA was already initialized in the kernel before `trainer.fit()` tried to launch the worker processes.
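
For reference, a minimal sketch of running the same training as a standalone script instead of inside the notebook; outside an interactive session, the regular DDP strategy relaunches the script per rank rather than forking the interpreter, so the CUDA-initialization check never triggers. The config path and manifest filenames below are placeholders, not the tutorial's exact values:

```python
# train_titanet_ddp.py -- a sketch following the tutorial's pattern; paths are placeholders.
import pytorch_lightning as pl
from omegaconf import OmegaConf

import nemo.collections.asr as nemo_asr


def main():
    # Load the TitaNet config used by the tutorial (path is a placeholder).
    config = OmegaConf.load("conf/titanet-small.yaml")
    config.model.train_ds.manifest_filepath = "train.json"     # placeholder
    config.model.validation_ds.manifest_filepath = "dev.json"  # placeholder

    # Two GPUs with plain DDP, launched from a script rather than a notebook.
    config.trainer.devices = 2
    config.trainer.accelerator = "gpu"
    config.trainer.strategy = "ddp"

    trainer = pl.Trainer(**config.trainer)
    speaker_model = nemo_asr.models.EncDecSpeakerLabelModel(cfg=config.model, trainer=trainer)
    trainer.fit(speaker_model)


if __name__ == "__main__":
    main()
```

Running it with `python train_titanet_ddp.py` should let Lightning start one process per device without hitting the fork restriction.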