Can't train/fine-tune a TitaNet model using two RTX 4090 GPUs.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[22], line 1
----> 1 trainer.fit(speaker_model)
File ~/miniconda3/envs/nemo/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py:532, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
530 self.strategy._lightning_module = model
531 _verify_strategy_supports_compile(model, self.strategy)
--> 532 call._call_and_handle_interrupt(
533 self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
534 )
File ~/miniconda3/envs/nemo/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py:42, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
40 try:
41 if trainer.strategy.launcher is not None:
---> 42 return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
43 return trainer_fn(*args, **kwargs)
45 except _TunerExitException:
File ~/miniconda3/envs/nemo/lib/python3.12/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py:101, in _MultiProcessingLauncher.launch(self, function, trainer, *args, **kwargs)
99 self._check_torchdistx_support()
100 if self._start_method in ("fork", "forkserver"):
--> 101 _check_bad_cuda_fork()
103 # The default cluster environment in Lightning chooses a random free port number
104 # This needs to be done in the main process here before starting processes to ensure each rank will connect
105 # through the same port
106 assert self._strategy.cluster_environment is not None
File ~/miniconda3/envs/nemo/lib/python3.12/site-packages/lightning_fabric/strategies/launchers/multiprocessing.py:192, in _check_bad_cuda_fork()
190 if _IS_INTERACTIVE:
191 message += " You will have to restart the Python kernel."
--> 192 raise RuntimeError(message)
RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call `torch.cuda.*` functions, have moved the model to the device, or allocated memory on the GPU any other way? Please remove any such calls, or change the selected strategy. You will have to restart the Python kernel.
Describe the bug
Training/fine-tuning a TitaNet model on two RTX 4090 GPUs fails. With config.trainer.devices = 2, calling trainer.fit(speaker_model) from a Jupyter notebook raises the RuntimeError shown above: Lightning's fork-based multiprocessing launcher refuses to start worker processes because CUDA was already initialized in the parent (notebook) process. Examples of what trips this check follow below.
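For reference, per the error message, any call that initializes the CUDA context in the notebook process before trainer.fit is enough to trigger the check. A few hypothetical examples (none of these appear verbatim in this report):

import torch

torch.cuda.init()                      # explicit initialization
x = torch.zeros(1, device="cuda")      # allocating memory on the GPU
# model.cuda()                         # or moving a model to the device

print(torch.cuda.is_initialized())     # True -> fork-based launch will now fail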
Steps/Code to reproduce bug
config.trainer.devices = 2
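For completeness, a minimal sketch of the notebook flow that hits the error. The config file name and model class follow the standard NeMo speaker-recognition recipe and are assumptions; the report itself only shows the devices override:

import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

config = OmegaConf.load("conf/titanet-finetune.yaml")  # assumed recipe config
config.trainer.devices = 2                             # the override from this report
config.trainer.accelerator = "gpu"

trainer = pl.Trainer(**config.trainer)
speaker_model = nemo_asr.models.EncDecSpeakerLabelModel(cfg=config.model, trainer=trainer)
trainer.fit(speaker_model)  # raises the RuntimeError above if CUDA was already initialized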
Environment overview
Conda environment (~/miniconda3/envs/nemo) with Python 3.12, run interactively from a Jupyter notebook (Cell In[22] in the traceback).
Environment details
PyTorch Lightning and lightning_fabric installed in the environment's site-packages (exact versions not reported).
Additional context
2x RTX 4090
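A possible workaround, following the hint in the error message (a sketch, not a verified fix): restart the kernel, avoid touching the GPU before trainer.fit, or move training into a standalone script, where Lightning can use its subprocess-based DDP launcher instead of fork:

# run_finetune.py -- hypothetical standalone script; launching it with
# `python run_finetune.py` sidesteps the fork-after-CUDA-init restriction.
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

if __name__ == "__main__":
    config = OmegaConf.load("conf/titanet-finetune.yaml")  # assumed config, as above
    config.trainer.devices = 2
    config.trainer.accelerator = "gpu"
    config.trainer.strategy = "ddp"  # subprocess launcher; safe outside notebooks

    trainer = pl.Trainer(**config.trainer)
    speaker_model = nemo_asr.models.EncDecSpeakerLabelModel(cfg=config.model, trainer=trainer)
    trainer.fit(speaker_model)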