NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Can't train/finetune a model on two RTX4090 #8993

Closed · maxpain closed this issue 5 months ago

maxpain commented 6 months ago

Describe the bug

Can't train/fine-tune the TitaNet model using two RTX 4090s.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[22], line 1
----> 1 trainer.fit(speaker_model)

File ~/miniconda3/envs/nemo/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py:532, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    530 self.strategy._lightning_module = model
    531 _verify_strategy_supports_compile(model, self.strategy)
--> 532 call._call_and_handle_interrupt(
    533     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    534 )

File ~/miniconda3/envs/nemo/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py:42, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     40 try:
     41     if trainer.strategy.launcher is not None:
---> 42         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
     43     return trainer_fn(*args, **kwargs)
     45 except _TunerExitException:

File ~/miniconda3/envs/nemo/lib/python3.12/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py:101, in _MultiProcessingLauncher.launch(self, function, trainer, *args, **kwargs)
     99 self._check_torchdistx_support()
    100 if self._start_method in ("fork", "forkserver"):
--> 101     _check_bad_cuda_fork()
    103 # The default cluster environment in Lightning chooses a random free port number
    104 # This needs to be done in the main process here before starting processes to ensure each rank will connect
    105 # through the same port
    106 assert self._strategy.cluster_environment is not None

File ~/miniconda3/envs/nemo/lib/python3.12/site-packages/lightning_fabric/strategies/launchers/multiprocessing.py:192, in _check_bad_cuda_fork()
    190 if _IS_INTERACTIVE:
    191     message += " You will have to restart the Python kernel."
--> 192 raise RuntimeError(message)

RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call `torch.cuda.*` functions, have moved the model to the device, or allocated memory on the GPU any other way? Please remove any such calls, or change the selected strategy. You will have to restart the Python kernel.
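
The check that raises here is Lightning's guard against forking worker processes after CUDA has already been initialized in the parent process. A minimal sketch of how to confirm that condition from the notebook before calling `trainer.fit()` (this check is not part of the tutorial):

```python
# Sketch: confirm whether CUDA is already live in the notebook kernel.
# If this prints True, a fork-based multi-GPU launcher will refuse to start
# worker processes, which is exactly what the RuntimeError above reports.
import torch

print(torch.cuda.is_initialized())
```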

Steps/Code to reproduce bug

  1. Open https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb
  2. Set config.trainer.devices = 2
  3. Run it

Environment overview

Environment details

Additional context

2x RTX4090

nithinraok commented 5 months ago

This looks to me like an environment issue.

Please run it with Python 3.10.

FYI: @athitten

maxpain commented 5 months ago

This is because I ran it in a Jupyter notebook, so CUDA was already initialized in the kernel before `trainer.fit()` tried to launch the worker processes.
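
For reference, a minimal sketch of running the same training as a standalone script instead of inside the notebook; outside an interactive session, the regular DDP strategy relaunches the script per rank rather than forking the interpreter, so the CUDA-initialization check never triggers. The config path and manifest filenames below are placeholders, not the tutorial's exact values:

```python
# train_titanet_ddp.py -- a sketch following the tutorial's pattern; paths are placeholders.
import pytorch_lightning as pl
from omegaconf import OmegaConf

import nemo.collections.asr as nemo_asr


def main():
    # Load the TitaNet config used by the tutorial (path is a placeholder).
    config = OmegaConf.load("conf/titanet-small.yaml")
    config.model.train_ds.manifest_filepath = "train.json"     # placeholder
    config.model.validation_ds.manifest_filepath = "dev.json"  # placeholder

    # Two GPUs with plain DDP, launched from a script rather than a notebook.
    config.trainer.devices = 2
    config.trainer.accelerator = "gpu"
    config.trainer.strategy = "ddp"

    trainer = pl.Trainer(**config.trainer)
    speaker_model = nemo_asr.models.EncDecSpeakerLabelModel(cfg=config.model, trainer=trainer)
    trainer.fit(speaker_model)


if __name__ == "__main__":
    main()
```

Running it with `python train_titanet_ddp.py` should let Lightning start one process per device without hitting the fork restriction.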