NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Unable to train fastpitch_finetune with nemo:1.8.0 #4035

Closed: itzsimpl closed this issue 2 years ago

itzsimpl commented 2 years ago

When I try to run examples/tts/fastpitch_finetune.py with fastpitch_align_v1.05.yaml using DDP on multiple GPUs, training immediately crashes with the following trace:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1351, in _run_train
    self.fit_loop.run()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 231, in advance
    self.trainer._call_callback_hooks("on_train_batch_end", batch_end_outputs, batch, batch_idx, **extra_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1628, in _call_callback_hooks
    self._on_train_batch_end(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1660, in _on_train_batch_end
    callback.on_train_batch_end(self, self.lightning_module, outputs, batch, batch_idx)
  File "/workspace/nemo/nemo/utils/exp_manager.py", line 144, in on_train_batch_end
    self._on_batch_end("train_step_timing", pl_module)
  File "/workspace/nemo/nemo/utils/exp_manager.py", line 138, in _on_batch_end
    pl_module.log(name, self.timer[name], on_step=True, on_epoch=False)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 381, in log
    value = apply_to_collection(value, numbers.Number, self.__to_tensor)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 515, in __to_tensor
    return torch.tensor(value, device=self.device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "examples/tts/fastpitch_finetune.py", line 41, in <module>
    main()  # noqa pylint: disable=no-value-for-parameter
  File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
    _run_hydra(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "examples/tts/fastpitch_finetune.py", line 37, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 736, in _call_and_handle_interrupt
    self._teardown()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1298, in _teardown
    self.strategy.teardown()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 447, in teardown
    super().teardown()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/parallel.py", line 134, in teardown
    super().teardown()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 444, in teardown
    optimizers_to_device(self.optimizers, torch.device("cpu"))
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/optimizer.py", line 27, in optimizers_to_device
    optimizer_to_device(opt, device)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/optimizer.py", line 33, in optimizer_to_device
    optimizer.state[p] = apply_to_collection(v, torch.Tensor, move_data_to_device, device)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 107, in apply_to_collection
    v = apply_to_collection(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 354, in move_data_to_device
    return apply_to_collection(batch, dtype=dtype, function=batch_to)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 99, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 347, in batch_to
    data_output = data.to(device, **kwargs)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

This does not happen if I run on a single GPU.

itzsimpl commented 2 years ago

A little update. Fiddling with fastpitch_align_v1.05.yaml, it seems the issue arises from the default optimizer. The default optimizer is lamb, and as soon as I change it to e.g. adamw, I can safely run on multiple GPUs.
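
For anyone else hitting this, the switch does not require editing the YAML; below is a minimal sketch of the Hydra override (paths follow the stock NeMo repo layout, and the usual dataset and trainer overrides are omitted and passed exactly as before):

# Sketch: leave fastpitch_align_v1.05.yaml untouched and override only the optimizer.
# Add your train_dataset / validation_datasets / sup_data_path / trainer.* overrides as usual.
python examples/tts/fastpitch_finetune.py \
  --config-path=conf \
  --config-name=fastpitch_align_v1.05 \
  model.optim.name=adamw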

itzsimpl commented 2 years ago

Changing the optimizer fixes training on a single node with multiple GPUs, but multi-node multi-GPU distributed training still crashes, this time with the following trace:

Traceback (most recent call last):
  File "examples/tts/fastpitch_finetune.py", line 41, in <module>
    main()  # noqa pylint: disable=no-value-for-parameter
  File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
    _run_hydra(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "examples/tts/fastpitch_finetune.py", line 37, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1170, in _run
    self.__setup_profiler()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1795, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 2232, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 311, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1872, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1188, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled system error, NCCL version 21.2.9
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

I'm running on SLURM 21.08.5 with NVIDIA drivers 510.47.03; nemo:1.8.0 was built by cloning the repo into pytorch:22.03-py3 and running reinstall.sh.

redoctopus commented 2 years ago

Can you provide the commands you were using to run the script, and any config changes? Are there any extra error logs on the other node(s)?

Also, I believe PyTorch Lightning has some trouble if the nodes have a different number of GPUs. Can you check if this is the case?

itzsimpl commented 2 years ago

This specific trace comes from a cluster with 2 nodes, both with 8 GPUs. They are different nodes, with different network, CPU, RAM, and GPU type configurations. I have also tried ASR multi-node multi-GPU training on this cluster, and the result is the same.

The command I am running is as follows

export NCCL_IB_DISABLE=1
export NCCL_DEBUG=INFO

# train
python examples/tts/fastpitch_finetune.py \
  --config-path=conf \
  --config-name=fastpitch_align_v1.05 \
  exp_manager.name=fastpitch \
  +exp_manager.exp_dir=/experiments \
  +exp_manager.version=20220421-0818 \
  exp_manager.resume_if_exists=true \
  exp_manager.resume_ignore_no_checkpoint=true \
  +exp_manager.checkpoint_callback_params.save_best_model=true \
  train_dataset=/data/manifests/manifest_v4_train.json \
  validation_datasets=/data/manifests/manifest_v4_valid.json \
  sup_data_path=/data/sup/v4 \
  ~phoneme_dict_path \
  ~heteronyms_path \
  ~whitelist_path \
  ~model.text_normalizer \
  ~model.text_normalizer_call_kwargs \
  ~model.text_tokenizer \
  +model.text_tokenizer='{_target_:nemo.collections.tts.torch.tts_tokenizers.EnglishCharsTokenizer,apostrophe:true,pad_with_space:true}' \
  model.train_ds.dataloader_params.batch_size=24 \
  model.train_ds.dataloader_params.num_workers=4 \
  model.validation_ds.dataloader_params.batch_size=24 \
  model.validation_ds.dataloader_params.num_workers=4 \
  model.optim.name=adamw \
  trainer.devices=-1 \
  trainer.num_nodes=2 \
  trainer.precision=32

and this is the corresponding log output

[NeMo W 2022-04-21 08:18:55 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
      warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)

[NeMo W 2022-04-21 08:18:56 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-04-21 08:18:57 fastpitch_finetune:27] You are using an optimizer scheduler while finetuning. Are you sure this is intended?
[NeMo W 2022-04-21 08:18:57 fastpitch_finetune:29] The recommended learning rate for finetuning is 2e-4
Multiprocessing is handled by SLURM.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo E 2022-04-21 08:18:57 exp_manager:368] You are running multi-node training without SLURM handling the processes. Please note that this is not tested in NeMo and could result in errors.
[NeMo W 2022-04-21 08:18:57 exp_manager:409] There was no checkpoint folder at checkpoint_dir :/experiments/fastpitch/20220421-0818/checkpoints. Training from scratch.
[NeMo I 2022-04-21 08:18:57 exp_manager:281] Experiments will be logged at /experiments/fastpitch/20220421-0818
[NeMo I 2022-04-21 08:18:57 exp_manager:647] TensorboardLogger has been set up
[NeMo W 2022-04-21 08:18:57 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:2302: LightningDeprecationWarning: `Trainer.weights_save_path` has been deprecated in v1.6 and will be removed in v1.8.
      rank_zero_deprecation("`Trainer.weights_save_path` has been deprecated in v1.6 and will be removed in v1.8.")

Created a temporary directory at /tmp/tmpo20qduxs
Writing /tmp/tmpo20qduxs/_remote_module_non_sriptable.py
[NeMo I 2022-04-21 08:18:57 data:173] Loading dataset from /data/manifests/manifest_v4_train.json.
[NeMo I 2022-04-21 08:18:58 data:207] Loaded dataset with 10322 files.
[NeMo I 2022-04-21 08:18:58 data:209] Dataset contains 16.66 hours.
[NeMo I 2022-04-21 08:18:58 data:297] Pruned 0 files. Final dataset contains 10322 files
[NeMo I 2022-04-21 08:18:58 data:299] Pruned 0.00 hours. Final dataset contains 16.66 hours.
[NeMo I 2022-04-21 08:18:58 data:173] Loading dataset from /data/manifests/manifest_v4_valid.json.

0it [00:00, ?it/s]
4055it [00:00, 40546.46it/s]
8221it [00:00, 41197.27it/s]
10322it [00:00, 41361.45it/s]
[NeMo I 2022-04-21 08:18:58 data:207] Loaded dataset with 543 files.
[NeMo I 2022-04-21 08:18:58 data:209] Dataset contains 0.84 hours.
[NeMo I 2022-04-21 08:18:58 data:297] Pruned 0 files. Final dataset contains 543 files
[NeMo I 2022-04-21 08:18:58 data:299] Pruned 0.00 hours. Final dataset contains 0.84 hours.
[NeMo I 2022-04-21 08:18:58 features:259] PADDING: 1
[NeMo I 2022-04-21 08:18:58 features:276] STFT using torch

0it [00:00, ?it/s]
543it [00:00, 34682.67it/s]
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Created a temporary directory at /tmp/tmpmy42n7nr
Writing /tmp/tmpmy42n7nr/_remote_module_non_sriptable.py

0it [00:00, ?it/s]
3771it [00:00, 37706.53it/s]
7542it [00:00, 37416.53it/s]
10322it [00:00, 37234.35it/s]

0it [00:00, ?it/s]
543it [00:00, 22735.51it/s]
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Added key: store_based_barrier_key:1 to store for rank: 1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
gpu01:596387:596387 [0] NCCL INFO Bootstrap : Using enp225s0f1:10.10.10.2<0>
gpu01:596387:596387 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v5 symbol.
gpu01:596387:596387 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol.
gpu01:596387:596387 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
gpu01:596387:596387 [0] NCCL INFO P2P plugin IBext
gpu01:596387:596387 [0] NCCL INFO NET/IB : Using [0]mlx5_9:1/RoCE ; OOB enp225s0f1:10.10.10.2<0>
gpu01:596387:596387 [0] NCCL INFO Using network IBext
NCCL version 2.12.9+cuda11.6
gpu02:693036:693036 [0] NCCL INFO Bootstrap : Using eno1:10.10.10.3<0>
gpu02:693036:693036 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v5 symbol.
gpu02:693036:693036 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol.
gpu02:693036:693036 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
gpu02:693036:693036 [0] NCCL INFO P2P plugin IBext
gpu02:693036:693036 [0] NCCL INFO NET/IB : No device found.
gpu02:693036:693036 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
gpu02:693036:693036 [0] NCCL INFO NET/Socket : Using [0]eno1:10.10.10.3<0> [1]vethd257813:fe80::7801:5ff:fe18:2cbc%vethd257813<0> [2]veth3fb8ef2:fe80::dc58:96ff:fe78:44b%veth3fb8ef2<0>
gpu02:693036:693036 [0] NCCL INFO Using network Socket
gpu02:693036:693058 [0] NCCL INFO Setting affinity for GPU 0 to 03,00000003
gpu01:596387:597126 [0] NCCL INFO PXN Disabled as plugin is v4
gpu01:596387:597126 [0] NCCL INFO Channel 00/02 :    0   1
gpu01:596387:597126 [0] NCCL INFO Channel 01/02 :    0   1
gpu01:596387:597126 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
gpu02:693036:693058 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
gpu01:596387:597126 [0] NCCL INFO Channel 00/0 : 1[1b000] -> 0[7000] [receive] via NET/IBext/0
gpu02:693036:693058 [0] NCCL INFO Channel 00/0 : 0[7000] -> 1[1b000] [receive] via NET/Socket/1
gpu02:693036:693058 [0] NCCL INFO Channel 01/0 : 0[7000] -> 1[1b000] [receive] via NET/Socket/1
gpu02:693036:693058 [0] NCCL INFO Channel 00/0 : 1[1b000] -> 0[7000] [send] via NET/Socket/1
gpu02:693036:693058 [0] NCCL INFO Channel 01/0 : 1[1b000] -> 0[7000] [send] via NET/Socket/1
gpu01:596387:597126 [0] NCCL INFO Channel 01/0 : 1[1b000] -> 0[7000] [receive] via NET/IBext/0
gpu01:596387:597126 [0] NCCL INFO Channel 00/0 : 0[7000] -> 1[1b000] [send] via NET/IBext/0
gpu01:596387:597126 [0] NCCL INFO Channel 01/0 : 0[7000] -> 1[1b000] [send] via NET/IBext/0

gpu01:596387:597129 [0] ../include/socket.h:407 NCCL WARN Connect to fe80::7801:5ff:fe18:2cbc%ibp75s0<45553> failed : Network is unreachable
gpu01:596387:597129 [0] NCCL INFO ib_plugin.c:266 -> 2
gpu01:596387:597129 [0] NCCL INFO include/net.h:25 -> 2
gpu01:596387:597129 [0] NCCL INFO transport/net.cc:515 -> 2
gpu01:596387:597129 [0] NCCL INFO proxy.cc:914 -> 2
gpu01:596387:597129 [0] NCCL INFO proxy.cc:942 -> 2

gpu01:596387:597129 [0] proxy.cc:1042 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 2

gpu01:596387:597129 [0] ../include/socket.h:407 NCCL WARN Connect to fe80::7801:5ff:fe18:2cbc%ibp75s0<45553> failed : Network is unreachable
gpu01:596387:597129 [0] NCCL INFO ib_plugin.c:266 -> 2
gpu01:596387:597129 [0] NCCL INFO include/net.h:25 -> 2
gpu01:596387:597129 [0] NCCL INFO transport/net.cc:515 -> 2
gpu01:596387:597129 [0] NCCL INFO proxy.cc:914 -> 2

gpu01:596387:597129 [0] proxy.cc:1042 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 2

gpu01:596387:597126 [0] misc/socket.cc:523 NCCL WARN Net : Connection closed by remote peer gpu01<52753>
gpu01:596387:597126 [0] NCCL INFO misc/socket.cc:531 -> 2
gpu01:596387:597126 [0] NCCL INFO misc/socket.cc:543 -> 2
gpu01:596387:597126 [0] NCCL INFO proxy.cc:805 -> 2

gpu01:596387:597126 [0] proxy.cc:808 NCCL WARN Proxy Call to rank 0 failed (Connect)
gpu01:596387:597126 [0] NCCL INFO transport/net.cc:269 -> 2
gpu01:596387:597126 [0] NCCL INFO transport.cc:127 -> 2
gpu01:596387:597126 [0] NCCL INFO init.cc:730 -> 2
gpu01:596387:597126 [0] NCCL INFO init.cc:915 -> 2
gpu01:596387:597126 [0] NCCL INFO group.cc:57 -> 2 [Async thread]
Error executing job with overrides: ['exp_manager.name=fastpitch', '+exp_manager.exp_dir=/experiments', '+exp_manager.version=20220421-0818', 'exp_manager.resume_if_exists=true', 'exp_manager.resume_ignore_no_checkpoint=true', '+exp_manager.checkpoint_callback_params.save_best_model=true', 'train_dataset=/data/manifests/manifest_v4_train.json', 'validation_datasets=/data/manifests/manifest_v4_valid.json', 'sup_data_path=/data/sup/v4', '~phoneme_dict_path', '~heteronyms_path', '~whitelist_path', '~model.text_normalizer', '~model.text_normalizer_call_kwargs', '~model.text_tokenizer', '+model.text_tokenizer={_target_:nemo.collections.tts.torch.tts_tokenizers.EnglishCharsTokenizer,apostrophe:true,pad_with_space:true}', 'model.train_ds.dataloader_params.batch_size=24', 'model.train_ds.dataloader_params.num_workers=4', 'model.validation_ds.dataloader_params.batch_size=24', 'model.validation_ds.dataloader_params.num_workers=4', 'model.optim.name=adamw', 'trainer.devices=-1', 'trainer.num_nodes=2', 'trainer.precision=32']
Traceback (most recent call last):
  File "examples/tts/fastpitch_finetune.py", line 41, in <module>
    main()  # noqa pylint: disable=no-value-for-parameter
  File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
    _run_hydra(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "examples/tts/fastpitch_finetune.py", line 37, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1170, in _run
    self.__setup_profiler()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1795, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 2232, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 311, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1872, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1188, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled system error, NCCL version 21.2.9
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

itzsimpl commented 2 years ago

Multi-node multi-GPU training works if I set the following NCCL env variables:

export NCCL_IB_DISABLE=1
export NCCL_IBEXT_DISABLE=1

The same can be achieved by setting only

export NCCL_NET=Socket

but with NCCL 2.12.9+cuda11.6 this leads to a segfault, so until the patch from https://github.com/NVIDIA/nccl/issues/676#issuecomment-1106254219 is merged, the first solution is preferred.
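
For anyone reproducing this, a rough sketch of where those exports sit in the SLURM batch script (the #SBATCH directives, GPU counts, and omitted overrides are assumptions, not the exact script used here):

#!/bin/bash
#SBATCH --nodes=2                 # assumption: adjust to your cluster
#SBATCH --ntasks-per-node=8       # assumption: one task per GPU so SLURM spawns the ranks for PTL
#SBATCH --gpus-per-node=8         # assumption

# Workaround from this thread: keep NCCL off the IB/IBext transports.
export NCCL_IB_DISABLE=1
export NCCL_IBEXT_DISABLE=1
export NCCL_DEBUG=INFO            # optional: keeps the transport selection visible in the log

# Plus the usual dataset/model overrides from the full command above.
srun python examples/tts/fastpitch_finetune.py \
  --config-path=conf --config-name=fastpitch_align_v1.05 \
  trainer.num_nodes=2 trainer.devices=-1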

itzsimpl commented 2 years ago

I'm reopening this issue just as a reminder that when using model.optim.name=lamb training still crashes.

redoctopus commented 2 years ago

Can you try running with CUDA_LAUNCH_BLOCKING=1? It may give a more informative stack trace.
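
(For example, prefixing the existing launch command so kernel launches become synchronous and the failing op is reported at its real call site; only the relevant subset of overrides is shown here:)

# Sketch: rerun the same command with synchronous CUDA launches to localize the fault.
CUDA_LAUNCH_BLOCKING=1 python examples/tts/fastpitch_finetune.py \
  --config-path=conf --config-name=fastpitch_align_v1.05 \
  model.optim.name=lamb trainer.devices=-1 trainer.num_nodes=2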

itzsimpl commented 2 years ago

Unfortunately there's not much more information

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1351, in _run_train
    self.fit_loop.run()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 203, in advance
    result = self._run_optimization(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 256, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 369, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1593, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1625, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 278, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 193, in optimizer_step
    return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 155, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/apex/optimizers/fused_lamb.py", line 124, in step
    g_norm_32 = multi_tensor_applier(self.multi_tensor_l2norm,
  File "/opt/conda/lib/python3.8/site-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__
    return op(self.chunk_size,
RuntimeError: CUDA error: an illegal memory access was encountered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "examples/tts/fastpitch_finetune.py", line 41, in <module>
    main()  # noqa pylint: disable=no-value-for-parameter
  File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
    _run_hydra(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "examples/tts/fastpitch_finetune.py", line 37, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 736, in _call_and_handle_interrupt
    self._teardown()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1298, in _teardown
    self.strategy.teardown()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 474, in teardown
    self.lightning_module.cpu()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 147, in cpu
    return super().cpu()
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 719, in cpu
    return self._apply(lambda t: t.cpu())
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 719, in <lambda>
    return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: an illegal memory access was encountered

While testing, however, I also discovered that if I add a small sleep to the script that is passed to SLURM, immediately before calling python, training suddenly starts working. The required sleep is hardware dependent: on 2x A5000 GPUs 5 s was enough, while on 2x A100 SXM4 GPUs 15 s was required.
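
In script form, the workaround amounts to something like this (the wrapper layout and sleep duration are assumptions; 5 s sufficed on 2x A5000 and 15 s on 2x A100 SXM4 here):

# Sketch: same sbatch skeleton as above, with a per-task delay before python starts.
srun bash -c '
  sleep 15   # HW dependent: give every rank time to come up before touching CUDA/NCCL
  exec python examples/tts/fastpitch_finetune.py \
    --config-path=conf --config-name=fastpitch_align_v1.05 \
    trainer.num_nodes=2 trainer.devices=-1   # plus the remaining overrides as before
'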

itzsimpl commented 2 years ago

Btw. are these sections of code still valid? https://github.com/NVIDIA/NeMo/blob/d9a329edf82ffed2cdc3ec1ca00539e90cbe7e3f/nemo/core/classes/modelPT.py#L486-L500 and https://github.com/NVIDIA/NeMo/blob/d9a329edf82ffed2cdc3ec1ca00539e90cbe7e3f/nemo/collections/tts/helpers/helpers.py#L93-L104

A recent change in PTL flags (https://github.com/NVIDIA/NeMo/pull/3589) renamed accelerator (dp, ddp, ddp2, ...) to strategy. Since then, accelerator holds values such as gpu, cpu, ...

itzsimpl commented 2 years ago

I have managed to get a bit more of the trace. Just by adding trainer.num_sanity_val_steps=0, the stack trace reliably ends with:

# ... the initial portion of the trace is the same as before
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 719, in <lambda>
    return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp:1230 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f249b784ecc in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1fa26 (0x7f249b7dca26 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x247 (0x7f249b7e2ec7 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x4392cc (0x7f24dbb7a2cc in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f249b76dff5 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x3360f9 (0x7f24dba770f9 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x63c9e2 (0x7f24dbd7d9e2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2f5 (0x7f24dbd7dd65 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #25: __libc_start_main + 0xf3 (0x7f2512d4d0b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)

titu1994 commented 2 years ago

Can you try with just CPU? If it works at that point, then it's easier to debug.

itzsimpl commented 2 years ago

Unfortunately this is not possible, as model.optim.name=lamb, as set in fastpitch_align_v1.05.yaml, is GPU-only (https://github.com/NVIDIA/apex/blob/22.03/apex/optimizers/fused_lamb.py). With adamw there is no error.

titu1994 commented 2 years ago

Hmm, you could change the optimizer to something else to see if that works (or does Adam already work?). If so, then it's probably an issue in the apex lamb call?

itzsimpl commented 2 years ago

As said, changing the optimizer to adam works, and so does adding a delay before starting the individual processes.

It is my assumption too that it is related to the lamb call, but I don’t know enough details to pinpoint it.

titu1994 commented 2 years ago

Hmm, adding a delay works... It's hard to debug because it might be something with the setup, the libraries, CUDA, or any number of other factors, and this debug log doesn't really say much about the source of the issue.

For now, I'd recommend using Adam as a replacement or applying the delay. It's possibly just an issue with apex or some race condition during DDP, but it's not trivial to debug.

redoctopus commented 2 years ago

I've been talking to @nithinraok and it's possible that this could be caused by the PTL flags you mentioned. Can you check if the changes in #4056 help at all?

itzsimpl commented 2 years ago

No, this does not help.

titu1994 commented 2 years ago

Could you post your git hash for apex, and if it's not the latest (or the one recommended in the README), could you upgrade your apex?

redoctopus commented 2 years ago

As far as I'm aware multi-GPU FastPitch hasn't been fully tested, so we'll have to take a closer look at this. In the meantime you may have to keep using the sleep workaround, or switch to adam until we get this resolved.

It looks like there may be an issue with distributed lamb in apex in the 22.03 PyTorch container (NVIDIA/apex#1354), just in case you might be on that version.

itzsimpl commented 2 years ago

@titu1994 I built nemo:1.8.1 starting off of pytorch:22.03-py3: I pulled NeMo from GitHub, checked out tag v1.8.1, and ran reinstall.sh. I did not install apex myself; it comes preinstalled in the pytorch:22.03-py3 container, and nemo/reinstall.sh does not update it.

I cannot see the git hash. Checking the apex version via Python shows no information, and only scarce info is provided in its PKG-INFO:

root@gpu01:/workspace# python3 -c "import apex; print(apex.__version__)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: module 'apex' has no attribute '__version__'
root@gpu01:/workspace# cat /opt/conda/lib/python3.8/site-packages/apex-0.1-py3.8.egg-info/PKG-INFO 
Metadata-Version: 2.1
Name: apex
Version: 0.1
Summary: PyTorch Extensions written by NVIDIA
Home-page: UNKNOWN
License: UNKNOWN
Platform: UNKNOWN
License-File: LICENSE

UNKNOWN
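
For completeness, a couple of generic probes that at least confirm where the preinstalled apex lives (neither recovers the git hash; for NGC images the container release notes are the authoritative record of the shipped apex commit):

# Sketch: probe the preinstalled apex package.
pip show apex
python3 -c "import apex, os; print(os.path.dirname(apex.__file__))"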

@redoctopus thanks, it must be this bug then. FYI: by changing the optimizer to adamw, at least in our case, FastPitch seems to be working fine in both the multi-GPU and the multi-node multi-GPU setting.

titu1994 commented 2 years ago

Can you try following the step below to install this git commit of apex? Maybe it solves it, but otherwise @redoctopus' suggestion is good enough for now while we debug this: https://github.com/NVIDIA/NeMo#megatron-gpt
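
For reference, a hedged sketch of what that from-source install looks like; substitute the exact commit (and any extra build flags) pinned in the linked README section:

# Sketch: install apex from source at the commit the NeMo README recommends.
git clone https://github.com/NVIDIA/apex
cd apex
git checkout <commit-from-the-NeMo-README>
pip install -v --disable-pip-version-check --no-cache-dir \
    --global-option="--cpp_ext" --global-option="--cuda_ext" ./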

itzsimpl commented 2 years ago

No, installing apex does not help; it must be the pytorch issue.

itzsimpl commented 2 years ago

Unfortunately, updating to pytorch:22.04-py3 does not resolve the issue.

redoctopus commented 2 years ago

After some discussion, we've concluded that adam makes more sense as a default optimizer for FastPitch anyway, so I'd suggest sticking with that for now. I'll push a change to switch it from lamb in the config soon.

We'll still try to resolve this bug since it shouldn't be crashing with lamb anyway. I've been able to reproduce the error locally.

redoctopus commented 2 years ago

We think it's likely an apex issue (and lamb tends to be unstable), so we're going to close this for now. We'll keep an eye on it, but our recommendation is to use adamw for FastPitch, UnivNet, and Mixer-TTS training.