A little update. Fiddling with fastpitch_align_v1.05.yaml, it seems that the issue arises from the default optimiser. Indeed, the default optimiser is lamb, and as soon as I change it to e.g. adamw, I can safely run on multiple GPUs.
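For illustration, the switch can be made directly from the command line with a Hydra override rather than by editing the YAML (a minimal sketch; the dataset and model overrides are omitted, and the learning-rate value is only an example):
# Hedged sketch: override the optimizer name (and optionally the learning rate) at launch time.
python examples/tts/fastpitch_finetune.py \
  --config-path=conf \
  --config-name=fastpitch_align_v1.05 \
  model.optim.name=adamw \
  model.optim.lr=2e-4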
Changing the optimizer solves training on a single node with multiple GPUs, but trying to run multi-node multi-GPU distributed training again results in a crash, this time with the following trace:
Traceback (most recent call last):
File "examples/tts/fastpitch_finetune.py", line 41, in <module>
main() # noqa pylint: disable=no-value-for-parameter
File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
_run_hydra(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
run_and_report(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
raise ex
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
lambda: hydra.run(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
_ = ret.return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "examples/tts/fastpitch_finetune.py", line 37, in main
trainer.fit(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
self._call_and_handle_interrupt(
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1170, in _run
self.__setup_profiler()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1795, in __setup_profiler
self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 2232, in log_dir
dirpath = self.strategy.broadcast(dirpath)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 311, in broadcast
torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1872, in broadcast_object_list
broadcast(object_sizes_tensor, src=src, group=group)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1188, in broadcast
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled system error, NCCL version 21.2.9
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
I'm running on SLURM 21.08.5, NVIDIA drivers are 510.47.03; nemo:1.8.0 was built by cloning the repo into pytorch:22.03-py3 and running reinstall.sh.
Can you provide the commands you were using to run the script, and any config changes? Are there any extra error logs on the other node(s)?
Also, I believe PyTorch Lightning has some trouble if the nodes have a different number of GPUs. Can you check if this is the case?
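In case it helps, here is a quick way to compare the GPU configuration SLURM sees on each node (a sketch; adjust to your partition and job setup):
# GRES (GPU) configuration reported per node by SLURM.
sinfo -N -o "%N %G"
# GPU count actually visible inside the allocation, one task per node.
srun --ntasks-per-node=1 bash -c 'echo "$(hostname): $(nvidia-smi -L | wc -l) GPUs"'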
This specific trace comes from a cluster with 2 nodes, both with 8 GPUs. The nodes differ in network, CPU, RAM and GPU type. On this cluster I have also tried running ASR multi-node multi-GPU training, and the result is the same.
The command I am running is as follows:
export NCCL_IB_DISABLE=1
export NCCL_DEBUG=INFO
# train
python examples/tts/fastpitch_finetune.py \
--config-path=conf \
--config-name=fastpitch_align_v1.05 \
exp_manager.name=fastpitch \
+exp_manager.exp_dir=/experiments \
+exp_manager.version=20220421-0818 \
exp_manager.resume_if_exists=true \
exp_manager.resume_ignore_no_checkpoint=true \
+exp_manager.checkpoint_callback_params.save_best_model=true \
train_dataset=/data/manifests/manifest_v4_train.json \
validation_datasets=/data/manifests/manifest_v4_valid.json \
sup_data_path=/data/sup/v4 \
~phoneme_dict_path \
~heteronyms_path \
~whitelist_path \
~model.text_normalizer \
~model.text_normalizer_call_kwargs \
~model.text_tokenizer \
+model.text_tokenizer='{_target_:nemo.collections.tts.torch.tts_tokenizers.EnglishCharsTokenizer,apostrophe:true,pad_with_space:true}' \
model.train_ds.dataloader_params.batch_size=24 \
model.train_ds.dataloader_params.num_workers=4 \
model.validation_ds.dataloader_params.batch_size=24 \
model.validation_ds.dataloader_params.num_workers=4 \
model.optim.name=adamw \
trainer.devices=-1 \
trainer.num_nodes=2 \
trainer.precision=32
and this is the corresponding log output
[NeMo W 2022-04-21 08:18:55 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
[NeMo W 2022-04-21 08:18:56 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-04-21 08:18:57 fastpitch_finetune:27] You are using an optimizer scheduler while finetuning. Are you sure this is intended?
[NeMo W 2022-04-21 08:18:57 fastpitch_finetune:29] The recommended learning rate for finetuning is 2e-4
Multiprocessing is handled by SLURM.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[NeMo E 2022-04-21 08:18:57 exp_manager:368] You are running multi-node training without SLURM handling the processes. Please note that this is not tested in NeMo and could result in errors.
[NeMo W 2022-04-21 08:18:57 exp_manager:409] There was no checkpoint folder at checkpoint_dir :/experiments/fastpitch/20220421-0818/checkpoints. Training from scratch.
[NeMo I 2022-04-21 08:18:57 exp_manager:281] Experiments will be logged at /experiments/fastpitch/20220421-0818
[NeMo I 2022-04-21 08:18:57 exp_manager:647] TensorboardLogger has been set up
[NeMo W 2022-04-21 08:18:57 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:2302: LightningDeprecationWarning: `Trainer.weights_save_path` has been deprecated in v1.6 and will be removed in v1.8.
rank_zero_deprecation("`Trainer.weights_save_path` has been deprecated in v1.6 and will be removed in v1.8.")
Created a temporary directory at /tmp/tmpo20qduxs
Writing /tmp/tmpo20qduxs/_remote_module_non_sriptable.py
[NeMo I 2022-04-21 08:18:57 data:173] Loading dataset from /data/manifests/manifest_v4_train.json.
[NeMo I 2022-04-21 08:18:58 data:207] Loaded dataset with 10322 files.
[NeMo I 2022-04-21 08:18:58 data:209] Dataset contains 16.66 hours.
[NeMo I 2022-04-21 08:18:58 data:297] Pruned 0 files. Final dataset contains 10322 files
[NeMo I 2022-04-21 08:18:58 data:299] Pruned 0.00 hours. Final dataset contains 16.66 hours.
[NeMo I 2022-04-21 08:18:58 data:173] Loading dataset from /data/manifests/manifest_v4_valid.json.
0it [00:00, ?it/s]
4055it [00:00, 40546.46it/s]
8221it [00:00, 41197.27it/s]
10322it [00:00, 41361.45it/s]
[NeMo I 2022-04-21 08:18:58 data:207] Loaded dataset with 543 files.
[NeMo I 2022-04-21 08:18:58 data:209] Dataset contains 0.84 hours.
[NeMo I 2022-04-21 08:18:58 data:297] Pruned 0 files. Final dataset contains 543 files
[NeMo I 2022-04-21 08:18:58 data:299] Pruned 0.00 hours. Final dataset contains 0.84 hours.
[NeMo I 2022-04-21 08:18:58 features:259] PADDING: 1
[NeMo I 2022-04-21 08:18:58 features:276] STFT using torch
0it [00:00, ?it/s]
543it [00:00, 34682.67it/s]
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Created a temporary directory at /tmp/tmpmy42n7nr
Writing /tmp/tmpmy42n7nr/_remote_module_non_sriptable.py
0it [00:00, ?it/s]
3771it [00:00, 37706.53it/s]
7542it [00:00, 37416.53it/s]
10322it [00:00, 37234.35it/s]
0it [00:00, ?it/s]
543it [00:00, 22735.51it/s]
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Added key: store_based_barrier_key:1 to store for rank: 1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
gpu01:596387:596387 [0] NCCL INFO Bootstrap : Using enp225s0f1:10.10.10.2<0>
gpu01:596387:596387 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v5 symbol.
gpu01:596387:596387 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol.
gpu01:596387:596387 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
gpu01:596387:596387 [0] NCCL INFO P2P plugin IBext
gpu01:596387:596387 [0] NCCL INFO NET/IB : Using [0]mlx5_9:1/RoCE ; OOB enp225s0f1:10.10.10.2<0>
gpu01:596387:596387 [0] NCCL INFO Using network IBext
NCCL version 2.12.9+cuda11.6
gpu02:693036:693036 [0] NCCL INFO Bootstrap : Using eno1:10.10.10.3<0>
gpu02:693036:693036 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v5 symbol.
gpu02:693036:693036 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol.
gpu02:693036:693036 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
gpu02:693036:693036 [0] NCCL INFO P2P plugin IBext
gpu02:693036:693036 [0] NCCL INFO NET/IB : No device found.
gpu02:693036:693036 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
gpu02:693036:693036 [0] NCCL INFO NET/Socket : Using [0]eno1:10.10.10.3<0> [1]vethd257813:fe80::7801:5ff:fe18:2cbc%vethd257813<0> [2]veth3fb8ef2:fe80::dc58:96ff:fe78:44b%veth3fb8ef2<0>
gpu02:693036:693036 [0] NCCL INFO Using network Socket
gpu02:693036:693058 [0] NCCL INFO Setting affinity for GPU 0 to 03,00000003
gpu01:596387:597126 [0] NCCL INFO PXN Disabled as plugin is v4
gpu01:596387:597126 [0] NCCL INFO Channel 00/02 : 0 1
gpu01:596387:597126 [0] NCCL INFO Channel 01/02 : 0 1
gpu01:596387:597126 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
gpu02:693036:693058 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
gpu01:596387:597126 [0] NCCL INFO Channel 00/0 : 1[1b000] -> 0[7000] [receive] via NET/IBext/0
gpu02:693036:693058 [0] NCCL INFO Channel 00/0 : 0[7000] -> 1[1b000] [receive] via NET/Socket/1
gpu02:693036:693058 [0] NCCL INFO Channel 01/0 : 0[7000] -> 1[1b000] [receive] via NET/Socket/1
gpu02:693036:693058 [0] NCCL INFO Channel 00/0 : 1[1b000] -> 0[7000] [send] via NET/Socket/1
gpu02:693036:693058 [0] NCCL INFO Channel 01/0 : 1[1b000] -> 0[7000] [send] via NET/Socket/1
gpu01:596387:597126 [0] NCCL INFO Channel 01/0 : 1[1b000] -> 0[7000] [receive] via NET/IBext/0
gpu01:596387:597126 [0] NCCL INFO Channel 00/0 : 0[7000] -> 1[1b000] [send] via NET/IBext/0
gpu01:596387:597126 [0] NCCL INFO Channel 01/0 : 0[7000] -> 1[1b000] [send] via NET/IBext/0
gpu01:596387:597129 [0] ../include/socket.h:407 NCCL WARN Connect to fe80::7801:5ff:fe18:2cbc%ibp75s0<45553> failed : Network is unreachable
gpu01:596387:597129 [0] NCCL INFO ib_plugin.c:266 -> 2
gpu01:596387:597129 [0] NCCL INFO include/net.h:25 -> 2
gpu01:596387:597129 [0] NCCL INFO transport/net.cc:515 -> 2
gpu01:596387:597129 [0] NCCL INFO proxy.cc:914 -> 2
gpu01:596387:597129 [0] NCCL INFO proxy.cc:942 -> 2
gpu01:596387:597129 [0] proxy.cc:1042 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 2
gpu01:596387:597129 [0] ../include/socket.h:407 NCCL WARN Connect to fe80::7801:5ff:fe18:2cbc%ibp75s0<45553> failed : Network is unreachable
gpu01:596387:597129 [0] NCCL INFO ib_plugin.c:266 -> 2
gpu01:596387:597129 [0] NCCL INFO include/net.h:25 -> 2
gpu01:596387:597129 [0] NCCL INFO transport/net.cc:515 -> 2
gpu01:596387:597129 [0] NCCL INFO proxy.cc:914 -> 2
gpu01:596387:597129 [0] proxy.cc:1042 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 2
gpu01:596387:597126 [0] misc/socket.cc:523 NCCL WARN Net : Connection closed by remote peer gpu01<52753>
gpu01:596387:597126 [0] NCCL INFO misc/socket.cc:531 -> 2
gpu01:596387:597126 [0] NCCL INFO misc/socket.cc:543 -> 2
gpu01:596387:597126 [0] NCCL INFO proxy.cc:805 -> 2
gpu01:596387:597126 [0] proxy.cc:808 NCCL WARN Proxy Call to rank 0 failed (Connect)
gpu01:596387:597126 [0] NCCL INFO transport/net.cc:269 -> 2
gpu01:596387:597126 [0] NCCL INFO transport.cc:127 -> 2
gpu01:596387:597126 [0] NCCL INFO init.cc:730 -> 2
gpu01:596387:597126 [0] NCCL INFO init.cc:915 -> 2
gpu01:596387:597126 [0] NCCL INFO group.cc:57 -> 2 [Async thread]
Error executing job with overrides: ['exp_manager.name=fastpitch', '+exp_manager.exp_dir=/experiments', '+exp_manager.version=20220421-0818', 'exp_manager.resume_if_exists=true', 'exp_manager.resume_ignore_no_checkpoint=true', '+exp_manager.checkpoint_callback_params.save_best_model=true', 'train_dataset=/data/manifests/manifest_v4_train.json', 'validation_datasets=/data/manifests/manifest_v4_valid.json', 'sup_data_path=/data/sup/v4', '~phoneme_dict_path', '~heteronyms_path', '~whitelist_path', '~model.text_normalizer', '~model.text_normalizer_call_kwargs', '~model.text_tokenizer', '+model.text_tokenizer={_target_:nemo.collections.tts.torch.tts_tokenizers.EnglishCharsTokenizer,apostrophe:true,pad_with_space:true}', 'model.train_ds.dataloader_params.batch_size=24', 'model.train_ds.dataloader_params.num_workers=4', 'model.validation_ds.dataloader_params.batch_size=24', 'model.validation_ds.dataloader_params.num_workers=4', 'model.optim.name=adamw', 'trainer.devices=-1', 'trainer.num_nodes=2', 'trainer.precision=32']
Traceback (most recent call last):
File "examples/tts/fastpitch_finetune.py", line 41, in <module>
main() # noqa pylint: disable=no-value-for-parameter
File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
_run_hydra(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
run_and_report(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
raise ex
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
lambda: hydra.run(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
_ = ret.return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "examples/tts/fastpitch_finetune.py", line 37, in main
trainer.fit(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
self._call_and_handle_interrupt(
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1170, in _run
self.__setup_profiler()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1795, in __setup_profiler
self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 2232, in log_dir
dirpath = self.strategy.broadcast(dirpath)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 311, in broadcast
torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1872, in broadcast_object_list
broadcast(object_sizes_tensor, src=src, group=group)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1188, in broadcast
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled system error, NCCL version 21.2.9
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Multi-node multi-GPU training works if I set the following NCCL env variables:
export NCCL_IB_DISABLE=1
export NCCL_IBEXT_DISABLE=1
The same can be achieved by setting only
export NCCL_NET=Socket
but with NCCL 2.12.9+cuda11.6 this leads to a segfault, so until the fix discussed in https://github.com/NVIDIA/nccl/issues/676#issuecomment-1106254219 gets merged, the first solution is preferred.
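For completeness, this is roughly where the workaround sits in the batch script passed to SLURM (a sketch only; the resource directives and remaining overrides are placeholders for whatever your setup uses):
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

# Workaround: disable the IB and IBext transports so both nodes fall back to plain sockets.
export NCCL_IB_DISABLE=1
export NCCL_IBEXT_DISABLE=1
export NCCL_DEBUG=INFO   # keep the debug output while this is being investigated

srun python examples/tts/fastpitch_finetune.py --config-path=conf --config-name=fastpitch_align_v1.05 model.optim.name=adamw trainer.num_nodes=2 trainer.devices=-1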
I'm reopening this issue just as a reminder that training still crashes when using model.optim.name=lamb.
Can you try running with CUDA_LAUNCH_BLOCKING=1? It may give a more informative stack trace.
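That is, something along these lines (a sketch; the rest of the command is unchanged):
# Serialize CUDA kernel launches so the failing kernel is reported at its real call site.
CUDA_LAUNCH_BLOCKING=1 python examples/tts/fastpitch_finetune.py --config-path=conf --config-name=fastpitch_align_v1.05 model.optim.name=lamb trainer.num_nodes=2 trainer.devices=-1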
Unfortunately, there's not much more information:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
results = self._run_stage()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
return self._run_train()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1351, in _run_train
self.fit_loop.run()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 269, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
batch_output = self.batch_loop.run(batch, batch_idx)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 203, in advance
result = self._run_optimization(
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 256, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 369, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1593, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1625, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 278, in optimizer_step
optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 193, in optimizer_step
return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 155, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/apex/optimizers/fused_lamb.py", line 124, in step
g_norm_32 = multi_tensor_applier(self.multi_tensor_l2norm,
File "/opt/conda/lib/python3.8/site-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__
return op(self.chunk_size,
RuntimeError: CUDA error: an illegal memory access was encountered
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "examples/tts/fastpitch_finetune.py", line 41, in <module>
main() # noqa pylint: disable=no-value-for-parameter
File "/workspace/nemo/nemo/core/config/hydra_runner.py", line 104, in wrapper
_run_hydra(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
run_and_report(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
raise ex
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
lambda: hydra.run(
File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 111, in run
_ = ret.return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "examples/tts/fastpitch_finetune.py", line 37, in main
trainer.fit(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
self._call_and_handle_interrupt(
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 736, in _call_and_handle_interrupt
self._teardown()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1298, in _teardown
self.strategy.teardown()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 474, in teardown
self.lightning_module.cpu()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/mixins/device_dtype_mixin.py", line 147, in cpu
return super().cpu()
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 719, in cpu
return self._apply(lambda t: t.cpu())
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 602, in _apply
param_applied = fn(param)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 719, in <lambda>
return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: an illegal memory access was encountered
While testing I have, however, also discovered that if I add a small sleep to the script that is passed to SLURM, immediately before calling python, it suddenly starts working. The sleep is hardware dependent: on 2x A5000 GPUs 5 s were enough, while on 2x A100 SXM4 GPUs 15 s were required.
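Concretely, in the batch script it is nothing more than this (a sketch; the duration is the hardware-dependent value mentioned above):
# Empirically, a short pause before launching the training process avoids the crash.
# 5 s was enough on 2x A5000; 15 s was needed on 2x A100 SXM4.
sleep 15
python examples/tts/fastpitch_finetune.py --config-path=conf --config-name=fastpitch_align_v1.05 model.optim.name=lamb trainer.num_nodes=2 trainer.devices=-1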
Btw, are these sections of code still valid? https://github.com/NVIDIA/NeMo/blob/d9a329edf82ffed2cdc3ec1ca00539e90cbe7e3f/nemo/core/classes/modelPT.py#L486-L500 and https://github.com/NVIDIA/NeMo/blob/d9a329edf82ffed2cdc3ec1ca00539e90cbe7e3f/nemo/collections/tts/helpers/helpers.py#L93-L104
A recent change in PTL flags (https://github.com/NVIDIA/NeMo/pull/3589) renamed accelerator (dp, ddp, ddp2, ...) to strategy. Since then, accelerator holds gpu, cpu, ...
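In terms of the overrides used in this thread, the new-style flags look roughly like this (a sketch; dataset and model overrides omitted):
# After the rename: the distributed mode goes to `strategy`, the device type to `accelerator`
# (previously something like trainer.accelerator=ddp).
python examples/tts/fastpitch_finetune.py \
  --config-path=conf \
  --config-name=fastpitch_align_v1.05 \
  trainer.accelerator=gpu \
  trainer.strategy=ddp \
  trainer.devices=-1 \
  trainer.num_nodes=2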
I have managed to get a little bit more of the trace. Just by adding trainer.num_sanity_val_steps=0, the stack trace will reliably end with:
# ... the initial portion of the trace is the same as before
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 602, in _apply
param_applied = fn(param)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 719, in <lambda>
return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp:1230 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f249b784ecc in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1fa26 (0x7f249b7dca26 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x247 (0x7f249b7e2ec7 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x4392cc (0x7f24dbb7a2cc in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f249b76dff5 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x3360f9 (0x7f24dba770f9 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x63c9e2 (0x7f24dbd7d9e2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2f5 (0x7f24dbd7dd65 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #25: __libc_start_main + 0xf3 (0x7f2512d4d0b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Can you try with just CPU? If it works at that point, then it's easier to debug.
Unfortunately this is not possible, as model.optim.name=lamb, as set in fastpitch_align_v1.05.yaml, is GPU-only (https://github.com/NVIDIA/apex/blob/22.03/apex/optimizers/fused_lamb.py). With adamw there is no error.
Hmm, you could change the optimizer to something that runs on CPU to see if it works (or does Adam already work?). If so, then it's probably an issue in the apex lamb call?
As said, changing the optimizer to adam works, and so does adding a delay before starting the individual processes.
It is my assumption too that it is related to the lamb call, but I don't know enough details to pinpoint it.
Hmm, adding a delay works... It's kinda hard to debug, because it might be something with the setup, the libraries, CUDA or so many other factors, and this debug log doesn't really say much about the source of the issue.
For now, I'd recommend using Adam as a replacement or applying the delay. It's possibly just an issue with apex or some race condition during DDP, but it's not trivial to debug.
I've been talking to @nithinraok and it's possible that this could be caused by the PTL flags you mentioned. Can you check if the changes in #4056 help at all?
No, this does not help.
Could you post your git hash for apex, and if it's not the latest (or the one recommended in the README), could you upgrade your apex?
As far as I'm aware multi-GPU FastPitch hasn't been fully tested, so we'll have to take a closer look at this. In the meantime you may have to keep using the sleep workaround, or switch to adam until we get this resolved.
It looks like there may be an issue with distributed lamb in apex in the 22.03 PyTorch container (NVIDIA/apex#1354), just in case you might be on that version.
@titu1994 I built nemo:1.8.1 starting from pytorch:22.03-py3: I pulled NeMo from GitHub, checked out tag v1.8.1 and called reinstall.sh. I did not install apex; it comes preinstalled in the pytorch:22.03-py3 container, and nemo/reinstall.sh does not update it.
I cannot see the git hash. Checking the apex version via Python shows no information, and only scarce info is provided in its PKG-INFO:
root@gpu01:/workspace# python3 -c "import apex; print(apex.__version__)"
Traceback (most recent call last):
File "<string>", line 1, in <module>
AttributeError: module 'apex' has no attribute '__version__'
root@gpu01:/workspace# cat /opt/conda/lib/python3.8/site-packages/apex-0.1-py3.8.egg-info/PKG-INFO
Metadata-Version: 2.1
Name: apex
Version: 0.1
Summary: PyTorch Extensions written by NVIDIA
Home-page: UNKNOWN
License: UNKNOWN
Platform: UNKNOWN
License-File: LICENSE
UNKNOWN
@redoctopus thanks, it must be this bug then. FYI: by changing the optimiser to adamw, at least in our case, FastPitch seems to work fine both in a multi-GPU as well as a multi-node multi-GPU setting.
Can you try following the steps below (https://github.com/NVIDIA/NeMo#megatron-gpt) to install this git commit of apex? Maybe it solves it, but otherwise @redoctopus' suggestion is good enough for now while we debug this.
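For reference, the general pattern from that README section is roughly the following (a sketch; <commit> stands for the exact hash given in the linked Megatron-GPT instructions):
# Install apex from source at a pinned commit, building the CUDA extensions.
git clone https://github.com/NVIDIA/apex
cd apex
git checkout <commit>   # placeholder: use the commit hash from the NeMo README
pip install -v --disable-pip-version-check --no-cache-dir \
    --global-option="--cpp_ext" --global-option="--cuda_ext" ./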
No, installing apex does not help; it must be the PyTorch issue.
Unfortunately, updating to pytorch:22.04-py3 does not resolve the issue.
After some discussion, we've concluded that adam makes more sense as a default optimizer for FastPitch anyway, so I'd suggest sticking with that for now. I'll push a change to switch it from lamb in the config soon.
We'll still try to resolve this bug since it shouldn't be crashing with lamb anyway. I've been able to reproduce the error locally.
We think it's likely an apex issue (and lamb tends to be unstable), so we're going to close this for now. We'll keep an eye on it in the future, but our recommendation is to use adamw for FastPitch, UnivNet, and Mixer-TTS training.
When I try to run examples/tts/fastpitch_finetune.py with fastpitch_align_v1.05.yaml and ddp on multiple GPUs, the training immediately crashes with the trace reported above. This does not happen if I run on a single GPU.