NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

RuntimeError: shape '[16, -1, 128]' is invalid for input of size 236544 #5836

Closed sid-mr-im closed 1 year ago

sid-mr-im commented 1 year ago

Description: Whenever pre-training is initiated with the script speech_pre_training.py using the default conformer_ssl.yaml configuration file, the following error is encountered:

```
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 251, in on_advance_end
    self._run_validation()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 310, in _run_validation
    self.val_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
    output = self._evaluation_step(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
    output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 360, in validation_step
    return self.model(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/overrides/base.py", line 110, in forward
    return self._forward_module.validation_step(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/models/ssl_models.py", line 533, in validation_step
    loss_value, _ = self.decoder_loss_step(spectrograms, spec_masks, encoded, encoded_len, targets, target_lengths)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/models/ssl_models.py", line 462, in decoder_loss_step
    current_loss_value = current_loss(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/core/classes/common.py", line 1086, in __call__
    outputs = wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/losses/ssl_losses/contrastive.py", line 187, in forward
    out_masked_only = out_masked_only.reshape(bs, -1, out_masked_only.shape[-1])
RuntimeError: shape '[16, -1, 128]' is invalid for input of size 236544
```

During handling of the above exception, another exception occurred:

```
Traceback (most recent call last):
  File "/home/nemocker/scripts/speech_pre_training.py", line 68, in main
    trainer.fit(asr_model)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 59, in _call_and_handle_interrupt
    trainer.strategy.reconciliate_processes(traceback.format_exc())
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 461, in reconciliate_processes
    raise DeadlockDetectedException(f"DeadLock detected from rank: {self.global_rank} \n {trace}")
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 2
[... the traceback embedded in the exception message repeats the trace above verbatim, ending in the same
RuntimeError: shape '[16, -1, 128]' is invalid for input of size 236544 ...]
```

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
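The failing reshape can be reproduced in isolation. As a sketch of why the numbers in the error cannot work (assuming, for illustration, that `out_masked_only` holds all masked frames pooled across the batch): 236544 / 128 = 1848 masked frames in total, and 1848 is not evenly divisible by the batch size 16, so the frames cannot be split into equal per-utterance groups.

```python
import torch

# Values taken from the error message; the layout of out_masked_only is an
# assumption for illustration, not NeMo's exact code.
bs, emb = 16, 128
n_masked_frames = 236544 // emb          # 1848 masked frames pooled over the batch
out_masked_only = torch.randn(n_masked_frames, emb)

# 1848 / 16 = 115.5, so reshaping into [16, -1, 128] must fail:
try:
    out_masked_only.reshape(bs, -1, emb)
except RuntimeError as err:
    print(err)  # shape '[16, -1, 128]' is invalid for input of size 236544
```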

Please note that the Python version is 3.10.6 and every audio file in the training set has more than 100 frames.

Steps/Code to reproduce bug

After creating a manifest with audio files (each above 100 frames), the default speech_pre_training.py script is run with the default conformer_ssl.yaml configuration.

Environment overview

titu1994 commented 1 year ago

@sam1373 can you take a look ?

sam1373 commented 1 year ago

@sid-mr-im could you tell me what the minimum and maximum durations of files in your dataset are?

sid-mr-im commented 1 year ago

Hi @sam1373, the minimum duration is 0.006375 secs and the maximum duration is 48.82 secs. But since I used the default conformer_ssl.yaml config, most of the audio files were filtered out, because min_duration there is 8.0.

sam1373 commented 1 year ago

Hi @sid-mr-im, so far I can't seem to reproduce this, but it might be related to the difference between the minimum and maximum durations. Could you try to either separate your dataset into several buckets based on duration, or cut out segments of a specific duration during training (see here), and tell me if you still get this issue? Either way, it is recommended not to have this large a spread in durations, since your batches will be padded to the longest utterance.
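As a rough sketch of the first suggestion (the helper name, bucket edges, and output naming below are illustrative, not part of NeMo): a NeMo-style manifest is one JSON object per line with a `duration` field, so it can be split into duration buckets like this.

```python
import json

def bucket_manifest(path, edges=(8.0, 16.0, 32.0)):
    """Split a JSON-lines manifest (each entry has a 'duration' in seconds)
    into len(edges)+1 buckets: <8s, 8-16s, 16-32s, >=32s by default."""
    buckets = [[] for _ in range(len(edges) + 1)]
    with open(path) as f:
        for line in f:
            entry = json.loads(line)
            idx = sum(entry["duration"] >= e for e in edges)
            buckets[idx].append(entry)
    return buckets

def write_buckets(buckets, prefix="train_manifest_bucket"):
    # One manifest per bucket, e.g. train_manifest_bucket0.json, ...
    for i, bucket in enumerate(buckets):
        with open(f"{prefix}{i}.json", "w") as f:
            for entry in bucket:
                f.write(json.dumps(entry) + "\n")
```

Each bucket manifest can then be used for a separate run (or a separate dataset in a multi-manifest setup), so batches are padded only to the longest utterance within a bucket.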

sam1373 commented 1 year ago

Also, if your dataset mostly consists of shorter audios, you can lower the minimum duration/cut out shorter segments, but you will need to either decrease the number of sampled negatives or set sample_from_same_utterance_only to false in the config.
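For reference, these knobs live under the contrastive loss section of the config. A sketch of the relevant fragment (key names assumed to match conformer_ssl.yaml; the values shown are illustrative, not recommendations):

```yaml
model:
  loss_list:
    contrastive:
      loss:
        num_negatives: 40                       # decrease if utterances are short
        sample_from_same_utterance_only: false  # allow negatives from other utterances
```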

sid-mr-im commented 1 year ago

Sure @sam1373! Let me see if that solves the problem.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.