NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

RuntimeError: shape '[16, -1, 128]' is invalid for input of size 113920 #7203

Closed hruturajnikam closed 11 months ago

hruturajnikam commented 1 year ago

@titu1994 @meghmak13 Dear Team, I am training a Conformer-based SSL model with the following parameters configured:

min_duration: 3.2
patch_size: 16
mask_patches: 0.5
num_negatives: 40
sample_from_same_utterance_only: true
sample_from_non_masked: false
codebook_size: 300
num_groups: 2
num_classes: 90000

but I am getting the following error:

  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1129, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 987, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1129, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 82, in forward
    output = self.module.training_step(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/nemo/utils/model_utils.py", line 364, in wrap_training_step
    output_dict = wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/asr/models/ssl_models.py", line 468, in training_step
    loss_value, loss_val_dict = self.decoder_loss_step(
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/asr/models/ssl_models.py", line 450, in decoder_loss_step
    current_loss_value = current_loss(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1129, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/nemo/core/classes/common.py", line 963, in __call__
    outputs = wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/asr/losses/ssl_losses/contrastive.py", line 187, in forward
    out_masked_only = out_masked_only.reshape(bs, -1, out_masked_only.shape[-1])
RuntimeError: shape '[16, -1, 128]' is invalid for input of size 113920
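For reference, the numbers in the error message can be checked by hand. A reshape to `[16, -1, 128]` only works if the element count is a multiple of 16 × 128 = 2048. Here 113920 = 890 × 128, and 890 rows cannot be split evenly across a batch of 16. The sketch below is plain Python arithmetic, not NeMo code; the helper name `reshape_check` is hypothetical. It only confirms why the reshape fails. It does not establish the root cause, although an uneven number of masked steps across the batch would be consistent with it (an assumption, not confirmed in this thread).

```python
def reshape_check(total_elements, batch_size, feature_dim):
    """Check whether reshape(batch_size, -1, feature_dim) is possible.

    Hypothetical helper for illustration only; not part of NeMo or PyTorch.
    Returns (is_valid, rows, rows_per_item, leftover_rows).
    """
    # Number of feature vectors of length feature_dim in the flat input.
    rows, leftover_elements = divmod(total_elements, feature_dim)
    # Those rows must divide evenly across the batch dimension.
    rows_per_item, leftover_rows = divmod(rows, batch_size)
    is_valid = leftover_elements == 0 and leftover_rows == 0
    return is_valid, rows, rows_per_item, leftover_rows


# Values taken from the error message above.
valid, rows, per_item, leftover = reshape_check(113920, 16, 128)
print(valid, rows, per_item, leftover)  # False 890 55 10
```

With 890 rows and a batch of 16, each item would get 55 rows with 10 left over, so the `-1` dimension cannot be inferred.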

I am using the NeMo 22.05 image (NGC) and 48 A100-SXM4-40GB GPUs, i.e. 6 nodes, for the training. Could you please suggest what needs to be done? Thanks in advance.

hruturajnikam commented 1 year ago

@sam1373 @titu1994 Are there any updates on this issue?

github-actions[bot] commented 11 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 11 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.