Describe the bug
speaker_reco_infer.py loads the model and manifest files and then breaks. I guess it's a PyTorch issue again? I wanted to use the model from yesterday:
```
[NeMo W 2021-09-17 12:18:42 patch_utils:49] torch.stft() signature has been updated for PyTorch 1.7+
    Please update PyTorch to remain compatible with later versions of NeMo.
RuntimeError: Argument #4: Padding size should be less than the corresponding input dimension, but got: padding (256, 256) at dimension 2 of input [1, 1, 2]
```
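For what it's worth, the RuntimeError itself can be reproduced outside NeMo: with `center=True` (the default), `torch.stft` reflect-pads the signal by `n_fft // 2` on each side, which fails whenever the signal slice is shorter than the pad. So the log means only 2 audio samples reached the STFT. A minimal sketch (`n_fft=512` is my assumption, chosen to match the `(256, 256)` padding in the log):

```python
import torch

# A batch of 1 with a 2-sample "utterance", mimicking the [1, 1, 2] input
# shape from the traceback. Reflect padding of 256 per side cannot be
# applied to a length-2 signal, so torch.stft raises the same RuntimeError.
signal = torch.zeros(1, 2)
try:
    torch.stft(signal, n_fft=512, return_complex=True)
except RuntimeError as e:
    print(e)  # padding-size error at dimension 2
```

This points at the manifest/offset handling producing a (near-)empty audio slice, rather than at the model itself.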
Steps/Code to reproduce bug
Container: nvcr.io/nvidia/nemo:1.2.0
but inside it I installed:

```
python -m pip install git+https://github.com/NVIDIA/NeMo.git@'main'
python -m pip install pytorch_lightning==1.4.2
```

(the combination used and working in the Google Colab; I also tried NeMo 1.2 with pytorch-lightning 1.3.8, NeMo 1.3, and the recent 1.4.7 later on)
run: https://github.com/NVIDIA/NeMo/blob/48fe9e69feba7651694fd6ae0a096a0655ed601c/examples/speaker_tasks/recognition/speaker_reco_infer.py
with the model and train.json from:
https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb
added: test.json and bonian.wav, with the manifest entry:

```
{"audio_filepath": "bonian.wav", "offset": 0, "duration": 11.370666666666667, "label": ""}
```
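In case it helps triage, a small manifest sanity check like the following would catch entries whose offset/duration slice runs past the end of the file or is shorter than one STFT window. This is only a sketch: `MIN_SAMPLES = 512` is an assumed window size, and the stdlib `wave` module only reads PCM WAV files.

```python
import json
import wave

MIN_SAMPLES = 512  # assumed minimum slice length for one STFT window


def check_manifest(manifest_path):
    """Return a list of (line_number, problem) for suspicious manifest entries."""
    problems = []
    with open(manifest_path) as f:
        for line_no, line in enumerate(f, 1):
            entry = json.loads(line)
            with wave.open(entry["audio_filepath"]) as w:
                sample_rate = w.getframerate()
                file_seconds = w.getnframes() / sample_rate
            start = entry.get("offset", 0)
            end = start + entry["duration"]
            n_samples = int(entry["duration"] * sample_rate)
            if end > file_seconds:
                problems.append((line_no, f"slice ends at {end:.2f}s, file is {file_seconds:.2f}s"))
            if n_samples < MIN_SAMPLES:
                problems.append((line_no, f"only {n_samples} samples in slice"))
    return problems
```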
Expected behavior
The script runs inference on test.json without crashing. :-)
Environment overview (please complete the following information)
docker pull & docker run commands used
Environment details
If NVIDIA docker image is used you don't need to specify these. Otherwise, please provide:
Additional context
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speaker_recognition/results.html
A bit more explanation of the inference part would be nice, like a link to the script I was using here, and also guidance on how to use the embeddings created at the end of the Jupyter notebook.
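On the last point: a typical use of such speaker embeddings is cosine-similarity verification. A minimal sketch (not NeMo API, just plain NumPy on the embedding vectors the notebook produces):

```python
import numpy as np


def cosine_similarity(emb_a, emb_b):
    """Cosine similarity between two speaker embedding vectors."""
    a = np.asarray(emb_a, dtype=float).ravel()
    b = np.asarray(emb_b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Same-speaker pairs should score close to 1 and different speakers lower; a decision threshold would be tuned on held-out verification pairs.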