an error when trying to infer with spec_denoiser.py

FlyToYourMooN commented 1 year ago

Thanks for your excellent work. I encountered an error when running inference with python inference/tts/spec_denoiser.py:

Traceback (most recent call last):
  File "inference/tts/spec_denoiser.py", line 272, in <module>
    StutterSpeechInfer.example_run()
  File "inference/tts/spec_denoiser.py", line 259, in example_run
    wav_out, wav_gt, mel_out, mel_gt, masked_mel_out, masked_mel_gt = infer_ins.infer_once(inp)
  File "/data3/liukaiyang/Speech-Editing-Toolkit/inference/tts/base_tts_infer.py", line 97, in infer_once
    output = self.forward_model(inp)
  File "inference/tts/spec_denoiser.py", line 119, in forward_model
    output = self.model(edited_txt_tokens, time_mel_masks=time_mel_masks, mel2ph=edited_mel2ph, spk_embed=sample['spk_embed'],
  File "/root/anaconda3/envs/LKYBase/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data3/liukaiyang/Speech-Editing-Toolkit/modules/speech_editing/spec_denoiser/spec_denoiser.py", line 159, in forward
    ret = self.fs(txt_tokens, time_mel_masks, mel2ph, spk_embed, f0, uv, energy,
  File "/root/anaconda3/envs/LKYBase/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data3/liukaiyang/Speech-Editing-Toolkit/modules/speech_editing/spec_denoiser/fs.py", line 92, in forward
    mel2ph = self.forward_dur(dur_inp, time_mel_masks, mel2ph, txt_tokens, ret, use_pred_mel2ph=use_pred_mel2ph)
  File "/data3/liukaiyang/Speech-Editing-Toolkit/modules/speech_editing/spec_denoiser/fs.py", line 137, in forward_dur
    masked_dur_gt = mel2token_to_dur(mel2ph*(1-time_mel_masks).squeeze(-1).long(), T) * nonpadding
  File "/data3/liukaiyang/Speech-Editing-Toolkit/utils/audio/align.py", line 85, in mel2token_to_dur
    dur = mel2token.new_zeros(B, T_txt + 1).scatter_add(1, mel2token, torch.ones_like(mel2token))
RuntimeError: index 86 is out of bounds for dimension 1 with size 86
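For anyone hitting the same message: the last frame of the traceback accumulates per-token durations into a buffer of size T_txt + 1, so "index 86 ... with size 86" suggests that some mel2ph entry points at token 86 while the text has only 85 tokens. The snippet below is a minimal pure-Python sketch of that counting step, not the toolkit's actual code. The function name mirrors the one in the traceback, but the list-based signature and the explicit bounds check are assumptions made for illustration.

```python
def mel2token_to_dur(mel2token, t_txt):
    """Count how many mel frames are assigned to each text token.

    Illustrative stand-in for the scatter_add call in the traceback.
    mel2token holds one token index per mel frame; indices are assumed
    to be 1-based, with 0 reserved for padding frames.
    """
    dur = [0] * (t_txt + 1)  # slot 0 collects padding frames
    for tok in mel2token:
        if not 0 <= tok <= t_txt:
            # Same condition that makes scatter_add fail in the traceback.
            raise IndexError(
                f"index {tok} is out of bounds for dimension 1 "
                f"with size {t_txt + 1}"
            )
        dur[tok] += 1
    return dur[1:]  # drop the padding slot


def max_token_index(mel2token):
    """Largest token index referenced by the alignment (0 if empty)."""
    return max(mel2token, default=0)
```

Comparing max_token_index(mel2ph) with the number of text tokens before calling the model would show whether the alignment and the token sequence disagree, which is one plausible cause when a TextGrid file and the edited text do not match.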

Zain-Jiang commented 1 year ago

I'm sorry that the inference code is somewhat hard to read (especially the way edited_word_idx and changed_idx are decided). Could you tell me your text, edited_text, edited_word_idx, and changed_idx, along with the audio you want to edit, so we can help find the bug?

We will publish a more user-friendly script to infer with one example in a few days.

FlyToYourMooN commented 1 year ago

1. I use inference/tts/spec_denoiser.py and try to input an audio file and its corresponding text as an example.
2. Speech editing requires MFA to obtain the corresponding durations, but the mfa align command fails with an error.
3. I tried to skip this step: before training, each audio file in the dataset already has a corresponding TextGrid file, so I can use that file directly.
4. So I commented out the mfa align command and loaded the audio's TextGrid file directly, but then I got an error in the step that computes mel2ph and mel2word.

FlyToYourMooN commented 1 year ago

Maybe I made mistakes in the operation. Thank you very much for your work. The diffusion model shows excellent performance in batch inference! I look forward to your inference script for one sample. This code really makes me dizzy :confounded:

Zain-Jiang commented 1 year ago

Regarding step 4 of your reply, I think the error may occur while constructing a new mel2ph for the operation that replaces Six spoons ... with Nine spoons .... A moment ago, I successfully ran the same example, p236_003, with the same replacement. As a temporary solution, you can set edited_word_idx = 2 and changed_idx = [1,3], then re-run the script. (Numbering starts from 1 and the leading token counts as the first word, so the index of the word Six is 2. The space token also counts.)
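To make the indexing rule above concrete, here is a small sketch that lists 1-based positions for a sentence. It is only an illustration under assumptions: the leading token name "<BOS>" is a placeholder, the helper index_words is hypothetical, and treating each space as its own separate position is a guess at how "the space token also counts" is applied; the toolkit's own tokenization may differ.

```python
def index_words(text, lead_token="<BOS>"):
    """Return {1-based index: token} for a sentence.

    Assumptions (not taken from the toolkit): a single leading token
    occupies index 1, and every space between words is counted as a
    separate token.
    """
    tokens = [lead_token]
    for i, word in enumerate(text.split(" ")):
        if i > 0:
            tokens.append(" ")  # the space token also takes an index
        tokens.append(word)
    return {idx: tok for idx, tok in enumerate(tokens, start=1)}


if __name__ == "__main__":
    for idx, tok in index_words("Six spoons").items():
        print(idx, repr(tok))
```

Under these assumptions the word Six lands on index 2, which matches the edited_word_idx = 2 suggested above.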

Ha-ha-ha, sorry again for making you dizzy. We will inform you immediately when we publish our new inference script for one sample.

FlyToYourMooN commented 1 year ago

I got it! Thank you very much!

Zain-Jiang commented 1 year ago

@FlyToYourMooN Hello, I have pushed an easy inference example to our repo! Sorry to keep you waiting!
