FlyToYourMooN closed this issue 1 year ago
I'm sorry that the inference code is somewhat hard to read (especially the way edited_word_idx and changed_idx are decided). Maybe you can tell me your text, edited_text, edited_word_idx, changed_idx, and the audio you want to edit, to help us find the bug.
We will publish a more user-friendly script to infer with one example in a few days.
1. I use inference/tts/spec_denoiser.py. I tried to input an audio file and the corresponding text as an example:
2. When doing speech editing, I need to use MFA to get the corresponding durations. But when I execute the mfa align command, an error occurs.
3. I tried to skip this step, because before training each audio file in the dataset already has a corresponding TextGrid file, so I can use that TextGrid file directly.
4. So I commented out the mfa align command and loaded the audio's TextGrid file directly, but I got an error in the step that computes mel2ph and mel2word.
Maybe I made a mistake somewhere in the operation. Thank you very much for your work. The diffusion model shows excellent performance in batch inference! I look forward to your inference script for one sample. This code really makes me dizzy :confounded:
In step 4 of your reply, I think the error may happen in the process of constructing a new mel2ph for the operation of replacing "Six spoons ..." with "Nine spoons ...". A moment ago, I successfully ran the same example, p236_003, with the same replacement operation. Maybe you can set edited_word_idx = 2 and changed_idx = [1,3] and re-run the script as a temporary solution. (The numbering starts from 1 and "Six" is 2. Besides, the space token also counts.)
Ha-ha-ha, sorry again for making you dizzy. We will inform you immediately when we publish our new inference script for one sample.
I got it! Thank you very much!
@FlyToYourMooN Hello, I have pushed an easy inference example in our repo! Sorry to keep you waiting!
Thanks for your excellent work, but I encountered an error when trying to run inference with python inference/tts/spec_denoiser.py:
Traceback (most recent call last):
  File "inference/tts/spec_denoiser.py", line 272, in <module>
    StutterSpeechInfer.example_run()
  File "inference/tts/spec_denoiser.py", line 259, in example_run
    wav_out, wav_gt, mel_out, mel_gt, masked_mel_out, masked_mel_gt = infer_ins.infer_once(inp)
  File "/data3/liukaiyang/Speech-Editing-Toolkit/inference/tts/base_tts_infer.py", line 97, in infer_once
    output = self.forward_model(inp)
  File "inference/tts/spec_denoiser.py", line 119, in forward_model
    output = self.model(edited_txt_tokens, time_mel_masks=time_mel_masks, mel2ph=edited_mel2ph, spk_embed=sample['spk_embed'],
  File "/root/anaconda3/envs/LKYBase/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data3/liukaiyang/Speech-Editing-Toolkit/modules/speech_editing/spec_denoiser/spec_denoiser.py", line 159, in forward
    ret = self.fs(txt_tokens, time_mel_masks, mel2ph, spk_embed, f0, uv, energy,
  File "/root/anaconda3/envs/LKYBase/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data3/liukaiyang/Speech-Editing-Toolkit/modules/speech_editing/spec_denoiser/fs.py", line 92, in forward
    mel2ph = self.forward_dur(dur_inp, time_mel_masks, mel2ph, txt_tokens, ret, use_pred_mel2ph=use_pred_mel2ph)
  File "/data3/liukaiyang/Speech-Editing-Toolkit/modules/speech_editing/spec_denoiser/fs.py", line 137, in forward_dur
    masked_dur_gt = mel2token_to_dur(mel2ph*(1-time_mel_masks).squeeze(-1).long(), T) * nonpadding
  File "/data3/liukaiyang/Speech-Editing-Toolkit/utils/audio/align.py", line 85, in mel2token_to_dur
    dur = mel2token.new_zeros(B, T_txt + 1).scatter_add(1, mel2token, torch.ones_like(mel2token))
RuntimeError: index 86 is out of bounds for dimension 1 with size 86
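For anyone reading the traceback above: the last frame allocates a buffer of width T_txt + 1 and scatters frame counts into it using the values of mel2token as indices. A width of 86 would mean T_txt = 85, so a value of 86 in mel2ph points one token past the end of the text, which suggests that the edited mel2ph and the edited text tokens disagree in length. The following is only a minimal sketch of that counting logic in plain Python, not the repository's PyTorch code. It assumes 1-based token indices with 0 reserved for padding or masked frames, and it assumes slot 0 is dropped from the result; the function name is reused purely for illustration.

```python
from typing import List


def mel2token_to_dur(mel2token: List[int], t_txt: int) -> List[int]:
    """Count how many mel frames are assigned to each token.

    Assumption: token indices are 1-based and 0 marks padding or
    masked frames, so the buffer needs t_txt + 1 slots (0..t_txt).
    """
    dur = [0] * (t_txt + 1)
    for idx in mel2token:
        # Any index above t_txt has no slot in the buffer; this is the
        # situation the RuntimeError in the traceback reports.
        if not 0 <= idx <= t_txt:
            raise IndexError(
                f"index {idx} is out of bounds for dimension 1 "
                f"with size {t_txt + 1}"
            )
        dur[idx] += 1
    # Assumption: slot 0 (padding/masked frames) is not a real token.
    return dur[1:]
```

With this sketch, calling mel2token_to_dur([1, 86], 85) raises an IndexError whose message matches the one in the traceback. Comparing the largest value in the edited mel2ph with the number of edited text tokens before calling the model may therefore help locate the mismatch.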