AidanRosen opened this issue 1 year ago
You can use the default inference code to reconstruct the (stutter + adjacent words) region, but you need to define this region manually (set the time_mel_mask so that it covers the stutter).
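For example, a minimal sketch of what defining that region by hand could look like (the helper name, frame rate, and tensor shapes below are assumptions for illustration, not the toolkit's API; the idea is just a frame-level tensor with 1s over the span to regenerate):

import torch

def build_time_mel_mask(num_frames, start_sec, end_sec, frames_per_sec=86.13):
    # Hypothetical helper: mark the stutter plus its adjacent words as the
    # region to reconstruct. frames_per_sec assumes 22050 Hz audio with a
    # hop size of 256 samples; adjust to your actual config.
    mask = torch.zeros(1, num_frames, 1)
    start = max(int(start_sec * frames_per_sec), 0)
    end = min(int(end_sec * frames_per_sec), num_frames)
    mask[:, start:end, :] = 1.0
    return mask

# e.g. regenerate 1.2 s to 2.0 s of an utterance whose mel has T frames:
# time_mel_mask = build_time_mel_mask(T, 1.2, 2.0)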
Ok, so is it an executable I run, or do I edit the default inference code to target a file? How would I use the default inference code in the first place?
First, you can try the example inference pipeline provided in the README. Then you can substitute the example audio and the csv file with your custom clean audio. Finally, you can try the stutter removal process by making some small modifications to the inference code.
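Roughly, the substitution step could look like the sketch below (the csv path and column names here are guesses; copy the header of the example csv that ships with the repo rather than these names):

import csv, shutil

# Put your own clean recording next to the examples (the inference/audio
# path is taken from the mfa align command the toolkit prints).
shutil.copy("my_recording.wav", "inference/audio/my_recording.wav")

# Hypothetical csv row; mirror the example csv's actual columns.
row = {
    "item_name": "my_recording",   # should match the wav filename stem
    "text": "transcript of what is actually spoken",
    "edited_text": "transcript you want the output to say",
}
with open("inference/example.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)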
Is that this one?
python inference/tts/spec_denoiser.py --exp_name spec_denoiser
Thank you!
Did you use the latest code in our repo?
This one is recommended.
You're welcome.
Yes I downloaded the latest code today. I ran the command above, but I'm not sure how to modify it to take an audio file and output a destuttered audio file, like the ones in the demo for FluentSpeech. I guess what I really need is FluentSpeech.
Oh wait, does the code need text to reconstruct the audio? That would make sense, and my project has transcription capabilities. How do I go about feeding in text and an audio file, then, to get back unstuttered audio?
I ended up manually downloading mfa_dict and mfa_model.zip from the Google Drive, but I'm getting this error when running the inference example command:
Traceback (most recent call last):
File "C:\Users\aidan\Downloads\Speech-Editing-Toolkit-stable\Speech-Editing-Toolkit-stable\inference\tts\spec_denoiser.py", line 352, in
Thoughts?
Oh forgot to add:
'mfa' is not recognized as an internal or external command, operable program or batch file.
Generating forced alignments with mfa. Please wait for about several minutes.
mfa align -j 4 --clean inference/audio data/processed/libritts/mfa_dict.txt data/processed/libritts/mfa_model.zip inference/audio/mfa_out
| Unknow hparams: []
| Hparams chains: []
| Hparams: debug: False, exp_name: spec_denoiser, infer: False, validate: False, work_dir: checkpoints/spec_denoiser,
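(The "'mfa' is not recognized" message suggests the Montreal Forced Aligner command-line tool is not installed in the active environment; assuming a conda setup, installing it with something like conda install -c conda-forge montreal-forced-aligner should put the mfa command on PATH.)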
@AidanRosen
@Zain-Jiang What is the exact process for stutter removal? You mentioned modifications to make in the inference code; could you elaborate? My goal would be to eliminate stutter without specifying the word position, maybe by removing words that are not in the edited text? Also, what about the stutter_speech.py file?
I have tried running the spec_denoiser copy and uncommented some lines, but this is as far as I get:
Traceback (most recent call last):
  File "inference/tts/spec_denoiser_copy.py", line 261, in <module>
    StutterSpeechInfer.example_run()
  File "inference/tts/spec_denoiser_copy.py", line 247, in example_run
    wav_out, wav_gt, mel_out, mel_gt = infer_ins.infer_once(inp)
  File "/mnt/c/Users/Jonathan/Apps/Audio/Speech-Editing-Toolkit/inference/tts/base_tts_infer.py", line 97, in infer_once
    output = self.forward_model(inp)
  File "inference/tts/spec_denoiser_copy.py", line 112, in forward_model
    output = self.model(sample['edited_txt_tokens'], time_mel_masks=time_mel_masks, mel2ph=edited_mel2ph, spk_embed=sample['spk_embed'],
  File "/root/miniconda3/envs/SET/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/c/Users/Jonathan/Apps/Audio/Speech-Editing-Toolkit/modules/speech_editing/spec_denoiser/spec_denoiser.py", line 159, in forward
    ret = self.fs(txt_tokens, time_mel_masks, mel2ph, spk_embed, f0, uv, energy,
  File "/root/miniconda3/envs/SET/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/c/Users/Jonathan/Apps/Audio/Speech-Editing-Toolkit/modules/speech_editing/spec_denoiser/fs.py", line 99, in forward
    decoder_inp = decoder_inp + self.forward_pitch(pitch_inp, time_mel_masks, f0, uv, mel2ph, ret, use_pred_pitch=use_pred_pitch)
  File "/mnt/c/Users/Jonathan/Apps/Audio/Speech-Editing-Toolkit/modules/speech_editing/spec_denoiser/fs.py", line 160, in forward_pitch
    masked_f0 = f0*(1-time_mel_masks)
TypeError: unsupported operand type(s) for *: 'NoneType' and 'Tensor'
Hi,
I'm trying to use this code to destutter some audio files. What's the process for this?