Zain-Jiang / Speech-Editing-Toolkit

It's a repository for implementations of neural speech editing algorithms.

How to use this to destutter an audio file? #7

Open AidanRosen opened 1 year ago

AidanRosen commented 1 year ago

Hi,

I'm trying to use this code to destutter some audio files. What's the process for this?

Zain-Jiang commented 1 year ago

You can use the default inference code to reconstruct the (stutter + adjacent words) region, but you need to define this region manually (make the time_mel_mask cover the stutter).
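For illustration, a mask like that could be built by converting the stutter's start/end times into mel-frame indices. This is only a sketch: the helper name, hop size, and sample rate below are assumptions, and the repo works with torch tensors rather than the plain lists used here for clarity.

```python
def build_time_mel_mask(n_frames, stutter_start_s, stutter_end_s,
                        hop_size=256, sample_rate=22050):
    # Convert seconds to mel-frame indices. hop_size and sample_rate
    # are illustrative assumptions; use the values from your config.
    frames_per_sec = sample_rate / hop_size
    start = int(stutter_start_s * frames_per_sec)
    end = int(stutter_end_s * frames_per_sec)
    # 1.0 marks frames to re-generate, 0.0 marks frames to keep.
    return [1.0 if start <= i < end else 0.0 for i in range(n_frames)]

# e.g. mask a stutter heard between 1.2 s and 1.8 s of a ~3 s clip
mask = build_time_mel_mask(258, 1.2, 1.8)
```

Widening the masked span slightly to include the adjacent words, as suggested above, tends to give the model enough context to resynthesize a smooth transition.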

AidanRosen commented 1 year ago

Ok, so is it an executable I run, or do I edit the default inference code to target a file? How would I use the default inference code in the first place?

Zain-Jiang commented 1 year ago

First, try the example inference pipeline provided in the README. Then substitute the example audio and the CSV file with your own clean audio. Finally, you can implement stutter removal with some small modifications to the inference code.

AidanRosen commented 1 year ago

Is that this one?

```
# run with one example
python inference/tts/spec_denoiser.py --exp_name spec_denoiser
```

Thank you!

Zain-Jiang commented 1 year ago

Did you use the latest code in our repo?

This one is recommended.

You're welcome.

AidanRosen commented 1 year ago

Yes I downloaded the latest code today. I ran the command above, but I'm not sure how to modify it to take an audio file and output a destuttered audio file, like the ones in the demo for FluentSpeech. I guess what I really need is FluentSpeech.

AidanRosen commented 1 year ago

Oh wait, does the code need text to reconstruct the audio? That'd make sense, and my project has transcription capabilities. How do I go about taking text and an audio file, then, to return unstuttered audio?
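For what it's worth, text-based speech editors generally pair the audio with its original transcript and an edited transcript; the span where the two differ is what gets regenerated. A sketch of what such an input record might look like (every field name here is hypothetical; the real keys are defined in inference/tts/spec_denoiser.py and may differ):

```python
# Hypothetical shape of one inference input. The transcription your
# project produces would fill "text"; "edited_text" is the same
# sentence with the stutter removed.
inp = {
    "item_name": "my_clip",
    "wav_fn": "inference/audio/my_clip.wav",
    "text": "I w- w- want to go home",   # original (stuttered) transcript
    "edited_text": "I want to go home",  # target transcript, stutter removed
    "region": "[1,3]",                   # word span to regenerate (assumption)
}
```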

AidanRosen commented 1 year ago

I ended up manually downloading mfa_dict and mfa_model.zip from the Google Drive, but I'm getting this error when running the inference example command:

```
Traceback (most recent call last):
  File "C:\Users\aidan\Downloads\Speech-Editing-Toolkit-stable\Speech-Editing-Toolkit-stable\inference\tts\spec_denoiser.py", line 352, in <module>
    SpecDenoiserInfer.example_run(dataset_info)
  File "C:\Users\aidan\Downloads\Speech-Editing-Toolkit-stable\Speech-Editing-Toolkit-stable\inference\tts\spec_denoiser.py", line 255, in example_run
    infer_ins = cls(hp)
  File "C:\Users\aidan\Downloads\Speech-Editing-Toolkit-stable\Speech-Editing-Toolkit-stable\inference\tts\spec_denoiser.py", line 38, in __init__
    self.data_dir = hparams['binary_data_dir']
KeyError: 'binary_data_dir'
```

Thoughts?

AidanRosen commented 1 year ago

Oh forgot to add:

```
'mfa' is not recognized as an internal or external command, operable program or batch file.
Generating forced alignments with mfa. Please wait for about several minutes.
mfa align -j 4 --clean inference/audio data/processed/libritts/mfa_dict.txt data/processed/libritts/mfa_model.zip inference/audio/mfa_out
| Unknow hparams: []
| Hparams chains: []
| Hparams: debug: False, exp_name: spec_denoiser, infer: False, validate: False, work_dir: checkpoints/spec_denoiser,
```

Zain-Jiang commented 1 year ago

@AidanRosen

  1. Did you download the pre-trained FluentSpeech weights?
  2. Did you install the montreal-forced-aligner following the readme?
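A quick sanity check for both points might look like the following; the checkpoint path is only an assumption about a typical layout, and `shutil.which` simply verifies that the `mfa` executable is reachable on PATH (the "'mfa' is not recognized" error above means it is not).

```python
import os
import shutil

def env_ok(ckpt_dir="checkpoints/spec_denoiser"):
    # Returns (weights_present, mfa_on_path). The checkpoint directory
    # is an assumed location; point it at wherever you unpacked the
    # pre-trained FluentSpeech weights.
    weights_present = os.path.isdir(ckpt_dir)
    mfa_on_path = shutil.which("mfa") is not None
    return weights_present, mfa_on_path

print(env_ok())
```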
darkzbaron commented 1 year ago

@Zain-Jiang What is the exact process for stutter removal? You mentioned modifications to perform in the inference code; could you elaborate? My goal would be to eliminate stutters without specifying word positions, maybe by removing words that are not in the edited text? Also, what about the stutter_speech.py file?

darkzbaron commented 1 year ago

I have tried running the spec_denoiser copy and uncommented some lines, but this is as far as I get:

```
Traceback (most recent call last):
  File "inference/tts/spec_denoiser_copy.py", line 261, in <module>
    StutterSpeechInfer.example_run()
  File "inference/tts/spec_denoiser_copy.py", line 247, in example_run
    wav_out, wav_gt, mel_out, mel_gt = infer_ins.infer_once(inp)
  File "/mnt/c/Users/Jonathan/Apps/Audio/Speech-Editing-Toolkit/inference/tts/base_tts_infer.py", line 97, in infer_once
    output = self.forward_model(inp)
  File "inference/tts/spec_denoiser_copy.py", line 112, in forward_model
    output = self.model(sample['edited_txt_tokens'], time_mel_masks=time_mel_masks, mel2ph=edited_mel2ph, spk_embed=sample['spk_embed'],
  File "/root/miniconda3/envs/SET/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/c/Users/Jonathan/Apps/Audio/Speech-Editing-Toolkit/modules/speech_editing/spec_denoiser/spec_denoiser.py", line 159, in forward
    ret = self.fs(txt_tokens, time_mel_masks, mel2ph, spk_embed, f0, uv, energy,
  File "/root/miniconda3/envs/SET/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/c/Users/Jonathan/Apps/Audio/Speech-Editing-Toolkit/modules/speech_editing/spec_denoiser/fs.py", line 99, in forward
    decoder_inp = decoder_inp + self.forward_pitch(pitch_inp, time_mel_masks, f0, uv, mel2ph, ret, use_pred_pitch=use_pred_pitch)
  File "/mnt/c/Users/Jonathan/Apps/Audio/Speech-Editing-Toolkit/modules/speech_editing/spec_denoiser/fs.py", line 160, in forward_pitch
    masked_f0 = f0*(1-time_mel_masks)
TypeError: unsupported operand type(s) for *: 'NoneType' and 'Tensor'
```
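The traceback says `f0` is `None` when `forward_pitch` computes `masked_f0 = f0*(1-time_mel_masks)`, i.e. the sample reached the model without a ground-truth pitch contour. A defensive guard like the one below would unblock the multiply; this is only a sketch of the shape of a fix, not the maintainers' fix, and the real remedy is probably to make sure f0/uv are extracted for the input sample. The repo operates on torch tensors; plain lists are used here for clarity.

```python
def masked_f0_safe(f0, time_mel_masks):
    # Guard against a missing pitch contour: if no ground-truth f0 was
    # provided, substitute zeros so `f0 * (1 - time_mel_masks)` no
    # longer hits a NoneType. Zero-filling may degrade prosody in the
    # edited region, so treat this as a diagnostic workaround.
    if f0 is None:
        f0 = [0.0] * len(time_mel_masks)
    return [v * (1 - m) for v, m in zip(f0, time_mel_masks)]
```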