Zain-Jiang / Speech-Editing-Toolkit

It's a repository for implementations of neural speech editing algorithms.

Where to find mfa_dict.txt and mfa_model.zip? #4

Open mvoodarla opened 1 year ago

mvoodarla commented 1 year ago

Hi! I'm getting the following error when running python inference/tts/spec_denoiser.py --exp_name spec_denoiser. Where can I find the required files? I'm trying to run the basic pre-trained inference of FluentSpeech.

Traceback (most recent call last):
  File "inference/tts/spec_denoiser.py", line 350, in <module>
    dataset_info = data_preprocess(test_file_path, test_wav_directory, dictionary_path, acoustic_model_path,
  File "inference/tts/spec_denoiser.py", line 297, in data_preprocess
    assert os.path.exists(file_path) and os.path.exists(input_directory) and os.path.exists(acoustic_model_path), \
AssertionError: inference/example.csv,inference/audio,data/processed/libritts/mfa_dict.txt,data/processed/libritts/mfa_model.zip
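
For anyone hitting the same assertion: it fires when any of the four paths in the message is missing. A minimal check, run from the repo root (the paths are copied from the traceback above):

import os

# The four paths the assertion in data_preprocess checks, per the error message.
required = [
    'inference/example.csv',
    'inference/audio',
    'data/processed/libritts/mfa_dict.txt',
    'data/processed/libritts/mfa_model.zip',
]
for path in required:
    print(f"{path}: {'ok' if os.path.exists(path) else 'MISSING'}")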
Linghuxc commented 1 year ago

May I ask whether you have solved this problem? I used the pre-trained models for inference and still ran into a lot of problems.

Zain-Jiang commented 1 year ago

Sorry for making you wait so long.

  1. The inference/example.csv is at https://github.com/Zain-Jiang/Speech-Editing-Toolkit/blob/stable/inference/example.csv;
  2. The data/processed/libritts/mfa_dict.txt, data/processed/libritts/mfa_model.zip, and the other files you need to run the basic pre-trained inference of FluentSpeech can be created by the preprocess script in our repo. You can also download them from https://drive.google.com/drive/folders/1H-dk7cNYVn1DSzYq_q66rS5b5xpbdBi4?usp=sharing and put them in data/processed/libritts;
  3. Please specify the MFA version as 2.0.0rc3 (see the version check sketched after this list).
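
A minimal way to confirm the pinned version, assuming MFA was installed from PyPI under the distribution name montreal-forced-aligner (with a conda install, check conda list instead):

from importlib.metadata import version  # Python 3.8+

# 'montreal-forced-aligner' is the assumed PyPI distribution name;
# adjust it if your installation differs.
print(version('montreal-forced-aligner'))  # expect: 2.0.0rc3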

If you find any other problems, please contact me. Thank you very much.

Linghuxc commented 1 year ago

Hi, @Zain-Jiang

  1. I downloaded the relevant documents you provided from this link https://drive.google.com/drive/folders/1H-dk7cNYVn1DSzYq_q66rS5b5xpbdBi4?usp=sharing to complete the inference step.
  2. Based on my experiments, it seems that phone_set.json, spk_map.json, and word_set.json need to be placed in data/binary/hifitts_wav.
  3. However, I still encountered the following problem in the follow-up process:
    INFO - Setting up corpus information...
    INFO - Loading corpus from source files...
    100%|██████████| 1/1 [00:01<00:00,  1.01s/it]
    INFO - Number of speakers in corpus: 1, average number of utterances per speaker: 1.0
    INFO - Setting up training data...
    INFO - Generating base features (mfcc)...
    INFO - Generating MFCCs...
    100%|██████████| 1/1 [00:01<00:00,  1.05s/it]
    INFO - Calculating CMVN...
    INFO - Compiling training graphs...
    100%|██████████| 1/1 [00:01<00:00,  1.07s/it]
    INFO - Performing first-pass alignment...
    INFO - Generating alignments...
    100%|██████████| 1/1 [00:01<00:00,  1.18s/it]
    INFO - Calculating fMLLR for speaker adaptation...
    100%|██████████| 1/1 [00:01<00:00,  1.17s/it]
    INFO - Performing second-pass alignment...
    INFO - Generating alignments...
    100%|██████████| 1/1 [00:01<00:00,  1.20s/it]
    INFO - Collecting word alignments from alignment lattices...
    100%|██████████| 1/1 [00:01<00:00,  1.06s/it]
    INFO - Collecting phone alignments from alignment lattices...
    100%|██████████| 1/1 [00:01<00:00,  1.08s/it]
    INFO - Exporting TextGrids to inference/audio/mfa_out...
    100%|██████████| 1/1 [00:01<00:00,  1.07s/it]
    INFO - Finished exporting TextGrids to inference/audio/mfa_out!
    INFO - Done! Everything took 12.852613925933838 seconds
    Generating forced alignments with mfa. Please wait for about several minutes.
    mfa align -j 4 --clean inference/audio data/processed/libritts/mfa_dict.txt data/processed/libritts/mfa_model.zip inference/audio/mfa_out
    | Unknow hparams:  []
    | Hparams chains:  []
    | Hparams: 
    accumulate_grad_batches: 1, adam_b1: 0.8, adam_b2: 0.99, amp: False, audio_num_mel_bins: 80, 
    audio_sample_rate: 22050, aux_context_window: 0, base_config: ['egs/egs_bases/tts/vocoder/hifigan.yaml', './base.yaml'], binarization_args: {'reset_phone_dict': True, 'reset_word_dict': True, 'shuffle': True, 'trim_eos_bos': False, 'trim_sil': False, 'with_align': False, 'with_f0': True, 'with_f0cwt': False, 'with_linear': False, 'with_spk_embed': False, 'with_spk_id': True, 'with_txt': False, 'with_wav': True, 'with_word': False}, binarizer_cls: data_gen.tts.base_binarizer.BaseBinarizer, 
    binary_data_dir: data/binary/hifitts_wav, check_val_every_n_epoch: 10, clip_grad_norm: 1, clip_grad_value: 0, debug: False, 
    dec_ffn_kernel_size: 9, dec_layers: 4, dict_dir: , disc_start_steps: 40000, discriminator_grad_norm: 1, 
    discriminator_optimizer_params: {'lr': 0.0002}, discriminator_scheduler_params: {'gamma': 0.999, 'step_size': 600}, dropout: 0.1, ds_workers: 1, enc_ffn_kernel_size: 9, 
    enc_layers: 4, endless_ds: True, exp_name: spec_denoiser, ffn_act: gelu, ffn_padding: SAME, 
    fft_size: 1024, fmax: 7600, fmin: 80, frames_multiple: 1, gen_dir_name: , 
    generator_grad_norm: 10, generator_optimizer_params: {'lr': 0.0002}, generator_scheduler_params: {'gamma': 0.999, 'step_size': 600}, griffin_lim_iters: 60, hidden_size: 256, 
    hop_size: 256, infer: False, lambda_adv: 1.0, lambda_cdisc: 4.0, lambda_mel: 5.0, 
    lambda_mel_adv: 1.0, load_ckpt: , loud_norm: False, lr: 2.0, max_epochs: 1000, 
    max_frames: 1548, max_input_tokens: 1550, max_samples: 8192, max_sentences: 24, max_tokens: 30000, 
    max_updates: 3000000, max_valid_sentences: 1, max_valid_tokens: 60000, mel_vmax: 1.5, mel_vmin: -6, 
    min_frames: 0, min_level_db: -100, num_ckpt_keep: 3, num_heads: 2, num_mels: 80, 
    num_sanity_val_steps: 5, num_spk: 50, optimizer_adam_beta1: 0.9, optimizer_adam_beta2: 0.98, out_wav_norm: False, 
    pitch_extractor: parselmouth, pre_align_args: {'allow_no_txt': False, 'denoise': False, 'sox_resample': False, 'sox_to_wav': True, 'trim_sil': False, 'txt_processor': 'en', 'use_tone': True}, pre_align_cls: egs.datasets.audio.hifitts.pre_align.HifiTTSPreAlign, print_nan_grads: False, processed_data_dir: data/processed/hifitts, 
    profile_infer: False, raw_data_dir: data/raw/hifi-tts, ref_level_db: 20, rename_tmux: True, resblock: 1, 
    resblock_dilation_sizes: [[1, 3, 5], [1, 3, 5], [1, 3, 5]], resblock_kernel_sizes: [3, 7, 11], resume_from_checkpoint: 0, save_best: True, save_codes: [], 
    save_f0: False, save_gt: True, scheduler: rsqrt, seed: 1234, sort_by_len: True, 
    task_cls: tasks.vocoder.hifigan.HifiGanTask, tb_log_interval: 100, test_input_dir: , test_num: 200, test_set_name: test, 
    train_set_name: train, upsample_initial_channel: 512, upsample_kernel_sizes: [16, 16, 4, 4], upsample_rates: [8, 8, 2, 2], use_cdisc: False, 
    use_cond_disc: False, use_fm_loss: False, use_ms_stft: False, use_pitch_embed: False, use_spec_disc: False, 
    use_spk_id: True, val_check_interval: 2000, valid_infer_interval: 10000, valid_monitor_key: val_loss, valid_monitor_mode: min, 
    valid_set_name: valid, validate: False, vocoder: pwg, vocoder_ckpt: , warmup_updates: 8000, 
    weight_decay: 0, win_length: None, win_size: 1024, window: hann, word_size: 30000, 
    work_dir: checkpoints/spec_denoiser, 
    Traceback (most recent call last):
      File "inference/tts/spec_denoiser.py", line 352, in <module>
        SpecDenoiserInfer.example_run(dataset_info)
      File "inference/tts/spec_denoiser.py", line 255, in example_run
        infer_ins = cls(hp)
      File "inference/tts/spec_denoiser.py", line 42, in __init__
        self.model = self.build_model()
      File "inference/tts/spec_denoiser.py", line 53, in build_model
        out_dims=hparams['audio_num_mel_bins'], denoise_fn=DIFF_DECODERS[hparams['diff_decoder_type']](hparams),
    KeyError: 'diff_decoder_type'

    Could you help me? Thank you!

Zain-Jiang commented 1 year ago

@Linghuxc

  1. phone_set.json should be placed in the directory defined by hparams['binary_data_dir'].

The hparams['diff_decoder_type'] is defined in the config.yaml of our pre-trained checkpoint and will be loaded automatically. It seems that the config.yaml has not been loaded correctly.
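
One way to check, a minimal sketch assuming the checkpoint directory is checkpoints/spec_denoiser (the work_dir from the hparams dump above) and that PyYAML is available:

import yaml

# Path assumed from hparams['work_dir']; adjust to your setup.
with open('checkpoints/spec_denoiser/config.yaml') as f:
    hp = yaml.safe_load(f)

print('diff_decoder_type' in hp)  # should print True for the right config
print(hp.get('binary_data_dir'))  # the directory where phone_set.json belongs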

Linghuxc commented 1 year ago

@Zain-Jiang Yes, you are right. I found no hparams['diff_decoder_type'] in config.yaml; it is in spec_denoiser_libritts.yaml. So maybe we need to load spec_denoiser_libritts.yaml instead of the default config.yaml?

Zain-Jiang commented 1 year ago

@Linghuxc The original config.yaml in the link https://drive.google.com/drive/folders/1saqpWc4vrSgUZvRvHkf2QbwWSikMTyoo?usp=sharing has diff_decoder_type. I'm sorry; config.yaml might be getting replaced in one of the preprocess steps. I will look into it. Loading spec_denoiser_libritts.yaml is also a good way to fix it.

Linghuxc commented 1 year ago

@Zain-Jiang I checked the config.yaml file in this link https://drive.google.com/drive/folders/1saqpWc4vrSgUZvRvHkf2QbwWSikMTyoo?usp=sharing and there is no diff_decoder_type. The spec_denoiser_libritts.yaml and config.yaml I mentioned earlier are here: https://github.com/Zain-Jiang/Speech-Editing-Toolkit/tree/stable/egs. Does the inference process use the fluentspeech/egs configuration files?

Zain-Jiang commented 1 year ago

@Linghuxc spec_denoiser can be seen as the FluentSpeech model without the stutter-removal parts, so using spec_denoiser_libritts.yaml for the inference process is fine.

Linghuxc commented 1 year ago

@Zain-Jiang Sorry, I tried spec_denoiser.yaml instead of config.yaml and made some changes to the paths. This is my current config.yaml file:

## Training
accumulate_grad_batches: 1
add_word_pos: true
amp: false
audio_num_mel_bins: 80
audio_sample_rate: 22050
binarizer_cls: data_gen.tts.base_binarizer.BaseBinarizer
raw_data_dir: data/raw/libritts
processed_data_dir: data/processed/libritts
binary_data_dir: data/binary/hifitts_wav
check_val_every_n_epoch: 10
clip_grad_norm: 1
clip_grad_value: 0
debug: false
ds_name: libritts
ds_workers: 2
endless_ds: true
eval_max_batches: -1
lr: 0.0002
load_ckpt: ''
max_epochs: 1000
max_frames: 1548
max_input_tokens: 1550
max_sentences: 16
max_tokens: 40000
max_updates: 2000000
max_valid_sentences: 1
max_valid_tokens: 60000
num_ckpt_keep: 3
num_sanity_val_steps: 5
num_spk: 1261
num_valid_plots: 10
optimizer_adam_beta1: 0.9
optimizer_adam_beta2: 0.98
posterior_start_steps: 0
print_nan_grads: false
profile_infer: false
rename_tmux: true
resume_from_checkpoint: 0
save_best: false
save_codes:
- tasks
- modules
save_f0: false
save_gt: true
scheduler: warmup
seed: 1234
sigmoid_scale: false
sort_by_len: true
task_cls: tasks.speech_editing.spec_denoiser.SpeechDenoiserTask
tb_log_interval: 100
test_input_yaml: ''
test_num: 100
test_set_name: test
train_set_name: train
train_sets: ''
two_stage: true
val_check_interval: 2000
valid_infer_interval: 2000
valid_monitor_key: val_loss
valid_monitor_mode: min
valid_set_name: valid
warmup_updates: 8000
weight_decay: 0
word_dict_size: 40500
mask_ratio: 0.12

mask_type: 'alignment_aware'
training_mask_ratio: 0.80
infer_mask_ratio: 0.30

diff_decoder_type: 'wavenet'
latent_cond_type: 'add'
dilation_cycle_length: 1
residual_layers: 20
residual_channels: 256
keep_bins: 80
spec_min: [ ]
spec_max: [ ]
diff_loss_type: l1
max_beta: 0.06
## diffusion
timesteps: 8
timescale: 1
schedule_type: 'vpsde'

conv_use_pos: false
dec_dilations:
- 1
- 1
- 1
- 1
dec_ffn_kernel_size: 9
dec_inp_add_noise: false
dec_kernel_size: 5
dec_layers: 4
dec_post_net_kernel: 3
decoder_rnn_dim: 0
decoder_type: conv
detach_postflow_input: true
dropout: 0.0
dur_level: word
dur_predictor_kernel: 5
dur_predictor_layers: 3
enc_dec_norm: ln
enc_dilations:
- 1
- 1
- 1
- 1
enc_ffn_kernel_size: 5
enc_kernel_size: 5
enc_layers: 4
enc_post_net_kernel: 3
enc_pre_ln: true
enc_prenet: true
encoder_K: 8
encoder_type: conv
ffn_act: gelu
ffn_hidden_size: 768
fft_size: 1024
hidden_size: 192
hop_size: 256
latent_size: 16
layers_in_block: 2
num_heads: 2
mel_disc_hidden_size: 128
predictor_dropout: 0.2
predictor_grad: 0.1
predictor_hidden: -1
predictor_kernel: 5
predictor_layers: 5
prior_flow_hidden: 64
prior_flow_kernel_size: 3
prior_flow_n_blocks: 4
ref_norm_layer: bn
share_wn_layers: 4
text_encoder_postnet: true
use_cond_proj: false
use_gt_dur: false
use_gt_f0: false
use_latent_cond: false
use_pitch_embed: true
use_pos_embed: true
use_post_flow: true
use_prior_flow: true
use_spk_embed: true
use_spk_id: false
use_txt_cond: true
use_uv: true
mel_enc_layers: 4

f0_max: 600
f0_min: 80
fmax: 7600
fmin: 55
frames_multiple: 1
loud_norm: false
mel_vmax: 1.5
mel_vmin: -6
min_frames: 0
noise_scale: 0.8
win_size: 1024
pitch_extractor: parselmouth
pitch_type: frame

gen_dir_name: ''
infer: false
infer_post_glow: true
out_wav_norm: false
test_ids: [ ]
eval_mcd: False

kl_min: 0.0
kl_start_steps: 10000
lambda_commit: 0.25
lambda_energy: 0.1
lambda_f0: 1.0
lambda_kl: 1.0
lambda_mel_adv: 0.05
lambda_ph_dur: 0.1
lambda_sent_dur: 0.0
lambda_uv: 1.0
lambda_word_dur: 1.0
mel_losses: l1:0.5|ssim:0.5

vocoder: HifiGAN
vocoder_ckpt: pretrained/hifigan_hifitts

Then the following issue occurred; the model cannot be found:

INFO - Setting up corpus information...
INFO - Loading corpus from source files...
100%|██████████| 1/1 [00:01<00:00, 1.01s/it]
INFO - Number of speakers in corpus: 1, average number of utterances per speaker: 1.0
INFO - Setting up training data...
INFO - Generating base features (mfcc)...
INFO - Generating MFCCs...
100%|██████████| 1/1 [00:01<00:00, 1.04s/it]
INFO - Calculating CMVN...
INFO - Compiling training graphs...
100%|██████████| 1/1 [00:01<00:00, 1.06s/it]
INFO - Performing first-pass alignment...
INFO - Generating alignments...
100%|██████████| 1/1 [00:01<00:00, 1.17s/it]
INFO - Calculating fMLLR for speaker adaptation...
100%|██████████| 1/1 [00:01<00:00, 1.14s/it]
INFO - Performing second-pass alignment...
INFO - Generating alignments...
100%|██████████| 1/1 [00:01<00:00, 1.19s/it]
INFO - Collecting word alignments from alignment lattices...
100%|██████████| 1/1 [00:01<00:00, 1.05s/it]
INFO - Collecting phone alignments from alignment lattices...
100%|██████████| 1/1 [00:01<00:00, 1.06s/it]
INFO - Exporting TextGrids to inference/audio/mfa_out...
100%|██████████| 1/1 [00:01<00:00, 1.07s/it]
INFO - Finished exporting TextGrids to inference/audio/mfa_out!
INFO - Done! Everything took 12.57474946975708 seconds
Generating forced alignments with mfa. Please wait for about several minutes.
mfa align -j 4 --clean inference/audio data/processed/libritts/mfa_dict.txt data/processed/libritts/mfa_model.zip inference/audio/mfa_out
| Unknow hparams: []
| Hparams chains: []
| Hparams:
accumulate_grad_batches: 1, add_word_pos: True, amp: False, audio_num_mel_bins: 80, audio_sample_rate: 22050,
binarizer_cls: data_gen.tts.base_binarizer.BaseBinarizer, binary_data_dir: data/binary/hifitts_wav, check_val_every_n_epoch: 10, clip_grad_norm: 1, clip_grad_value: 0,
conv_use_pos: False, debug: False, dec_dilations: [1, 1, 1, 1], dec_ffn_kernel_size: 9, dec_inp_add_noise: False,
dec_kernel_size: 5, dec_layers: 4, dec_post_net_kernel: 3, decoder_rnn_dim: 0, decoder_type: conv,
detach_postflow_input: True, diff_decoder_type: wavenet, diff_loss_type: l1, dilation_cycle_length: 1, dropout: 0.0,
ds_name: libritts, ds_workers: 2, dur_level: word, dur_predictor_kernel: 5, dur_predictor_layers: 3,
enc_dec_norm: ln, enc_dilations: [1, 1, 1, 1], enc_ffn_kernel_size: 5, enc_kernel_size: 5, enc_layers: 4,
enc_post_net_kernel: 3, enc_pre_ln: True, enc_prenet: True, encoder_K: 8, encoder_type: conv,
endless_ds: True, eval_max_batches: -1, eval_mcd: False, exp_name: spec_denoiser, f0_max: 600,
f0_min: 80, ffn_act: gelu, ffn_hidden_size: 768, fft_size: 1024, fmax: 7600,
fmin: 55, frames_multiple: 1, gen_dir_name: , hidden_size: 192, hop_size: 256,
infer: False, infer_mask_ratio: 0.3, infer_post_glow: True, keep_bins: 80, kl_min: 0.0,
kl_start_steps: 10000, lambda_commit: 0.25, lambda_energy: 0.1, lambda_f0: 1.0, lambda_kl: 1.0,
lambda_mel_adv: 0.05, lambda_ph_dur: 0.1, lambda_sent_dur: 0.0, lambda_uv: 1.0, lambda_word_dur: 1.0,
latent_cond_type: add, latent_size: 16, layers_in_block: 2, load_ckpt: , loud_norm: False,
lr: 0.0002, mask_ratio: 0.12, mask_type: alignment_aware, max_beta: 0.06, max_epochs: 1000,
max_frames: 1548, max_input_tokens: 1550, max_sentences: 16, max_tokens: 40000, max_updates: 2000000,
max_valid_sentences: 1, max_valid_tokens: 60000, mel_disc_hidden_size: 128, mel_enc_layers: 4, mel_losses: l1:0.5|ssim:0.5,
mel_vmax: 1.5, mel_vmin: -6, min_frames: 0, noise_scale: 0.8, num_ckpt_keep: 3,
num_heads: 2, num_sanity_val_steps: 5, num_spk: 1261, num_valid_plots: 10, optimizer_adam_beta1: 0.9,
optimizer_adam_beta2: 0.98, out_wav_norm: False, pitch_extractor: parselmouth, pitch_type: frame, posterior_start_steps: 0,
predictor_dropout: 0.2, predictor_grad: 0.1, predictor_hidden: -1, predictor_kernel: 5, predictor_layers: 5,
print_nan_grads: False, prior_flow_hidden: 64, prior_flow_kernel_size: 3, prior_flow_n_blocks: 4, processed_data_dir: data/processed/libritts,
profile_infer: False, raw_data_dir: data/raw/libritts, ref_norm_layer: bn, rename_tmux: True, residual_channels: 256,
residual_layers: 20, resume_from_checkpoint: 0, save_best: False, save_codes: ['tasks', 'modules'], save_f0: False,
save_gt: True, schedule_type: vpsde, scheduler: warmup, seed: 1234, share_wn_layers: 4,
sigmoid_scale: False, sort_by_len: True, spec_max: [], spec_min: [], task_cls: tasks.speech_editing.spec_denoiser.SpeechDenoiserTask,
tb_log_interval: 100, test_ids: [], test_input_yaml: , test_num: 100, test_set_name: test,
text_encoder_postnet: True, timescale: 1, timesteps: 8, train_set_name: train, train_sets: ,
training_mask_ratio: 0.8, two_stage: True, use_cond_proj: False, use_gt_dur: False, use_gt_f0: False,
use_latent_cond: False, use_pitch_embed: True, use_pos_embed: True, use_post_flow: True, use_prior_flow: True,
use_spk_embed: True, use_spk_id: False, use_txt_cond: True, use_uv: True, val_check_interval: 2000,
valid_infer_interval: 2000, valid_monitor_key: val_loss, valid_monitor_mode: min, valid_set_name: valid, validate: False,
vocoder: HifiGAN, vocoder_ckpt: pretrained/hifigan_hifitts, warmup_updates: 8000, weight_decay: 0, win_size: 1024,
word_dict_size: 40500, work_dir: checkpoints/spec_denoiser,
Traceback (most recent call last):
  File "inference/tts/spec_denoiser.py", line 352, in <module>
    SpecDenoiserInfer.example_run(dataset_info)
  File "inference/tts/spec_denoiser.py", line 255, in example_run
    infer_ins = cls(hp)
  File "inference/tts/spec_denoiser.py", line 42, in __init__
    self.model = self.build_model()
  File "inference/tts/spec_denoiser.py", line 58, in build_model
    load_ckpt(model, hparams['work_dir'], 'model')
  File "/home/yinhaowen/fluentspeech/utils/commons/ckpt_utils.py", line 41, in load_ckpt
    state_dict = state_dict[model_name]
KeyError: 'model'

I don't know what the value of model should be; maybe I should run the preprocessing again to regenerate the config.yaml.

Zain-Jiang commented 1 year ago

Is the pre-trained ckpt placed in the hparams['work_dir']?

Linghuxc commented 1 year ago

@Zain-Jiang Yes, the pre-trained ckpt I placed in hparams['work_dir'] is checkpoints/spec_denoiser/model_ckpt_steps_2168000.ckpt. The KeyError: 'model' still occurs.

Linghuxc commented 1 year ago

When I debug it, the value of model_name is 'model', which does not contain '.', so the code looks up model_name in the dictionary state_dict:

if checkpoint is not None:
    state_dict = checkpoint["state_dict"]
    if len([k for k in state_dict.keys() if '.' in k]) > 0:
        # Keys look like 'model.layer.weight': strip the 'model.' prefix.
        state_dict = {k[len(model_name) + 1:]: v for k, v in state_dict.items()
                      if k.startswith(f'{model_name}.')}
    else:
        if '.' not in model_name:
            # No dotted keys: state_dict is grouped per sub-model,
            # so look up the 'model' entry directly.
            state_dict = state_dict[model_name]
        else:
            base_model_name = model_name.split('.')[0]
            rest_model_name = model_name[len(base_model_name) + 1:]
            state_dict = {
                k[len(rest_model_name) + 1:]: v for k, v in state_dict[base_model_name].items()
                if k.startswith(f'{rest_model_name}.')}

However, the key "model" is not included in state_dict.

Also, it seems that setting work_dir in config.yaml has no effect, because hparams.py automatically sets work_dir to checkpoints/<exp_name> from the --exp_name argument.
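
To see which keys a checkpoint actually contains, a minimal sketch assuming a PyTorch .ckpt path like the one above:

import torch

# Example path; point it at the checkpoint you are loading.
ckpt = torch.load('checkpoints/spec_denoiser/model_ckpt_steps_2168000.ckpt',
                  map_location='cpu')
print(ckpt.keys())                          # e.g. dict_keys(['state_dict', ...])
print(list(ckpt['state_dict'].keys())[:5])  # key prefixes reveal which model this is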

Zain-Jiang commented 1 year ago

@Linghuxc Ah, I found the reason.

The FluentSpeech model we provide is at the following link: https://drive.google.com/drive/folders/1saqpWc4vrSgUZvRvHkf2QbwWSikMTyoo?usp=sharing, and the checkpoint's name is model_ckpt_steps_568000.ckpt. Running the inference code also requires a pre-trained HifiGAN vocoder, provided at https://drive.google.com/drive/folders/1n_0tROauyiAYGUDbmoQ__eqyT_G4RvjN?usp=sharing, whose checkpoint name is model_ckpt_steps_2168000.ckpt.

Perhaps you have loaded the HifiGAN checkpoint into FluentSpeech.

FluentSpeech is used to edit the mel-spectrogram, and HifiGAN is the vocoder we use to transform the mel-spectrogram into a waveform. I'm sure that if you load the models correctly, you can freely enjoy the beauty of speech editing.
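
For reference, a sketch of the layout this implies, with paths assumed from work_dir and vocoder_ckpt in the config above and checkpoint names taken from this comment:

import os

# Hypothetical expected layout: FluentSpeech editor vs. HifiGAN vocoder.
expected = {
    'checkpoints/spec_denoiser/model_ckpt_steps_568000.ckpt': 'FluentSpeech (mel-spec editing)',
    'pretrained/hifigan_hifitts/model_ckpt_steps_2168000.ckpt': 'HifiGAN (vocoder)',
}
for path, role in expected.items():
    status = 'ok' if os.path.exists(path) else 'MISSING'
    print(f'{role}: {path} [{status}]')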

Linghuxc commented 1 year ago

@Zain-Jiang Oh yes, I had loaded the model incorrectly! Now I can successfully generate the edited audio. Thank you very much for your help, and I'm sorry for taking up your time with my oversight!