mvoodarla opened this issue 1 year ago
May I ask if you have solved this problem? I used the pre-trained models for inference and still ran into a lot of problems.
Sorry for making you wait for so long. `inference/example.csv` is at https://github.com/Zain-Jiang/Speech-Editing-Toolkit/blob/stable/inference/example.csv; `data/processed/libritts/mfa_dict.txt`, `data/processed/libritts/mfa_model.zip`, and the other files you need to run the basic pre-trained inference of FluentSpeech can be created by the preprocess script in our repo. You can also download them from https://drive.google.com/drive/folders/1H-dk7cNYVn1DSzYq_q66rS5b5xpbdBi4?usp=sharing and put them in `data/processed/libritts`. If you find any other problems, please contact me. Thank you very much.
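A quick sketch to check that everything landed where it should (the file list below is pieced together from the comments in this thread, so treat it as an assumption and adjust for your setup):

```python
# Sketch: verify the files this thread says inference needs actually exist.
# The exact file list is an assumption based on the comments in this issue.
import os

required = [
    'inference/example.csv',
    'data/processed/libritts/mfa_dict.txt',
    'data/processed/libritts/mfa_model.zip',
    'data/binary/hifitts_wav/phone_set.json',
    'data/binary/hifitts_wav/spk_map.json',
    'data/binary/hifitts_wav/word_set.json',
]
for path in required:
    print(f"{path}: {'ok' if os.path.exists(path) else 'MISSING'}")
```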
Hi, @Zain-Jiang, I used the files from https://drive.google.com/drive/folders/1H-dk7cNYVn1DSzYq_q66rS5b5xpbdBi4?usp=sharing to complete the inference step. `phone_set.json`, `spk_map.json`, and `word_set.json` need to be placed in `data/binary/hifitts_wav`.

```
INFO - Setting up corpus information...
INFO - Loading corpus from source files...
100%|██████████| 1/1 [00:01<00:00, 1.01s/it]
INFO - Number of speakers in corpus: 1, average number of utterances per speaker: 1.0
INFO - Setting up training data...
INFO - Generating base features (mfcc)...
INFO - Generating MFCCs...
100%|██████████| 1/1 [00:01<00:00, 1.05s/it]
INFO - Calculating CMVN...
INFO - Compiling training graphs...
100%|██████████| 1/1 [00:01<00:00, 1.07s/it]
INFO - Performing first-pass alignment...
INFO - Generating alignments...
100%|██████████| 1/1 [00:01<00:00, 1.18s/it]
INFO - Calculating fMLLR for speaker adaptation...
100%|██████████| 1/1 [00:01<00:00, 1.17s/it]
INFO - Performing second-pass alignment...
INFO - Generating alignments...
100%|██████████| 1/1 [00:01<00:00, 1.20s/it]
INFO - Collecting word alignments from alignment lattices...
100%|██████████| 1/1 [00:01<00:00, 1.06s/it]
INFO - Collecting phone alignments from alignment lattices...
100%|██████████| 1/1 [00:01<00:00, 1.08s/it]
INFO - Exporting TextGrids to inference/audio/mfa_out...
100%|██████████| 1/1 [00:01<00:00, 1.07s/it]
INFO - Finished exporting TextGrids to inference/audio/mfa_out!
INFO - Done! Everything took 12.852613925933838 seconds
Generating forced alignments with mfa. Please wait for about several minutes.
mfa align -j 4 --clean inference/audio data/processed/libritts/mfa_dict.txt data/processed/libritts/mfa_model.zip inference/audio/mfa_out
| Unknow hparams: []
| Hparams chains: []
| Hparams:
accumulate_grad_batches: 1, adam_b1: 0.8, adam_b2: 0.99, amp: False, audio_num_mel_bins: 80,
audio_sample_rate: 22050, aux_context_window: 0, base_config: ['egs/egs_bases/tts/vocoder/hifigan.yaml', './base.yaml'], binarization_args: {'reset_phone_dict': True, 'reset_word_dict': True, 'shuffle': True, 'trim_eos_bos': False, 'trim_sil': False, 'with_align': False, 'with_f0': True, 'with_f0cwt': False, 'with_linear': False, 'with_spk_embed': False, 'with_spk_id': True, 'with_txt': False, 'with_wav': True, 'with_word': False}, binarizer_cls: data_gen.tts.base_binarizer.BaseBinarizer,
binary_data_dir: data/binary/hifitts_wav, check_val_every_n_epoch: 10, clip_grad_norm: 1, clip_grad_value: 0, debug: False,
dec_ffn_kernel_size: 9, dec_layers: 4, dict_dir: , disc_start_steps: 40000, discriminator_grad_norm: 1,
discriminator_optimizer_params: {'lr': 0.0002}, discriminator_scheduler_params: {'gamma': 0.999, 'step_size': 600}, dropout: 0.1, ds_workers: 1, enc_ffn_kernel_size: 9,
enc_layers: 4, endless_ds: True, exp_name: spec_denoiser, ffn_act: gelu, ffn_padding: SAME,
fft_size: 1024, fmax: 7600, fmin: 80, frames_multiple: 1, gen_dir_name: ,
generator_grad_norm: 10, generator_optimizer_params: {'lr': 0.0002}, generator_scheduler_params: {'gamma': 0.999, 'step_size': 600}, griffin_lim_iters: 60, hidden_size: 256,
hop_size: 256, infer: False, lambda_adv: 1.0, lambda_cdisc: 4.0, lambda_mel: 5.0,
lambda_mel_adv: 1.0, load_ckpt: , loud_norm: False, lr: 2.0, max_epochs: 1000,
max_frames: 1548, max_input_tokens: 1550, max_samples: 8192, max_sentences: 24, max_tokens: 30000,
max_updates: 3000000, max_valid_sentences: 1, max_valid_tokens: 60000, mel_vmax: 1.5, mel_vmin: -6,
min_frames: 0, min_level_db: -100, num_ckpt_keep: 3, num_heads: 2, num_mels: 80,
num_sanity_val_steps: 5, num_spk: 50, optimizer_adam_beta1: 0.9, optimizer_adam_beta2: 0.98, out_wav_norm: False,
pitch_extractor: parselmouth, pre_align_args: {'allow_no_txt': False, 'denoise': False, 'sox_resample': False, 'sox_to_wav': True, 'trim_sil': False, 'txt_processor': 'en', 'use_tone': True}, pre_align_cls: egs.datasets.audio.hifitts.pre_align.HifiTTSPreAlign, print_nan_grads: False, processed_data_dir: data/processed/hifitts,
profile_infer: False, raw_data_dir: data/raw/hifi-tts, ref_level_db: 20, rename_tmux: True, resblock: 1,
resblock_dilation_sizes: [[1, 3, 5], [1, 3, 5], [1, 3, 5]], resblock_kernel_sizes: [3, 7, 11], resume_from_checkpoint: 0, save_best: True, save_codes: [],
save_f0: False, save_gt: True, scheduler: rsqrt, seed: 1234, sort_by_len: True,
task_cls: tasks.vocoder.hifigan.HifiGanTask, tb_log_interval: 100, test_input_dir: , test_num: 200, test_set_name: test,
train_set_name: train, upsample_initial_channel: 512, upsample_kernel_sizes: [16, 16, 4, 4], upsample_rates: [8, 8, 2, 2], use_cdisc: False,
use_cond_disc: False, use_fm_loss: False, use_ms_stft: False, use_pitch_embed: False, use_spec_disc: False,
use_spk_id: True, val_check_interval: 2000, valid_infer_interval: 10000, valid_monitor_key: val_loss, valid_monitor_mode: min,
valid_set_name: valid, validate: False, vocoder: pwg, vocoder_ckpt: , warmup_updates: 8000,
weight_decay: 0, win_length: None, win_size: 1024, window: hann, word_size: 30000,
work_dir: checkpoints/spec_denoiser,
Traceback (most recent call last):
  File "inference/tts/spec_denoiser.py", line 352, in <module>
    SpecDenoiserInfer.example_run(dataset_info)
  File "inference/tts/spec_denoiser.py", line 255, in example_run
    infer_ins = cls(hp)
  File "inference/tts/spec_denoiser.py", line 42, in __init__
    self.model = self.build_model()
  File "inference/tts/spec_denoiser.py", line 53, in build_model
    out_dims=hparams['audio_num_mel_bins'], denoise_fn=DIFF_DECODERS[hparams['diff_decoder_type']](hparams),
KeyError: 'diff_decoder_type'
```
Could you help me? Thank you!
@Linghuxc `phone_set.json` should be placed in the directory defined by `hparams['binary_data_dir']`. The `hparams['diff_decoder_type']` is defined in the `config.yaml` of our pre-trained checkpoint and will be loaded automatically. It seems that the `config.yaml` has not been loaded correctly.
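A minimal check of whether the run will actually see the key (a sketch; it assumes the saved config sits at `checkpoints/spec_denoiser/config.yaml`, matching the `work_dir` shown in your log):

```python
# Sketch: confirm the config the run will see actually defines the key.
# The path below is an assumption based on work_dir in this thread.
import yaml

with open('checkpoints/spec_denoiser/config.yaml') as f:
    cfg = yaml.safe_load(f)

print(cfg.get('diff_decoder_type'))  # expect 'wavenet'; None means a stale config
```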
@Zain-Jiang Yes, you are right. I found no `hparams['diff_decoder_type']` in `config.yaml`; it is in `spec_denoiser_libritts.yaml`. So maybe we need to load `spec_denoiser_libritts.yaml` instead of the default `config.yaml`?
@Linghuxc The original `config.yaml` in the link https://drive.google.com/drive/folders/1saqpWc4vrSgUZvRvHkf2QbwWSikMTyoo?usp=sharing has `diff_decoder_type`. I'm sorry; the `config.yaml` might have been replaced in one of the preprocess steps. I will check on it. Loading `spec_denoiser_libritts.yaml` is also a good way to fix it.
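One low-tech way to do that (a sketch; both paths are assumptions based on this thread, and it presumes the inference run reads `checkpoints/spec_denoiser/config.yaml`, which matches the `work_dir` behavior discussed later):

```python
# Sketch: overwrite the checkpoint's stale config with the egs config.
# Both paths are assumptions from this thread; back up the original first.
import shutil

shutil.copy('checkpoints/spec_denoiser/config.yaml',
            'checkpoints/spec_denoiser/config.yaml.bak')
shutil.copy('egs/spec_denoiser_libritts.yaml',
            'checkpoints/spec_denoiser/config.yaml')
```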
@Zain-Jiang I checked the `config.yaml` file in this link https://drive.google.com/drive/folders/1saqpWc4vrSgUZvRvHkf2QbwWSikMTyoo?usp=sharing and there is no `diff_decoder_type` in it. The `spec_denoiser_libritts.yaml` and `config.yaml` I mentioned earlier are here: https://github.com/Zain-Jiang/Speech-Editing-Toolkit/tree/stable/egs. Does it use the `fluentspeech/egs` configuration files as part of the inference process?
@Linghuxc `spec_denoiser` can be seen as the FluentSpeech model without the stutter-removal parts, so using `spec_denoiser_libritts.yaml` for the inference process is OK.
@Zain-Jiang Sorry, I tried `spec_denoiser.yaml` instead of `config.yaml` and made some changes to the paths. This is my current `config.yaml` file:

```yaml
## Training
accumulate_grad_batches: 1
add_word_pos: true
amp: false
audio_num_mel_bins: 80
audio_sample_rate: 22050
binarizer_cls: data_gen.tts.base_binarizer.BaseBinarizer
raw_data_dir: data/raw/libritts
processed_data_dir: data/processed/libritts
binary_data_dir: data/binary/hifitts_wav
check_val_every_n_epoch: 10
clip_grad_norm: 1
clip_grad_value: 0
debug: false
ds_name: libritts
ds_workers: 2
endless_ds: true
eval_max_batches: -1
lr: 0.0002
load_ckpt: ''
max_epochs: 1000
max_frames: 1548
max_input_tokens: 1550
max_sentences: 16
max_tokens: 40000
max_updates: 2000000
max_valid_sentences: 1
max_valid_tokens: 60000
num_ckpt_keep: 3
num_sanity_val_steps: 5
num_spk: 1261
num_valid_plots: 10
optimizer_adam_beta1: 0.9
optimizer_adam_beta2: 0.98
posterior_start_steps: 0
print_nan_grads: false
profile_infer: false
rename_tmux: true
resume_from_checkpoint: 0
save_best: false
save_codes:
- tasks
- modules
save_f0: false
save_gt: true
scheduler: warmup
seed: 1234
sigmoid_scale: false
sort_by_len: true
task_cls: tasks.speech_editing.spec_denoiser.SpeechDenoiserTask
tb_log_interval: 100
test_input_yaml: ''
test_num: 100
test_set_name: test
train_set_name: train
train_sets: ''
two_stage: true
val_check_interval: 2000
valid_infer_interval: 2000
valid_monitor_key: val_loss
valid_monitor_mode: min
valid_set_name: valid
warmup_updates: 8000
weight_decay: 0
word_dict_size: 40500
mask_ratio: 0.12
mask_type: 'alignment_aware'
training_mask_ratio: 0.80
infer_mask_ratio: 0.30
diff_decoder_type: 'wavenet'
latent_cond_type: 'add'
dilation_cycle_length: 1
residual_layers: 20
residual_channels: 256
keep_bins: 80
spec_min: [ ]
spec_max: [ ]
diff_loss_type: l1
max_beta: 0.06
## diffusion
timesteps: 8
timescale: 1
schedule_type: 'vpsde'
conv_use_pos: false
dec_dilations:
- 1
- 1
- 1
- 1
dec_ffn_kernel_size: 9
dec_inp_add_noise: false
dec_kernel_size: 5
dec_layers: 4
dec_post_net_kernel: 3
decoder_rnn_dim: 0
decoder_type: conv
detach_postflow_input: true
dropout: 0.0
dur_level: word
dur_predictor_kernel: 5
dur_predictor_layers: 3
enc_dec_norm: ln
enc_dilations:
- 1
- 1
- 1
- 1
enc_ffn_kernel_size: 5
enc_kernel_size: 5
enc_layers: 4
enc_post_net_kernel: 3
enc_pre_ln: true
enc_prenet: true
encoder_K: 8
encoder_type: conv
ffn_act: gelu
ffn_hidden_size: 768
fft_size: 1024
hidden_size: 192
hop_size: 256
latent_size: 16
layers_in_block: 2
num_heads: 2
mel_disc_hidden_size: 128
predictor_dropout: 0.2
predictor_grad: 0.1
predictor_hidden: -1
predictor_kernel: 5
predictor_layers: 5
prior_flow_hidden: 64
prior_flow_kernel_size: 3
prior_flow_n_blocks: 4
ref_norm_layer: bn
share_wn_layers: 4
text_encoder_postnet: true
use_cond_proj: false
use_gt_dur: false
use_gt_f0: false
use_latent_cond: false
use_pitch_embed: true
use_pos_embed: true
use_post_flow: true
use_prior_flow: true
use_spk_embed: true
use_spk_id: false
use_txt_cond: true
use_uv: true
mel_enc_layers: 4
f0_max: 600
f0_min: 80
fmax: 7600
fmin: 55
frames_multiple: 1
loud_norm: false
mel_vmax: 1.5
mel_vmin: -6
min_frames: 0
noise_scale: 0.8
win_size: 1024
pitch_extractor: parselmouth
pitch_type: frame
gen_dir_name: ''
infer: false
infer_post_glow: true
out_wav_norm: false
test_ids: [ ]
eval_mcd: False
kl_min: 0.0
kl_start_steps: 10000
lambda_commit: 0.25
lambda_energy: 0.1
lambda_f0: 1.0
lambda_kl: 1.0
lambda_mel_adv: 0.05
lambda_ph_dur: 0.1
lambda_sent_dur: 0.0
lambda_uv: 1.0
lambda_word_dur: 1.0
mel_losses: l1:0.5|ssim:0.5
vocoder: HifiGAN
vocoder_ckpt: pretrained/hifigan_hifitts
```
The following error occurs, and `model` cannot be found:

```
INFO - Setting up corpus information...
INFO - Loading corpus from source files...
100%|██████████| 1/1 [00:01<00:00, 1.01s/it]
INFO - Number of speakers in corpus: 1, average number of utterances per speaker: 1.0
INFO - Setting up training data...
INFO - Generating base features (mfcc)...
INFO - Generating MFCCs...
100%|██████████| 1/1 [00:01<00:00, 1.04s/it]
INFO - Calculating CMVN...
INFO - Compiling training graphs...
100%|██████████| 1/1 [00:01<00:00, 1.06s/it]
INFO - Performing first-pass alignment...
INFO - Generating alignments...
100%|██████████| 1/1 [00:01<00:00, 1.17s/it]
INFO - Calculating fMLLR for speaker adaptation...
100%|██████████| 1/1 [00:01<00:00, 1.14s/it]
INFO - Performing second-pass alignment...
INFO - Generating alignments...
100%|██████████| 1/1 [00:01<00:00, 1.19s/it]
INFO - Collecting word alignments from alignment lattices...
100%|██████████| 1/1 [00:01<00:00, 1.05s/it]
INFO - Collecting phone alignments from alignment lattices...
100%|██████████| 1/1 [00:01<00:00, 1.06s/it]
INFO - Exporting TextGrids to inference/audio/mfa_out...
100%|██████████| 1/1 [00:01<00:00, 1.07s/it]
INFO - Finished exporting TextGrids to inference/audio/mfa_out!
INFO - Done! Everything took 12.57474946975708 seconds
Generating forced alignments with mfa. Please wait for about several minutes.
mfa align -j 4 --clean inference/audio data/processed/libritts/mfa_dict.txt data/processed/libritts/mfa_model.zip inference/audio/mfa_out
| Unknow hparams: []
| Hparams chains: []
| Hparams:
accumulate_grad_batches: 1, add_word_pos: True, amp: False, audio_num_mel_bins: 80, audio_sample_rate: 22050,
binarizer_cls: data_gen.tts.base_binarizer.BaseBinarizer, binary_data_dir: data/binary/hifitts_wav, check_val_every_n_epoch: 10, clip_grad_norm: 1, clip_grad_value: 0,
conv_use_pos: False, debug: False, dec_dilations: [1, 1, 1, 1], dec_ffn_kernel_size: 9, dec_inp_add_noise: False,
dec_kernel_size: 5, dec_layers: 4, dec_post_net_kernel: 3, decoder_rnn_dim: 0, decoder_type: conv,
detach_postflow_input: True, diff_decoder_type: wavenet, diff_loss_type: l1, dilation_cycle_length: 1, dropout: 0.0,
ds_name: libritts, ds_workers: 2, dur_level: word, dur_predictor_kernel: 5, dur_predictor_layers: 3,
enc_dec_norm: ln, enc_dilations: [1, 1, 1, 1], enc_ffn_kernel_size: 5, enc_kernel_size: 5, enc_layers: 4,
enc_post_net_kernel: 3, enc_pre_ln: True, enc_prenet: True, encoder_K: 8, encoder_type: conv,
endless_ds: True, eval_max_batches: -1, eval_mcd: False, exp_name: spec_denoiser, f0_max: 600,
f0_min: 80, ffn_act: gelu, ffn_hidden_size: 768, fft_size: 1024, fmax: 7600,
fmin: 55, frames_multiple: 1, gen_dir_name: , hidden_size: 192, hop_size: 256,
infer: False, infer_mask_ratio: 0.3, infer_post_glow: True, keep_bins: 80, kl_min: 0.0,
kl_start_steps: 10000, lambda_commit: 0.25, lambda_energy: 0.1, lambda_f0: 1.0, lambda_kl: 1.0,
lambda_mel_adv: 0.05, lambda_ph_dur: 0.1, lambda_sent_dur: 0.0, lambda_uv: 1.0, lambda_word_dur: 1.0,
latent_cond_type: add, latent_size: 16, layers_in_block: 2, load_ckpt: , loud_norm: False,
lr: 0.0002, mask_ratio: 0.12, mask_type: alignment_aware, max_beta: 0.06, max_epochs: 1000,
max_frames: 1548, max_input_tokens: 1550, max_sentences: 16, max_tokens: 40000, max_updates: 2000000,
max_valid_sentences: 1, max_valid_tokens: 60000, mel_disc_hidden_size: 128, mel_enc_layers: 4, mel_losses: l1:0.5|ssim:0.5,
mel_vmax: 1.5, mel_vmin: -6, min_frames: 0, noise_scale: 0.8, num_ckpt_keep: 3,
num_heads: 2, num_sanity_val_steps: 5, num_spk: 1261, num_valid_plots: 10, optimizer_adam_beta1: 0.9,
optimizer_adam_beta2: 0.98, out_wav_norm: False, pitch_extractor: parselmouth, pitch_type: frame, posterior_start_steps: 0,
predictor_dropout: 0.2, predictor_grad: 0.1, predictor_hidden: -1, predictor_kernel: 5, predictor_layers: 5,
print_nan_grads: False, prior_flow_hidden: 64, prior_flow_kernel_size: 3, prior_flow_n_blocks: 4, processed_data_dir: data/processed/libritts,
profile_infer: False, raw_data_dir: data/raw/libritts, ref_norm_layer: bn, rename_tmux: True, residual_channels: 256,
residual_layers: 20, resume_from_checkpoint: 0, save_best: False, save_codes: ['tasks', 'modules'], save_f0: False,
save_gt: True, schedule_type: vpsde, scheduler: warmup, seed: 1234, share_wn_layers: 4,
sigmoid_scale: False, sort_by_len: True, spec_max: [], spec_min: [], task_cls: tasks.speech_editing.spec_denoiser.SpeechDenoiserTask,
tb_log_interval: 100, test_ids: [], test_input_yaml: , test_num: 100, test_set_name: test,
text_encoder_postnet: True, timescale: 1, timesteps: 8, train_set_name: train, train_sets: ,
training_mask_ratio: 0.8, two_stage: True, use_cond_proj: False, use_gt_dur: False, use_gt_f0: False,
use_latent_cond: False, use_pitch_embed: True, use_pos_embed: True, use_post_flow: True, use_prior_flow: True,
use_spk_embed: True, use_spk_id: False, use_txt_cond: True, use_uv: True, val_check_interval: 2000,
valid_infer_interval: 2000, valid_monitor_key: val_loss, valid_monitor_mode: min, valid_set_name: valid, validate: False,
vocoder: HifiGAN, vocoder_ckpt: pretrained/hifigan_hifitts, warmup_updates: 8000, weight_decay: 0, win_size: 1024,
word_dict_size: 40500, work_dir: checkpoints/spec_denoiser,
Traceback (most recent call last):
  File "inference/tts/spec_denoiser.py", line 352, in <module>
    SpecDenoiserInfer.example_run(dataset_info)
  File "inference/tts/spec_denoiser.py", line 255, in example_run
    infer_ins = cls(hp)
  File "inference/tts/spec_denoiser.py", line 42, in __init__
    self.model = self.build_model()
  File "inference/tts/spec_denoiser.py", line 58, in build_model
    load_ckpt(model, hparams['work_dir'], 'model')
  File "/home/yinhaowen/fluentspeech/utils/commons/ckpt_utils.py", line 41, in load_ckpt
    state_dict = state_dict[model_name]
KeyError: 'model'
```
I don't know what the value of `model` should be; maybe I should run preprocessing again to regenerate the `config.json`.
Is the pre-trained ckpt placed in `hparams['work_dir']`?
@Zain-Jiang Yes, I set `hparams['work_dir']` to `checkpoints/spec_denoiser/model_ckpt_steps_2168000.ckpt`, and `KeyError: 'model'` still occurs. When I debug it, the value of `model_name` is `'model'`, which does not contain `'.'`, so the code looks up `model_name` in the dictionary `state_dict`:

```python
if checkpoint is not None:
    state_dict = checkpoint["state_dict"]
    if len([k for k in state_dict.keys() if '.' in k]) > 0:
        state_dict = {k[len(model_name) + 1:]: v for k, v in state_dict.items()
                      if k.startswith(f'{model_name}.')}
    else:
        if '.' not in model_name:
            state_dict = state_dict[model_name]
        else:
            base_model_name = model_name.split('.')[0]
            rest_model_name = model_name[len(base_model_name) + 1:]
            state_dict = {
                k[len(rest_model_name) + 1:]: v for k, v in state_dict[base_model_name].items()
                if k.startswith(f'{rest_model_name}.')}
```
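To see what the checkpoint actually contains before this code slices it, a small dump helps (a debugging sketch; the checkpoint path is from my setup in this thread and may differ):

```python
# Sketch: dump the top-level layout of a checkpoint to see whether its
# state_dict keys carry a 'model.' prefix or a 'model' sub-dict; anything
# else will trip the lookup shown above.
import torch

ckpt = torch.load('checkpoints/spec_denoiser/model_ckpt_steps_2168000.ckpt',
                  map_location='cpu')
state_dict = ckpt['state_dict']
print(list(state_dict.keys())[:10])
```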
However, the key `'model'` is not included in `state_dict`. It also seems that setting `work_dir` in `config.yaml` doesn't work, because `work_dir` is automatically set to `checkpoints/` plus the `--exp_name` value in `hparams.py`.
@Linghuxc Haha, I found the reason. The FluentSpeech model we provide is in the following link: https://drive.google.com/drive/folders/1saqpWc4vrSgUZvRvHkf2QbwWSikMTyoo?usp=sharing, and the checkpoint's name is `model_ckpt_steps_568000.ckpt`. Running the inference code also needs the pre-trained HifiGAN vocoder provided at https://drive.google.com/drive/folders/1n_0tROauyiAYGUDbmoQ__eqyT_G4RvjN?usp=sharing, whose checkpoint's name is `model_ckpt_steps_2168000.ckpt`. Perhaps you have loaded the HifiGAN checkpoint as the FluentSpeech model. FluentSpeech is used to edit the mel-spectrogram, and HifiGAN is the vocoder we use to transform the mel-spectrogram into a waveform. I'm sure that if you load the models correctly, you can freely enjoy the beauty of speech editing.
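To avoid mixing the two up, a quick placement check like the following can help (a sketch; the directories come from `work_dir: checkpoints/spec_denoiser` and `vocoder_ckpt: pretrained/hifigan_hifitts` in the config above):

```python
# Sketch: each checkpoint must sit in the directory its role expects.
import os

expected = {
    'checkpoints/spec_denoiser/model_ckpt_steps_568000.ckpt': 'FluentSpeech (mel-spec editor)',
    'pretrained/hifigan_hifitts/model_ckpt_steps_2168000.ckpt': 'HifiGAN (vocoder)',
}
for path, role in expected.items():
    status = 'ok' if os.path.exists(path) else 'MISSING'
    print(f'{role}: {path} -> {status}')
```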
@Zain-Jiang Oh, yes, I loaded the models incorrectly! Now I can successfully generate the edited audio. Thank you very much for your help, and I am very sorry for taking up your time through my negligence!
Hi! I'm getting the following error when running `python inference/tts/spec_denoiser.py --exp_name spec_denoiser`. Where can I find the required files? I'm trying to run the basic pre-trained inference of FluentSpeech.