Cannot use PaddleSpeech for voice cloning: ValueError: (InvalidArgument) Deserialize to tensor failed, maybe the loaded file is not a paddle model (expected file format: 0, but 589505315 found).

I wanted to test the PaddleSpeech repo to clone a voice. My target text is English (is that possible?). Here are the steps I've taken:

- Cloned the repo (`/mnt/msd/users/arnav/` is my workspace) and installed the dependencies.
- Changed directory into `PaddleSpeech/examples/aishell3/vc1/`.
- Downloaded and unzipped `fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip` and `pwg_aishell3_ckpt_0.5.zip`.
- Moved them into the appropriate folders; this is how my file tree looks.
- `saloni` is the voice I want to clone.
- Modified `voice_cloning.sh` as follows:
```
#!/bin/bash

config_path=$1
train_output_path=$2
ckpt_name=$3
ge2e_params_path=$4
ref_audio_dir=$5

python3 /mnt/msd/users/arnav/PaddleSpeech/paddlespeech/t2s/exps/voice_cloning.py \
    --am=fastspeech2_aishell3 \
    --am_config=${config_path} \
    --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
    --am_stat=dump/train/speech_stats.npy \
    --voc=pwgan_aishell3 \
    --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
    --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
    --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
    --ge2e_params_path=${ge2e_params_path} \
    --text="Hello my name is saloni" \
    --input-dir=${ref_audio_dir} \
    --output-dir=${train_output_path}/vc_syn \
    --phones-dict=dump/phone_id_map.txt
```
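The script assembles the `--am_ckpt` value from two of its arguments, so it can help to check what that value resolves to before launching the run. The snippet below is a minimal sketch and not part of PaddleSpeech: `build_am_ckpt` and `describe_path` are hypothetical helper names, and the code only mirrors the `${train_output_path}/checkpoints/${ckpt_name}` expression from the script.

```python
import os


def build_am_ckpt(train_output_path: str, ckpt_name: str) -> str:
    """Mirror the shell expression ${train_output_path}/checkpoints/${ckpt_name}."""
    return os.path.join(train_output_path, "checkpoints", ckpt_name)


def describe_path(path: str) -> str:
    """Classify a path as 'file', 'directory', or 'missing'."""
    if os.path.isfile(path):
        return "file"
    if os.path.isdir(path):
        return "directory"
    return "missing"


if __name__ == "__main__":
    # Arguments 2 and 3 of the run command used in this report.
    am_ckpt = build_am_ckpt(
        "/mnt/msd/users/arnav/PaddleSpeech/examples/aishell3/vc1/pretrained",
        "fastspeech2_nosil_aishell3_vc1_ckpt_0.5",
    )
    print(am_ckpt, "->", describe_path(am_ckpt))
```

If this prints `directory`, the third script argument names a folder rather than a single checkpoint file. For comparison, the `--voc_ckpt` value in the same script points at a `.pdz` file; whether `--am_ckpt` needs the same kind of path is something to confirm against the example's README.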

- I'm running the code using:
`CUDA_VISIBLE_DEVICES=0 ./voice_cloning.sh /mnt/msd/users/arnav/PaddleSpeech/examples/aishell3/vc1/conf/default.yaml /mnt/msd/users/arnav/PaddleSpeech/examples/aishell3/vc1/pretrained  fastspeech2_nosil_aishell3_vc1_ckpt_0.5 /mnt/msd/users/arnav/PaddleSpeech/examples/aishell3/vc1/local/ge2e_ckpt_0.3/step-3000000.pdparams /mnt/msd/users/arnav/PaddleSpeech/examples/aishell3/vc1/saloni`
- This is the output I'm getting:

/bin/bash: /home/newzera/anaconda3/envs/paddlespeech/lib/libtinfo.so.6: no version information available (required by /bin/bash)
/home/newzera/anaconda3/envs/paddlespeech/lib/python3.7/site-packages/librosa/core/constantq.py:1059: DeprecationWarning: np.complex is a deprecated alias for the builtin complex. To silence this warning, use complex by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.complex128 here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  dtype=np.complex,
/home/newzera/anaconda3/envs/paddlespeech/lib/python3.7/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
========Args========
am: fastspeech2_aishell3
am_ckpt: /mnt/msd/users/arnav/PaddleSpeech/examples/aishell3/vc1/pretrained/checkpoints/fastspeech2_nosil_aishell3_vc1_ckpt_0.5
am_config: /mnt/msd/users/arnav/PaddleSpeech/examples/aishell3/vc1/conf/default.yaml
am_stat: dump/train/speech_stats.npy
ge2e_params_path: /mnt/msd/users/arnav/PaddleSpeech/examples/aishell3/vc1/local/ge2e_ckpt_0.3/step-3000000.pdparams
input_dir: /mnt/msd/users/arnav/PaddleSpeech/examples/aishell3/vc1/saloni
ngpu: 1
output_dir: /mnt/msd/users/arnav/PaddleSpeech/examples/aishell3/vc1/pretrained/vc_syn
phones_dict: dump/phone_id_map.txt
text: Hello my name is saloni
use_ecapa: false
voc: pwgan_aishell3
voc_ckpt: pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz
voc_config: pwg_aishell3_ckpt_0.5/default.yaml
voc_stat: pwg_aishell3_ckpt_0.5/feats_stats.npy

========Config========
batch_size: 64
f0max: 400
f0min: 80
fmax: 7600
fmin: 80
fs: 24000
max_epoch: 200
model:
  adim: 384
  aheads: 2
  decoder_normalize_before: True
  dlayers: 4
  dunits: 1536
  duration_predictor_chans: 256
  duration_predictor_kernel_size: 3
  duration_predictor_layers: 2
  elayers: 4
  encoder_normalize_before: True
  energy_embed_dropout: 0.0
  energy_embed_kernel_size: 1
  energy_predictor_chans: 256
  energy_predictor_dropout: 0.5
  energy_predictor_kernel_size: 3
  energy_predictor_layers: 2
  eunits: 1536
  init_dec_alpha: 1.0
  init_enc_alpha: 1.0
  init_type: xavier_uniform
  pitch_embed_dropout: 0.0
  pitch_embed_kernel_size: 1
  pitch_predictor_chans: 256
  pitch_predictor_dropout: 0.5
  pitch_predictor_kernel_size: 5
  pitch_predictor_layers: 5
  positionwise_conv_kernel_size: 3
  positionwise_layer_type: conv1d
  postnet_chans: 256
  postnet_filts: 5
  postnet_layers: 5
  reduction_factor: 1
  spk_embed_dim: 256
  spk_embed_integration_type: concat
  stop_gradient_from_energy_predictor: False
  stop_gradient_from_pitch_predictor: True
  transformer_dec_attn_dropout_rate: 0.2
  transformer_dec_dropout_rate: 0.2
  transformer_dec_positional_dropout_rate: 0.2
  transformer_enc_attn_dropout_rate: 0.2
  transformer_enc_dropout_rate: 0.2
  transformer_enc_positional_dropout_rate: 0.2
  use_scaled_pos_enc: True
n_fft: 2048
n_mels: 80
n_shift: 300
num_snapshots: 5
num_workers: 2
optimizer:
  learning_rate: 0.001
  optim: adam
seed: 10086
updater:
  use_masking: True
win_length: 1200
window: hann
allow_cache: True
batch_max_steps: 24000
batch_size: 8
discriminator_grad_norm: 1
discriminator_optimizer_params:
  epsilon: 1e-06
  weight_decay: 0.0
discriminator_params:
  bias: True
  conv_channels: 64
  in_channels: 1
  kernel_size: 3
  layers: 10
  nonlinear_activation: LeakyReLU
  nonlinear_activation_params:
    negative_slope: 0.2
  out_channels: 1
  use_weight_norm: True
discriminator_scheduler_params:
  gamma: 0.5
  learning_rate: 5e-05
  step_size: 200000
discriminator_train_start_steps: 100000
eval_interval_steps: 1000
fmax: 7600
fmin: 80
fs: 24000
generator_grad_norm: 10
generator_optimizer_params:
  epsilon: 1e-06
  weight_decay: 0.0
generator_params:
  aux_channels: 80
  aux_context_window: 2
  dropout: 0.0
  gate_channels: 128
  in_channels: 1
  kernel_size: 3
  layers: 30
  out_channels: 1
  residual_channels: 64
  skip_channels: 64
  stacks: 3
  upsample_scales: [4, 5, 3, 5]
  use_weight_norm: True
generator_scheduler_params:
  gamma: 0.5
  learning_rate: 0.0001
  step_size: 200000
lambda_adv: 4.0
n_fft: 2048
n_mels: 80
n_shift: 300
num_save_intermediate_results: 4
num_snapshots: 10
num_workers: 4
pin_memory: True
remove_short_samples: True
save_interval_steps: 5000
seed: 42
stft_loss_params:
  fft_sizes: [1024, 2048, 512]
  hop_sizes: [120, 240, 50]
  win_lengths: [600, 1200, 240]
  window: hann
train_max_steps: 1000000
win_length: 1200
window: hann
Audio Processor Done!
W0602 10:52:56.641338 144178 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 12.0, Runtime API Version: 10.2
W0602 10:52:56.642062 144178 gpu_resources.cc:91] device: 0, cuDNN Version: 8.8.
GE2E Done!
[2023-06-02 10:53:00,516] [ INFO] - Already cached /home/arnav-newzera/.paddlenlp/models/bert-base-chinese/bert-base-chinese-vocab.txt
[2023-06-02 10:53:00,524] [ INFO] - tokenizer config file saved in /home/arnav-newzera/.paddlenlp/models/bert-base-chinese/tokenizer_config.json
[2023-06-02 10:53:00,524] [ INFO] - Special tokens file saved in /home/arnav-newzera/.paddlenlp/models/bert-base-chinese/special_tokens_map.json
frontend done!
Building prefix dict from the default dictionary ...
[2023-06-02 10:53:00] [DEBUG] [__init__.py:113] Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
[2023-06-02 10:53:00] [DEBUG] [__init__.py:133] Loading model from cache /tmp/jieba.cache
Loading model cost 0.499 seconds.
[2023-06-02 10:53:01] [DEBUG] [__init__.py:165] Loading model cost 0.499 seconds.
Prefix dict has been built successfully.
[2023-06-02 10:53:01] [DEBUG] [__init__.py:166] Prefix dict has been built successfully.
Traceback (most recent call last):
  File "/mnt/msd/users/arnav/PaddleSpeech/paddlespeech/t2s/exps/voice_cloning.py", line 233, in <module>
    main()
  File "/mnt/msd/users/arnav/PaddleSpeech/paddlespeech/t2s/exps/voice_cloning.py", line 229, in main
    voice_cloning(args)
  File "/mnt/msd/users/arnav/PaddleSpeech/paddlespeech/t2s/exps/voice_cloning.py", line 106, in voice_cloning
    phones_dict=args.phones_dict)
  File "/home/newzera/anaconda3/envs/paddlespeech/lib/python3.7/site-packages/paddlespeech/t2s/exps/syn_utils.py", line 371, in get_am_inference
    am.set_state_dict(paddle.load(am_ckpt)["main_params"])
  File "/home/newzera/anaconda3/envs/paddlespeech/lib/python3.7/site-packages/paddle/framework/io.py", line 1103, in load
    load_result = _legacy_load(path, *configs)
  File "/home/newzera/anaconda3/envs/paddlespeech/lib/python3.7/site-packages/paddle/framework/io.py", line 1150, in _legacy_load
    load_result = _load_state_dict_from_save_params(model_path)
  File "/home/newzera/anaconda3/envs/paddlespeech/lib/python3.7/site-packages/paddle/framework/io.py", line 147, in _load_state_dict_from_save_params
    attrs={'file_path': os.path.join(model_path, name)},
  File "/home/newzera/anaconda3/envs/paddlespeech/lib/python3.7/site-packages/paddle/fluid/dygraph/tracer.py", line 314, in trace_op
    stop_gradient, inplace_map)
  File "/home/newzera/anaconda3/envs/paddlespeech/lib/python3.7/site-packages/paddle/fluid/dygraph/tracer.py", line 176, in eager_legacy_trace_op
    returns = function_ptr(arg_list, *attrs_list)
ValueError: (InvalidArgument) Deserialize to tensor failed, maybe the loaded file is not a paddle model(expected file format: 0, but 589505315 found).
  [Hint: Expected version == 0U, but received version:589505315 != 0U:0.] (at /paddle/paddle/phi/core/serialization.cc:106)
  [operator < load > error]
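
One way to read the number in the error message is to look at its raw bytes. The sketch below assumes (the report does not confirm this) that the loader interprets the first four bytes of a file as an unsigned 32-bit version field. `version_to_bytes` and `files_starting_with` are hypothetical helpers written for illustration; they are not Paddle APIs.

```python
import os
import struct

REPORTED_VERSION = 589505315  # value quoted in the ValueError


def version_to_bytes(version: int) -> bytes:
    """Pack the value as a little-endian uint32 (assumed field layout)."""
    return struct.pack("<I", version)


def files_starting_with(directory: str, prefix: bytes) -> list:
    """Return sorted names of regular files in `directory` that begin with `prefix`."""
    matches = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        with open(path, "rb") as handle:
            if handle.read(len(prefix)) == prefix:
                matches.append(name)
    return matches


if __name__ == "__main__":
    raw = version_to_bytes(REPORTED_VERSION)
    print(hex(REPORTED_VERSION), raw)  # 0x23232323 b'####'
```

Under that assumption the value corresponds to four `#` characters, which is how a text file that begins with a comment line would look, rather than a serialized tensor. Calling `files_starting_with` on the folder passed as `--am_ckpt` with the prefix `b"####"` would show which file, if any, matches.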



Can anyone tell me what I did wrong, or how to resolve the error? Please be forgiving, as I'm new to this.
