RVC-Boss / GPT-SoVITS

1 min voice data can also be used to train a good TTS model! (few shot voice cloning)
MIT License
35.83k stars · 4.09k forks

GPT 微调: ZeroDivisionError: division by zero #57

Closed · selfboot closed this 9 months ago

selfboot commented 10 months ago

GPT fine-tuning errors out, but step 1Ba (SoVITS training) works fine. The GPT training error:

"/data/home/miniconda3/envs/GPTSoVits/bin/python" GPT_SoVITS/s1_train.py --config_file "TEMP/tmp_s1.yaml" 
Seed set to 1234
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
<All keys matched successfully>
ckpt_path: None
[rank: 0] Seed set to 1234
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

semantic_data_len: 0
phoneme_data_len: 5
Empty DataFrame
Columns: [item_name, semantic_audio]
Index: []
Traceback (most recent call last):
  File "/data/home/GPT-SoVITS/GPT_SoVITS/s1_train.py", line 171, in <module>
    main(args)
  File "/data/home/GPT-SoVITS/GPT_SoVITS/s1_train.py", line 147, in main
    trainer.fit(model, data_module, ckpt_path=ckpt_path)
  File "/data/home/miniconda3/envs/GPTSoVits/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/data/home/miniconda3/envs/GPTSoVits/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/data/home/miniconda3/envs/GPTSoVits/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "/data/home/miniconda3/envs/GPTSoVits/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/data/home/miniconda3/envs/GPTSoVits/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 950, in _run
    call._call_setup_hook(self)  # allow user to setup lightning_module in accelerator environment
  File "/data/home/miniconda3/envs/GPTSoVits/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 92, in _call_setup_hook
    _call_lightning_datamodule_hook(trainer, "setup", stage=fn)
  File "/data/home/miniconda3/envs/GPTSoVits/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 179, in _call_lightning_datamodule_hook
    return fn(*args, **kwargs)
  File "/data/home/GPT-SoVITS/GPT_SoVITS/AR/data/data_module.py", line 29, in setup
    self._train_dataset = Text2SemanticDataset(
  File "/data/home/GPT-SoVITS/GPT_SoVITS/AR/data/dataset.py", line 107, in __init__
    self.init_batch()
  File "/data/home/GPT-SoVITS/GPT_SoVITS/AR/data/dataset.py", line 187, in init_batch
    for _ in range(max(2, int(min_num / leng))):
ZeroDivisionError: division by zero

It feels like something was already off in the previous step (dataset formatting, once SSL extraction was started); I'm not sure whether that's related.

"/data/home/miniconda3/envs/GPTSoVits/bin/python" GPT_SoVITS/prepare_datasets/2-get-hubert-wav32k.py
"/data/home/miniconda3/envs/GPTSoVits/bin/python" GPT_SoVITS/prepare_datasets/2-get-hubert-wav32k.py
Some weights of the model checkpoint at GPT_SoVITS/pretrained_models/chinese-hubert-base were not used when initializing HubertModel: ['encoder.pos_conv_embed.conv.weight_g', 'encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing HubertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing HubertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at GPT_SoVITS/pretrained_models/chinese-hubert-base were not used when initializing HubertModel: ['encoder.pos_conv_embed.conv.weight_g', 'encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing HubertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing HubertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of HubertModel were not initialized from the model checkpoint at GPT_SoVITS/pretrained_models/chinese-hubert-base and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'encoder.pos_conv_embed.conv.parametrizations.weight.original0']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of HubertModel were not initialized from the model checkpoint at GPT_SoVITS/pretrained_models/chinese-hubert-base and are newly initialized: ['encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'encoder.pos_conv_embed.conv.parametrizations.weight.original0']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
"/data/home/miniconda3/envs/GPTSoVits/bin/python" GPT_SoVITS/prepare_datasets/3-get-semantic.py
"/data/home/miniconda3/envs/GPTSoVits/bin/python" GPT_SoVITS/prepare_datasets/3-get-semantic.py
/data/home/miniconda3/envs/GPTSoVits/lib/python3.9/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
/data/home/miniconda3/envs/GPTSoVits/lib/python3.9/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
selfboot commented 10 months ago

Looked at the code:

    def __init__(
        self,
        phoneme_path: str,
        semantic_path: str,
        max_sample: int = None,
        max_sec: int = 100,
        pad_val: int = 1024,
        # min value of phoneme/sec
        min_ps_ratio: int = 3,
        # max value of phoneme/sec
        max_ps_ratio: int = 25,
    ) -> None:
        super().__init__()

        self.semantic_data = pd.read_csv(
            semantic_path, delimiter="\t", encoding="utf-8"
        )

The data read here is empty; the earlier log in fact also printed semantic_data_len: 0.
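
A quick way to see what the dataset code actually reads (a minimal sketch, assuming the same read_csv call as dataset.py and the TSV path from my config):

    import pandas as pd

    # Read the TSV exactly the way dataset.py does and count the rows.
    semantic_path = "logs/first/6-name2semantic.tsv"  # path from TEMP/tmp_s1.yaml
    semantic_data = pd.read_csv(semantic_path, delimiter="\t", encoding="utf-8")
    print(len(semantic_data))  # prints 0 here, matching semantic_data_len: 0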

selfboot commented 10 months ago

Here semantic_path is read from the yml config file TEMP/tmp_s1.yaml, and should be:

[screenshot: the semantic_path entry in TEMP/tmp_s1.yaml, pointing at logs/first/6-name2semantic.tsv]

And checking it, the file is indeed empty:

$ cat logs/first/6-name2semantic.tsv
item_name       semantic_audio
selfboot commented 10 months ago

After cleaning out the preprocessing outputs of the training task and regenerating the dataset-formatting files, the files now look correct.

(GPTSoVits) ➜  first git:(main) ✗ ls -alh
total 36K
drwxr-xr-x 6 zhao users 4.0K Jan 19 10:40 .
drwxr-xr-x 3 zhao users 4.0K Jan 19 10:40 ..
-rw-r--r-- 1 zhao users 2.3K Jan 19 10:40 2-name2text.txt
drwxr-xr-x 2 zhao users 4.0K Jan 19 10:40 3-bert
drwxr-xr-x 2 zhao users 4.0K Jan 19 10:40 4-cnhubert
drwxr-xr-x 2 zhao users 4.0K Jan 19 10:40 5-wav32k
-rw-r--r-- 1 zhao users 4.1K Jan 19 10:40 6-name2semantic.tsv
drwxr-xr-x 5 zhao users 4.0K Jan 19 10:40 logs_s1

As long as 6-name2semantic.tsv is not empty and its contents are correct, GPT fine-tuning works.

c4fun commented 10 months ago

I hit the same problem when training GPT, but my semantic_data_len is not 0. It also seems to depend on the dataset: some datasets trigger it and some don't.

"/opt/anaconda3/envs/GPTSoVITS/bin/python" GPT_SoVITS/s1_train.py --config_file "TEMP/tmp_s1.yaml" 
Seed set to 1234
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
<All keys matched successfully>
ckpt_path: None
[rank: 0] Seed set to 1234
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

semantic_data_len: 238
phoneme_data_len: 239
                                             item_name                                     semantic_audio
0    vocal_milki-snowwhite.m4a_10.wav_15464960_1561...  913 171 657 773 281 766 30 639 882 973 758 991...
1    vocal_milki-snowwhite.m4a_10.wav_16661120_1682...  913 140 714 496 180 129 471 550 783 45 384 66 ...
2    vocal_milki-snowwhite.m4a_10.wav_41444480_4157...  520 280 280 105 271 41 65 509 773 178 247 251 ...
3    vocal_milki-snowwhite.m4a_10.wav_21096640_2123...  520 105 280 280 486 486 486 486 536 609 17 8 8...
4    vocal_milki-snowwhite.m4a_10.wav_26919040_2700...  8 995 595 12 344 187 187 187 46 964 11 777 602...
..                                                 ...                                                ...
233  vocal_milki-snowwhite.m4a_10.wav_28424960_2846...  208 17 515 59 591 994 318 490 312 569 55 595 1...
234  vocal_milki-snowwhite.m4a_10.wav_13336000_1349...  520 105 280 280 280 280 280 280 105 54 17 5 87...
235  vocal_milki-snowwhite.m4a_10.wav_12366080_1248...  54 360 140 570 14 605 74 733 796 550 467 364 8...
236  vocal_milki-snowwhite.m4a_10.wav_32966720_3316...  520 105 271 17 32 187 4 857 751 12 633 749 376...
237  vocal_milki-snowwhite.m4a_10.wav_22626240_2276...  520 280 280 536 1012 524 1002 718 621 211 553 ...

[238 rows x 2 columns]
Traceback (most recent call last):
  File "/home/xxx/code/github.com/temp/GPT-SoVITS/GPT_SoVITS/s1_train.py", line 171, in <module>
    main(args)
  File "/home/xxx/code/github.com/temp/GPT-SoVITS/GPT_SoVITS/s1_train.py", line 147, in main
    trainer.fit(model, data_module, ckpt_path=ckpt_path)
  File "/opt/anaconda3/envs/GPTSoVITS/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/opt/anaconda3/envs/GPTSoVITS/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/opt/anaconda3/envs/GPTSoVITS/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "/opt/anaconda3/envs/GPTSoVITS/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/anaconda3/envs/GPTSoVITS/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 950, in _run
    call._call_setup_hook(self)  # allow user to setup lightning_module in accelerator environment
  File "/opt/anaconda3/envs/GPTSoVITS/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 92, in _call_setup_hook
    _call_lightning_datamodule_hook(trainer, "setup", stage=fn)
  File "/opt/anaconda3/envs/GPTSoVITS/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 179, in _call_lightning_datamodule_hook
    return fn(*args, **kwargs)
  File "/home/xxx/code/github.com/temp/GPT-SoVITS/GPT_SoVITS/AR/data/data_module.py", line 29, in setup
    self._train_dataset = Text2SemanticDataset(
  File "/home/xxx/code/github.com/temp/GPT-SoVITS/GPT_SoVITS/AR/data/dataset.py", line 107, in __init__
    self.init_batch()
  File "/home/xxx/code/github.com/temp/GPT-SoVITS/GPT_SoVITS/AR/data/dataset.py", line 187, in init_batch
    for _ in range(max(2, int(min_num / leng))):
ZeroDivisionError: division by zero
upbit commented 10 months ago
for _ in range(max(2, int(min_num / leng))):

ZeroDivisionError: division by zero

Same situation here: 6-name2semantic.tsv has content but training still errors. From a quick look this may be related to the pandas version (mine is 2.1.4).

After the file is read here, self.semantic_data.iloc[i, 0] actually returns a value column (the audio file name has in fact become the index):

        self.semantic_data = pd.read_csv(
            semantic_path, delimiter="\t", encoding="utf-8"
        )

Fix: call reset_index() on self.semantic_data once.

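Concretely, the change I have in mind looks like this (a sketch against the __init__ quoted above, not a tested patch):

        self.semantic_data = pd.read_csv(
            semantic_path, delimiter="\t", encoding="utf-8"
        )
        # With pandas 2.1.4 the first column came back as the DataFrame index,
        # so iloc[i, 0] returned the semantic tokens rather than item_name.
        # reset_index() turns the index back into a regular first column.
        self.semantic_data = self.semantic_data.reset_index()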

P.S. If anyone does not hit this error, could you share your pandas version and 6-name2semantic.tsv? I'd like to confirm the change works in general and then open an MR.

zZxztxZz commented 10 months ago

I hit this problem while training an English model. Changing |ZH| to |en| in the text annotation file and re-running dataset formatting solved it; not sure whether this applies more broadly.

bensenx commented 10 months ago

I hit this problem while training an English model. Changing |ZH| to |en| in the text annotation file and re-running dataset formatting solved it; not sure whether this applies more broadly.

I had the same problem as you; after fixing the annotations it worked.

FRO4TEN commented 10 months ago

Indeed, that is correct. One additional tip: after making the necessary annotations, please refrain from directly clicking the "一键三连" (one-click, run-all-three-steps) button in the formatting interface. Instead, run the three steps separately; otherwise the reformatting will not be thorough and errors may still occur.

daxige9527 commented 10 months ago

I have this problem too, except both my VITS training and my GPT training fail.

"runtime\python" GPT_SoVITS/s2_train.py --config "TEMP/tmp_s2.json" INFO:ll:{'train': {'log_interval': 100, 'eval_interval': 500, 'seed': 1234, 'epochs': 12, 'learning_rate': 0.0001, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 8, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 20480, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0, 'text_low_lr_rate': 0.4, 'pretrained_s2G': 'GPT_SoVITS/pretrained_models/s2G488k.pth', 'pretrained_s2D': 'GPT_SoVITS/pretrained_models/s2D488k.pth', 'if_save_latest': True, 'if_save_every_weights': True, 'save_every_epoch': 4, 'gpu_numbers': '0'}, 'data': {'max_wav_value': 32768.0, 'sampling_rate': 32000, 'filter_length': 2048, 'hop_length': 640, 'win_length': 2048, 'n_mel_channels': 128, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 300, 'cleaned_text': True, 'exp_dir': 'logs/ll'}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [10, 8, 2, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 8, 2, 2], 'n_layers_q': 3, 'use_spectral_norm': False, 'gin_channels': 512, 'semantic_frame_rate': '25hz', 'freeze_quantizer': True}, 's2_ckpt_dir': 'logs/ll', 'content_module': 'cnhubert', 'save_weight_dir': 'SoVITS_weights', 'name': 'll', 'pretrain': None, 'resume_step': None} INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0 INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrierkey:1 with 1 nodes. logs/ll/2-name2text.txt Traceback (most recent call last): File "F:\GPT-SoVITS\GPT-SoVITS\GPT_SoVITS\s2train.py", line 402, in main() File "F:\GPT-SoVITS\GPT-SoVITS\GPT_SoVITS\s2_train.py", line 53, in main mp.spawn(run, nprocs=n_gpus, args=(ngpus, hps,)) File "F:\GPT-SoVITS\GPT-SoVITS\runtime\lib\site-packages\torch\multiprocessing\spawn.py", line 239, in spawn return start_processes(fn, args, nprocs, join, daemon, startmethod='spawn') File "F:\GPT-SoVITS\GPT-SoVITS\runtime\lib\site-packages\torch\multiprocessing\spawn.py", line 197, in startprocesses while not context.join(): File "F:\GPT-SoVITS\GPT-SoVITS\runtime\lib\site-packages\torch\multiprocessing\spawn.py", line 160, in join raise ProcessRaisedException(msg, error_index, failed_process.pid) torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error: Traceback (most recent call last): File "F:\GPT-SoVITS_\GPT-SoVITS\runtime\lib\site-packages\torch\multiprocessing\spawn.py", line 69, in wrap fn(i, *args) File "F:\GPT-SoVITS\GPT-SoVITS\GPT_SoVITS\s2_train.py", line 69, in run traindataset = TextAudioSpeakerLoader(hps.data)######## File "F:\GPT-SoVITS\GPT-SoVITS\GPT_SoVITS\module\datautils.py", line 55, in init for in range(max(2, int(min_num / leng))): ZeroDivisionError: division by zero
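
Both failures (the s1 one above and this s2 one) bottom out in the same pattern: int(min_num / leng) with leng == 0, meaning zero usable items survived loading. A guard with a readable message would look roughly like this (a hypothetical sketch; loaded_items stands in for whatever list the loader actually builds):

    # Hypothetical guard before the dataset-duplication loop that crashes.
    leng = len(loaded_items)  # loaded_items: whatever survived parsing/filtering
    if leng == 0:
        raise ValueError(
            "no usable training items were loaded; check the .list "
            "annotations and the dataset-formatting outputs"
        )
    for _ in range(max(2, int(min_num / leng))):
        ...  # the loop that currently raises ZeroDivisionError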

daxige9527 commented 10 months ago

I hit this problem while training an English model. Changing |ZH| to |en| in the text annotation file and re-running dataset formatting solved it; not sure whether this applies more broadly.

I had the same problem as you; after fixing the annotations it worked.

Could you explain in more detail? Thanks.

lakerwe commented 10 months ago

I hit this problem while training an English model. Changing |ZH| to |en| in the text annotation file and re-running dataset formatting solved it; not sure whether this applies more broadly.

Same here; after changing the annotations it worked.

daxige9527 commented 10 months ago

Do you mean the annotations inside the .list file?

zZxztxZz commented 10 months ago

Do you mean the annotations inside the .list file?

Replace |ZH| with |en| on every line of the annotations in the .list file. That said, newer versions seem to be case-insensitive now, so I don't know whether this solves all problems of this kind.
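
For anyone with many lines to fix, a throwaway sketch of the rewrite (assuming the wav_path|speaker_name|language|text layout from the project README; both file paths are hypothetical):

    # Rewrite |ZH| language tags to |en| in a GPT-SoVITS .list annotation file.
    src = "output/asr_opt/denoise_opt.list"     # hypothetical input path
    dst = "output/asr_opt/denoise_opt_en.list"  # hypothetical output path
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            parts = line.rstrip("\n").split("|")
            # Only the language field (index 2) changes; this is for audio that
            # is actually English but was annotated as ZH.
            if len(parts) == 4 and parts[2].upper() == "ZH":
                parts[2] = "en"
            fout.write("|".join(parts) + "\n")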

daxige9527 commented 10 months ago

Yes, exactly, that was the problem! Thanks to everyone, it's solved now.

Liu-yixi commented 10 months ago

Yes, exactly, that was the problem! Thanks to everyone, it's solved now.

Were you training Chinese or English? I changed my Chinese tags to lowercase and still get the division-by-zero error.

daxige9527 commented 10 months ago

Check your .list file carefully; most errors during training come from problems with it.

Liu-yixi commented 10 months ago

Check your .list file carefully; most errors during training come from problems with it.

Could you attach a screenshot of a .list file that trains successfully? Just comparing against the example on the project page, I can't really see what's wrong with mine.

FRO4TEN commented 9 months ago

@Liu-yixi this one may help:

Indeed, that is correct. One additional tip: after making the necessary annotations, please refrain from directly clicking the "一键三连" (one-click, run-all-three-steps) button in the formatting interface. Instead, run the three steps separately; otherwise the reformatting will not be thorough and errors may still occur.

RVC-Boss commented 9 months ago

This problem seems to be resolved; pay attention to the format of the training-set .list file.

landerhe commented 4 months ago

My problem was with a Chinese training set. In the end I stopped using 一键三连 (the one-click pipeline), clicked the individual steps one by one a few times over, and inexplicably it started working.