YYuX-1145 / Bert-VITS2-Integration-package

vits2 backbone with bert
https://www.bilibili.com/video/BV13p4y1d7v9
GNU Affero General Public License v3.0
332 stars 30 forks

Training errors out as soon as it starts (log below); I've tried many times and can't solve it. Please help me figure out a fix #4

Closed sanshaoyeyang closed 1 year ago

sanshaoyeyang commented 1 year ago

WARNING:OUTPUT_MODEL:D:\AI\Bert-VITS2-Integration-Package is not a git repository, therefore hash value comparison will be ignored.
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
skipped: 1 , total: 500
skipped: 0 , total: 4
Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
256 2 256 2 256 2 256 2 256 2
./logs./OUTPUT_MODEL\DUR_0.pth
load
INFO:OUTPUT_MODEL:Loaded checkpoint './logs./OUTPUT_MODEL\DUR_0.pth' (iteration 694)
./logs./OUTPUT_MODEL\G_0.pth
error, emb_g.weight is not in the checkpoint
load
INFO:OUTPUT_MODEL:Loaded checkpoint './logs./OUTPUT_MODEL\G_0.pth' (iteration 0)
./logs./OUTPUT_MODEL\D_0.pth
load
INFO:OUTPUT_MODEL:Loaded checkpoint './logs./OUTPUT_MODEL\D_0.pth' (iteration 0)
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x000002FBBA6BA790>
Traceback (most recent call last):
  File "D:\AI\Bert-VITS2-Integration-Package\venv\lib\site-packages\torch\utils\data\dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "D:\AI\Bert-VITS2-Integration-Package\venv\lib\site-packages\torch\utils\data\dataloader.py", line 1397, in _shutdown_workers
    if not self._shutdown:
AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute '_shutdown'
Traceback (most recent call last):
  File "train_first.py", line 409, in <module>
    main()
  File "train_first.py", line 58, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "D:\AI\Bert-VITS2-Integration-Package\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "D:\AI\Bert-VITS2-Integration-Package\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 198, in start_processes
    while not context.join():
  File "D:\AI\Bert-VITS2-Integration-Package\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "D:\AI\Bert-VITS2-Integration-Package\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 69, in _wrap
    fn(i, *args)
  File "D:\AI\Bert-VITS2-Integration-Package\train_first.py", line 191, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d, net_dur_disc], [optim_g, optim_d, optim_dur_disc], [scheduler_g, scheduler_d, scheduler_dur_disc], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "D:\AI\Bert-VITS2-Integration-Package\train_first.py", line 215, in train_and_evaluate
    for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths, speakers, tone, language, bert) in tqdm(enumerate(train_loader)):
  File "D:\AI\Bert-VITS2-Integration-Package\venv\lib\site-packages\torch\utils\data\dataloader.py", line 430, in __iter__
    self._iterator = self._get_iterator()
  File "D:\AI\Bert-VITS2-Integration-Package\venv\lib\site-packages\torch\utils\data\dataloader.py", line 381, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "D:\AI\Bert-VITS2-Integration-Package\venv\lib\site-packages\torch\utils\data\dataloader.py", line 988, in __init__
    super(_MultiProcessingDataLoaderIter, self).__init__(loader)
  File "D:\AI\Bert-VITS2-Integration-Package\venv\lib\site-packages\torch\utils\data\dataloader.py", line 598, in __init__
    self._sampler_iter = iter(self._index_sampler)
  File "D:\AI\Bert-VITS2-Integration-Package\data_utils.py", line 309, in __iter__
    ids_bucket = ids_bucket + ids_bucket * (rem // len_bucket) + ids_bucket[:(rem % len_bucket)]
ZeroDivisionError: integer division or modulo by zero
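The final ZeroDivisionError comes from the bucket sampler in data_utils.py: it pads each length bucket up to a multiple of the batch size, and if every sample in some bucket was skipped during filtering, `len_bucket` is 0. A minimal sketch of that padding pattern, reconstructed from the traceback (function name is illustrative, not the repo's actual code):

```python
def pad_bucket(ids_bucket, batch_size):
    """Pad a bucket of sample indices up to a multiple of batch_size,
    mirroring the line that crashes in data_utils.py."""
    len_bucket = len(ids_bucket)
    rem = batch_size - (len_bucket % batch_size)
    # If every sample in this length bucket was skipped during filtering,
    # len_bucket is 0 and this line divides by zero -- the crash above.
    return ids_bucket + ids_bucket * (rem // len_bucket) + ids_bucket[:(rem % len_bucket)]

print(pad_bucket([0, 1, 2], 4))  # a non-empty bucket pads fine: [0, 1, 2, 0]
try:
    pad_bucket([], 4)            # an empty bucket reproduces the ZeroDivisionError
except ZeroDivisionError:
    print("empty bucket -> ZeroDivisionError")
```

This is why shrinking or swapping the dataset (see below in the thread) can make the crash appear or disappear: what matters is whether any length bucket ends up with zero samples after filtering.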

2737473112 commented 1 year ago

Same here: error, emb_g.weight is not in the checkpoint

Slldyd2077 commented 1 year ago

Same question here qwq

sanshaoyeyang commented 1 year ago

After painful repeated attempts, including but not limited to swapping PyTorch and CUDA versions and modifying the code, nothing worked. I finally experimented with a small dataset and, to my surprise, it ran. The problem is probably that some piece of data was skipped, so empty data got read. After switching datasets and rerunning, the training code finally runs normally. Problem solved!

2737473112 commented 1 year ago

Thanks. Shrinking my dataset to a dozen or so entries also got it running. But real training can't use a dataset that small, so next I'll track down exactly which piece of data is missing and add a skip mechanism to the code.
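One possible shape for the skip mechanism mentioned above: drop any length bucket that ended up empty after filtering, so the sampler never divides by a zero-length bucket. This is a hypothetical sketch (function and variable names are illustrative, not the repo's actual code):

```python
def drop_empty_buckets(buckets, boundaries):
    """Remove empty buckets and merge their length ranges into the
    neighbors, keeping boundaries consistent with the kept buckets."""
    kept_buckets = []
    kept_boundaries = [boundaries[0]]
    for i, bucket in enumerate(buckets):
        if bucket:
            kept_buckets.append(bucket)
            kept_boundaries.append(boundaries[i + 1])
        else:
            # Log instead of crashing later with ZeroDivisionError.
            print(f"warning: bucket for lengths {boundaries[i]}..{boundaries[i + 1]} "
                  f"is empty, dropping it")
    return kept_buckets, kept_boundaries

buckets, bounds = drop_empty_buckets([[1], [], [2, 3]], [0, 10, 20, 30])
print(buckets, bounds)  # [[1], [2, 3]] [0, 10, 30]
```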

YYuX-1145 commented 1 year ago

> Same here: error, emb_g.weight is not in the checkpoint

That's not an error. It happens on a first training run when your speakers don't match the base model's; it's normal.
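The "is not in the checkpoint" message comes from partial checkpoint loading: parameters with no matching entry in the base model (such as emb_g.weight when your speaker count differs) keep their fresh initialization instead of aborting the run. An illustrative stand-in for that pattern, where a parameter is modeled as a `(shape, values)` tuple rather than a real torch tensor:

```python
def merge_state(own_state, checkpoint_state):
    """Copy a pretrained weight only when its name and shape match;
    otherwise keep the model's fresh init and print a warning,
    matching the message seen in the training log."""
    merged = {}
    for name, (shape, values) in own_state.items():
        ckpt = checkpoint_state.get(name)
        if ckpt is not None and ckpt[0] == shape:
            merged[name] = ckpt                 # take the pretrained weight
        else:
            print(f"error, {name} is not in the checkpoint")
            merged[name] = (shape, values)      # keep the fresh init
    return merged

own = {"emb_g.weight": ((10, 256), "fresh"), "enc_p.weight": ((192,), "fresh")}
ckpt = {"enc_p.weight": ((192,), "pretrained")}
merged = merge_state(own, ckpt)
```

So the message is informational: training continues, and only the mismatched embedding starts from scratch.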

YYuX-1145 commented 1 year ago

> After painful repeated attempts, including but not limited to swapping PyTorch and CUDA versions and modifying the code, nothing worked. I finally experimented with a small dataset and, to my surprise, it ran. The problem is probably that some piece of data was skipped, so empty data got read. After switching datasets and rerunning, the training code finally runs normally. Problem solved!

I suspect it's the num_workers count; I have it set to 4. Your GPU is a desktop 4090, right? If so, could you say how much RAM you have, whether any of your dataset clips are overly long, and whether you ran out of VRAM? With my 16 GB of VRAM + 64 GB of RAM and these settings, I can train on un-split Nahida voice data (1,577 clips, the longest 30 seconds). Try setting it to a different number, say 2 or 0?
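For context on the suggestion above: `num_workers` is a standard `torch.utils.data.DataLoader` argument, and each worker is a separate subprocess holding its own memory, so lowering it trades loading speed for memory headroom. A small self-contained sketch (the toy dataset is illustrative; train_first.py builds its loader from the real dataset and config):

```python
from torch.utils.data import DataLoader

# Any sequence works as a toy dataset for a smoke test.
data = list(range(8))

# num_workers=0 loads batches in the main process: slower, but no worker
# subprocesses are spawned, each of which would otherwise hold extra
# memory -- the suspected cause of the pressure discussed above.
loader = DataLoader(data, batch_size=4, num_workers=0)
batches = [batch.tolist() for batch in loader]
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```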

YYuX-1145 commented 1 year ago

> After painful repeated attempts, including but not limited to swapping PyTorch and CUDA versions and modifying the code, nothing worked. I finally experimented with a small dataset and, to my surprise, it ran. The problem is probably that some piece of data was skipped, so empty data got read. After switching datasets and rerunning, the training code finally runs normally. Problem solved!

AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute '_shutdown' — a quick search suggests this error is caused by running out of memory. If possible, could you reproduce it and record memory usage? If physical RAM isn't full, could virtual memory be insufficient? With the original author's setting of 24, committed memory already exceeded 128 GB on my first run.

webchong commented 1 year ago

Exactly the same error. I've been at it for three days with no solution. 3090 with 24 GB of VRAM, 64 GB of RAM, 300 GB of virtual memory; 50 audio clips of about 10 s each, a bit over 8 minutes in total. The moment I enter the training command I get the identical error. The environment is updated to the latest.

Sakuraiina commented 1 year ago

Addendum: I'm on the latest version, the one uploaded around 9 pm on September 3, 2023. Sampling rate 44.1 kHz.

E:\VITS_Bert>%PYTHON% train_first.py -c ./configs\config.json
INFO:OUTPUT_MODEL:{'train': {'log_interval': 10, 'eval_interval': 100, 'seed': 52, 'epochs': 1000, 'learning_rate': 0.0002, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 10, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 16384, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'use_mel_posterior_encoder': False, 'training_files': 'filelists/train.list', 'validation_files': 'filelists/val.list', 'max_wav_value': 32768.0, 'sampling_rate': 44100, 'filter_length': 2048, 'hop_length': 512, 'win_length': 2048, 'n_mel_channels': 128, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 1, 'cleaned_text': True, 'spk2id': {'Warma': 0}}, 'model': {'use_spk_conditioned_encoder': True, 'use_noise_scaled_mas': True, 'use_mel_posterior_encoder': False, 'use_duration_discriminator': True, 'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [8, 8, 2, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 8, 2, 2], 'n_layers_q': 3, 'use_spectral_norm': False, 'gin_channels': 256}, 'model_dir': './logs\./OUTPUT_MODEL'}
WARNING:OUTPUT_MODEL:E:\VITS_Bert is not a git repository, therefore hash value comparison will be ignored.
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
skipped: 11 , total: 230
skipped: 0 , total: 4
Using noise scaled MAS for VITS2
Using duration discriminator for VITS2
256 2 256 2 256 2 256 2 256 2
./logs./OUTPUT_MODEL\DUR_0.pth
load
INFO:OUTPUT_MODEL:Loaded checkpoint './logs./OUTPUT_MODEL\DUR_0.pth' (iteration 694)
./logs./OUTPUT_MODEL\G_0.pth
error, emb_g.weight is not in the checkpoint
load
INFO:OUTPUT_MODEL:Loaded checkpoint './logs./OUTPUT_MODEL\G_0.pth' (iteration 0)
./logs./OUTPUT_MODEL\D_0.pth
load
INFO:OUTPUT_MODEL:Loaded checkpoint './logs./OUTPUT_MODEL\D_0.pth' (iteration 0)
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x0000023BDCBB5AF0>
Traceback (most recent call last):
  File "E:\VITS_Bert\venv\lib\site-packages\torch\utils\data\dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "E:\VITS_Bert\venv\lib\site-packages\torch\utils\data\dataloader.py", line 1397, in _shutdown_workers
    if not self._shutdown:
AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute '_shutdown'
Traceback (most recent call last):
  File "train_first.py", line 409, in <module>
    main()
  File "train_first.py", line 58, in main
    mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
  File "E:\VITS_Bert\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "E:\VITS_Bert\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 198, in start_processes
    while not context.join():
  File "E:\VITS_Bert\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "E:\VITS_Bert\venv\lib\site-packages\torch\multiprocessing\spawn.py", line 69, in _wrap
    fn(i, *args)
  File "E:\VITS_Bert\train_first.py", line 191, in run
    train_and_evaluate(rank, epoch, hps, [net_g, net_d, net_dur_disc], [optim_g, optim_d, optim_dur_disc], [scheduler_g, scheduler_d, scheduler_dur_disc], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
  File "E:\VITS_Bert\train_first.py", line 215, in train_and_evaluate
    for batch_idx, (x, x_lengths, spec, spec_lengths, y, y_lengths, speakers, tone, language, bert) in tqdm(enumerate(train_loader)):
  File "E:\VITS_Bert\venv\lib\site-packages\torch\utils\data\dataloader.py", line 430, in __iter__
    self._iterator = self._get_iterator()
  File "E:\VITS_Bert\venv\lib\site-packages\torch\utils\data\dataloader.py", line 381, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "E:\VITS_Bert\venv\lib\site-packages\torch\utils\data\dataloader.py", line 988, in __init__
    super(_MultiProcessingDataLoaderIter, self).__init__(loader)
  File "E:\VITS_Bert\venv\lib\site-packages\torch\utils\data\dataloader.py", line 598, in __init__
    self._sampler_iter = iter(self._index_sampler)
  File "E:\VITS_Bert\data_utils.py", line 309, in __iter__
    ids_bucket = ids_bucket + ids_bucket * (rem // len_bucket) + ids_bucket[:(rem % len_bucket)]
ZeroDivisionError: integer division or modulo by zero

Everything from the command to the final error is above. 3090 with 24 GB of VRAM, Windows 10 with 128 GB of RAM; 246 clips in the dataset, all 10 s or shorter; the labeling model was medium, and batch_size in the config is 10. I don't understand why it errors.

I've also tried other TTS projects. VITS fast-fine-tuning has serious problems too: with the same dataset it runs for a while and then dies with torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 3221225477. I lowered batch_size to 2, 4, and 8, and none of that helped; that model was also medium. But at least it trains a bit before blowing up. This one I can't figure out at all; it won't even start.

YYuX-1145 commented 1 year ago

> Addendum: I'm on the latest version, the one uploaded around 9 pm on September 3, 2023. Sampling rate 44.1 kHz. [full log and details quoted from the previous comment]

The September 3 version has switched training to train_ms.py. Please update the other files as well, and remember to delete the modules that are no longer used.

xu-jia-ming commented 1 month ago

> After painful repeated attempts, including but not limited to swapping PyTorch and CUDA versions and modifying the code, nothing worked. I finally experimented with a small dataset and, to my surprise, it ran. The problem is probably that some piece of data was skipped, so empty data got read. After switching datasets and rerunning, the training code finally runs normally. Problem solved!

Exactly the same experience. I also shrank the dataset and then it worked. Before that it kept erroring with a channel mismatch, which drove me crazy. If only I'd found this thread earlier.