RVC-Boss / GPT-SoVITS

1 min voice data can also be used to train a good TTS model! (few shot voice cloning)
MIT License

RuntimeError: use_libuv was requested but PyTorch was build without libuv support #1357

Open kellenyuan opened 1 month ago

kellenyuan commented 1 month ago

    "C:\Users\11500\.conda\envs\GPTSoVits\python.exe" GPT_SoVITS/s2_train.py --config "D:\software\ai\GPT-SoVITS-beta0706\TEMP/tmp_s2.json"
    INFO:IceGirl:{'train': {'log_interval': 100, 'eval_interval': 500, 'seed': 1234, 'epochs': 8, 'learning_rate': 0.0001, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 2, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 20480, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0, 'text_low_lr_rate': 0.4, 'pretrained_s2G': 'GPT_SoVITS/pretrained_models/s2G488k.pth', 'pretrained_s2D': 'GPT_SoVITS/pretrained_models/s2D488k.pth', 'if_save_latest': True, 'if_save_every_weights': True, 'save_every_epoch': 4, 'gpu_numbers': '0'}, 'data': {'max_wav_value': 32768.0, 'sampling_rate': 32000, 'filter_length': 2048, 'hop_length': 640, 'win_length': 2048, 'n_mel_channels': 128, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 300, 'cleaned_text': True, 'exp_dir': 'logs/IceGirl'}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [10, 8, 2, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 8, 2, 2], 'n_layers_q': 3, 'use_spectral_norm': False, 'gin_channels': 512, 'semantic_frame_rate': '25hz', 'freeze_quantizer': True}, 's2_ckpt_dir': 'logs/IceGirl', 'content_module': 'cnhubert', 'save_weight_dir': 'SoVITS_weights', 'name': 'IceGirl', 'pretrain': None, 'resume_step': None}
    Traceback (most recent call last):
      File "D:\software\ai\GPT-SoVITS-beta0706\GPT_SoVITS\s2_train.py", line 600, in <module>
        main()
      File "D:\software\ai\GPT-SoVITS-beta0706\GPT_SoVITS\s2_train.py", line 56, in main
        mp.spawn(
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\multiprocessing\spawn.py", line 282, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\multiprocessing\spawn.py", line 238, in start_processes
        while not context.join():
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\multiprocessing\spawn.py", line 189, in join
        raise ProcessRaisedException(msg, error_index, failed_process.pid)
    torch.multiprocessing.spawn.ProcessRaisedException:

    -- Process 0 terminated with the following error:
    Traceback (most recent call last):
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\multiprocessing\spawn.py", line 76, in _wrap
        fn(i, *args)
      File "D:\software\ai\GPT-SoVITS-beta0706\GPT_SoVITS\s2_train.py", line 75, in run
        dist.init_process_group(
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\distributed\c10d_logger.py", line 79, in wrapper
        return func(*args, **kwargs)
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\distributed\c10d_logger.py", line 93, in wrapper
        func_return = func(*args, **kwargs)
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\distributed\distributed_c10d.py", line 1361, in init_process_group
        store, rank, world_size = next(rendezvous_iterator)
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\distributed\rendezvous.py", line 258, in _env_rendezvous_handler
        store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\distributed\rendezvous.py", line 185, in _create_c10d_store
        return TCPStore(
    RuntimeError: use_libuv was requested but PyTorch was build without libuv support

yuncengshangdepingyuan commented 1 month ago

I solved the issue by downgrading; try this: pip3 uninstall torch torchvision torchaudio, and then: pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118

TEGRAXD commented 1 month ago

Downgrading PyTorch to version 2.3.x will solve the issue. Mine was 2.3.1+cu121.

grizzlybearg commented 1 month ago

Instead of downgrading, you could follow the instructions as seen in https://pytorch.org/tutorials/intermediate/TCPStore_libuv_backend.html to disable libuv @kellenyuan

MXXXXXS commented 1 month ago


I ran into this in Retrieval-based-Voice-Conversion-WebUI as well. Do a global search for init_method="env://" and change it to init_method="env://?use_libuv=False".

Huiyicc commented 1 month ago

from GPT_SoVITS/s2_train.py:78

    dist.init_process_group(
        backend = "gloo" if os.name == "nt" or not torch.cuda.is_available() else "nccl",
        init_method="env://",
        world_size=n_gpus,
        rank=rank,
    )

init_method="env://" -> init_method="env://?use_libuv=False"
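As a self-contained sketch of what the patched call does, reduced to a single local process on the gloo backend (the MASTER_ADDR/MASTER_PORT values here are arbitrary, and the `use_libuv` query parameter assumes a PyTorch version that recognizes it, i.e. roughly 2.4+):

```python
import os
import torch.distributed as dist

# Rendezvous settings for a single local process (any free port works).
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29599"

# The ?use_libuv=False query string tells the TCPStore rendezvous not to
# use the libuv backend, which some Windows wheels are built without.
dist.init_process_group(
    backend="gloo",
    init_method="env://?use_libuv=False",
    world_size=1,
    rank=0,
)
assert dist.is_initialized()
dist.destroy_process_group()
```

On a real multi-GPU run the `world_size` and `rank` would come from the spawn loop as in the snippet above; only the `init_method` string changes.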

TheChosenPerson commented 1 month ago

    Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
    Traceback (most recent call last):
      File "E:\ptojects\Models\GPT-SoVITS-main\GPT-SoVITS-main\GPT_SoVITS\s1_train.py", line 183, in <module>
        main(args)
      File "E:\ptojects\Models\GPT-SoVITS-main\GPT-SoVITS-main\GPT_SoVITS\s1_train.py", line 159, in main
        trainer.fit(model, data_module, ckpt_path=ckpt_path)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 538, in fit
        call._call_and_handle_interrupt(
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\pytorch_lightning\trainer\call.py", line 46, in _call_and_handle_interrupt
        return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\pytorch_lightning\strategies\launchers\subprocess_script.py", line 105, in launch
        return function(*args, **kwargs)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 574, in _fit_impl
        self._run(model, ckpt_path=ckpt_path)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 937, in _run
        self.strategy.setup_environment()
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 154, in setup_environment
        self.setup_distributed()
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 203, in setup_distributed
        _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\lightning_fabric\utilities\distributed.py", line 297, in _init_dist_connection
        torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\torch\distributed\c10d_logger.py", line 79, in wrapper
        return func(*args, **kwargs)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\torch\distributed\c10d_logger.py", line 93, in wrapper
        func_return = func(*args, **kwargs)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\torch\distributed\distributed_c10d.py", line 1361, in init_process_group
        store, rank, world_size = next(rendezvous_iterator)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\torch\distributed\rendezvous.py", line 258, in _env_rendezvous_handler
        store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\torch\distributed\rendezvous.py", line 185, in _create_c10d_store
        return TCPStore(
    RuntimeError: use_libuv was requested but PyTorch was build without libuv support

Huiyicc commented 1 month ago


Just apply the change from my reply above.

TheChosenPerson commented 1 month ago

Your reply fixes one occurrence, but this is a different one, in trainer.fit.

NadekoShiro commented 1 month ago


The method above fixes SoVITS model training, but the GPT model still hits this error.

Huiyicc commented 1 month ago


You'd probably still need to post the exact error; it worked for me after the change, so I'm not sure what this second failure you're both seeing is.

Nmoumou commented 1 month ago

@NadekoShiro @Huiyicc For GPT training, s1_train.py also fails with use_libuv was requested but PyTorch was build without libuv support, so s1_train.py needs to be modified as well:

    logger = TensorBoardLogger(name=output_dir.stem, save_dir=output_dir)
    os.environ["MASTER_ADDR"]="localhost"
    os.environ["USE_LIBUV"] = "0"
    trainer: Trainer = Trainer(

Adding os.environ["USE_LIBUV"] = "0" after line 120 fixes it.
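Stripped of the surrounding trainer setup, the fix is just an environment variable that must be set before torch.distributed creates its TCPStore; a minimal sketch:

```python
import os

# "0" makes the TCPStore rendezvous skip the libuv backend.
# This must run before any torch.distributed initialization.
os.environ["USE_LIBUV"] = "0"

print(os.environ["USE_LIBUV"])  # → 0
```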

TheChosenPerson commented 1 month ago

Tested; it works.

mutified commented 1 week ago

To compile PyTorch with libuv support yourself, be prepared for 100% CPU usage during the build; Ninja does that...

Maybe it's better to downgrade instead, although in my experience that sometimes breaks the virtual environment.

I asked ChatGPT about the purpose of libuv here. It's wordy, so there may be some inaccuracies, but in short: when training a model on multiple GPUs, libuv helps coordinate the distributed processes.

In large-scale distributed training, a machine learning model and its training process are divided across multiple devices (GPUs, CPUs) and/or machines (nodes in a cluster). The goal is to speed up training, handle larger models, and process massive datasets that cannot fit in the memory of a single device or machine. Here's a breakdown of what exactly is distributed in these large-scale jobs with more than 1024 ranks:

Key Components Distributed in Large-Scale Training:

1. Model Parameters: replicated on every device in data parallelism, or split across devices in model parallelism.

2. Data: each rank reads a different shard of the training batches.

3. Computations: forward and backward passes run in parallel on each device's share of the work.

4. Communication: gradients and activations are exchanged between ranks (e.g. via all-reduce).

5. Optimizer States: momentum and Adam statistics can be sharded across ranks to save memory.

6. Randomness/Seeding: seeds are coordinated so shuffling and augmentation differ per rank but remain reproducible.

Types of Distribution in Large-Scale Training:

1. Data Parallelism: every device holds a full copy of the model and processes different data; gradients are averaged each step.

2. Model Parallelism: the layers or tensors of a single model are split across devices.

3. Pipeline Parallelism: consecutive groups of layers run on different devices, with micro-batches streamed through them.

4. Hybrid Parallelism: combinations of the above, used for the largest models.
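As a toy illustration of the data-parallel split (the `shard` helper is made up for this sketch, not part of any library): each rank trains on a strided slice of the dataset, and every sample lands on exactly one rank.

```python
def shard(dataset, rank, world_size):
    """Slice of `dataset` assigned to `rank` out of `world_size` ranks."""
    return dataset[rank::world_size]

samples = list(range(10))
shards = [shard(samples, r, 4) for r in range(4)]

# Shards are disjoint and together cover the whole dataset.
assert sorted(x for part in shards for x in part) == samples
print(shards)  # → [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```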

Why Large-Scale Distribution is Important:

  1. Handling Massive Models: Models with hundreds of billions of parameters (such as GPT-3 or large transformer models) cannot fit into a single device's memory. Model parallelism distributes the model across multiple devices to handle these huge architectures.

  2. Training with Massive Datasets: In real-world applications like training vision models on ImageNet or language models on massive corpora, the dataset is often too large to fit into the memory of a single device. Data parallelism helps process these datasets in parallel across many devices, allowing training to scale up.

  3. Faster Training: Distributed training, especially with thousands of ranks, reduces the time required to train models by dividing the workload across multiple devices. For large companies and research labs, speeding up training is critical for iteration and experimentation.

  4. Communication Overhead: Efficient communication between ranks becomes critical as the number of nodes increases. This is where libraries like libuv help optimize I/O and communication between nodes.
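The points above can be made concrete with a pure-Python sketch of the gradient all-reduce that data-parallel ranks perform each step (illustrative only; real training uses NCCL or Gloo collectives, and `all_reduce_mean` is a made-up name):

```python
def all_reduce_mean(grads_per_rank):
    """Average corresponding gradient entries across all ranks."""
    world_size = len(grads_per_rank)
    return [sum(g) / world_size for g in zip(*grads_per_rank)]

# Two ranks computed gradients on different data shards;
# after the all-reduce, every rank holds the same averaged gradient.
ranks = [[1.0, 2.0], [3.0, 4.0]]
print(all_reduce_mean(ranks))  # → [2.0, 3.0]
```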

Conclusion:

In large-scale distributed training, components such as model parameters, data, computations (forward and backward passes), gradients, optimizer states, and communication are distributed across thousands of ranks to enable efficient training of huge models on massive datasets. Libraries like libuv help PyTorch scale this to more than 1024 ranks by handling the communication challenges efficiently and making distributed training more scalable and robust.