RVC-Boss / GPT-SoVITS

1 min voice data can also be used to train a good TTS model! (few shot voice cloning)
MIT License

RuntimeError: use_libuv was requested but PyTorch was build without libuv support #1357

Open kellenyuan opened 1 month ago

kellenyuan commented 1 month ago

    "C:\Users\11500\.conda\envs\GPTSoVits\python.exe" GPT_SoVITS/s2_train.py --config "D:\software\ai\GPT-SoVITS-beta0706\TEMP/tmp_s2.json"
    INFO:IceGirl:{'train': {'log_interval': 100, 'eval_interval': 500, 'seed': 1234, 'epochs': 8, 'learning_rate': 0.0001, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 2, 'fp16_run': False, 'lr_decay': 0.999875, 'segment_size': 20480, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0, 'text_low_lr_rate': 0.4, 'pretrained_s2G': 'GPT_SoVITS/pretrained_models/s2G488k.pth', 'pretrained_s2D': 'GPT_SoVITS/pretrained_models/s2D488k.pth', 'if_save_latest': True, 'if_save_every_weights': True, 'save_every_epoch': 4, 'gpu_numbers': '0'}, 'data': {'max_wav_value': 32768.0, 'sampling_rate': 32000, 'filter_length': 2048, 'hop_length': 640, 'win_length': 2048, 'n_mel_channels': 128, 'mel_fmin': 0.0, 'mel_fmax': None, 'add_blank': True, 'n_speakers': 300, 'cleaned_text': True, 'exp_dir': 'logs/IceGirl'}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0.1, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [10, 8, 2, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 8, 2, 2], 'n_layers_q': 3, 'use_spectral_norm': False, 'gin_channels': 512, 'semantic_frame_rate': '25hz', 'freeze_quantizer': True}, 's2_ckpt_dir': 'logs/IceGirl', 'content_module': 'cnhubert', 'save_weight_dir': 'SoVITS_weights', 'name': 'IceGirl', 'pretrain': None, 'resume_step': None}
    Traceback (most recent call last):
      File "D:\software\ai\GPT-SoVITS-beta0706\GPT_SoVITS\s2_train.py", line 600, in <module>
        main()
      File "D:\software\ai\GPT-SoVITS-beta0706\GPT_SoVITS\s2_train.py", line 56, in main
        mp.spawn(
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\multiprocessing\spawn.py", line 282, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\multiprocessing\spawn.py", line 238, in start_processes
        while not context.join():
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\multiprocessing\spawn.py", line 189, in join
        raise ProcessRaisedException(msg, error_index, failed_process.pid)
    torch.multiprocessing.spawn.ProcessRaisedException:

    -- Process 0 terminated with the following error:
    Traceback (most recent call last):
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\multiprocessing\spawn.py", line 76, in _wrap
        fn(i, *args)
      File "D:\software\ai\GPT-SoVITS-beta0706\GPT_SoVITS\s2_train.py", line 75, in run
        dist.init_process_group(
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\distributed\c10d_logger.py", line 79, in wrapper
        return func(*args, **kwargs)
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\distributed\c10d_logger.py", line 93, in wrapper
        func_return = func(*args, **kwargs)
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\distributed\distributed_c10d.py", line 1361, in init_process_group
        store, rank, world_size = next(rendezvous_iterator)
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\distributed\rendezvous.py", line 258, in _env_rendezvous_handler
        store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
      File "C:\Users\11500\.conda\envs\GPTSoVits\lib\site-packages\torch\distributed\rendezvous.py", line 185, in _create_c10d_store
        return TCPStore(
    RuntimeError: use_libuv was requested but PyTorch was build without libuv support

yuncengshangdepingyuan commented 1 month ago

I solved the issue by downgrading; try this: pip3 uninstall torch torchvision torchaudio, and then: pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118

TEGRAXD commented 1 month ago

Downgrading PyTorch to version 2.3.x will solve the issue. Mine was 2.3.1+cu121.

grizzlybearg commented 1 month ago

Instead of downgrading, you could follow the instructions as seen in https://pytorch.org/tutorials/intermediate/TCPStore_libuv_backend.html to disable libuv @kellenyuan

MXXXXXS commented 1 month ago


I ran into this in Retrieval-based-Voice-Conversion-WebUI as well. Do a global search for init_method="env://" and change it to init_method="env://?use_libuv=False".

Huiyicc commented 1 month ago

from GPT_SoVITS/s2_train.py:78

    dist.init_process_group(
        backend = "gloo" if os.name == "nt" or not torch.cuda.is_available() else "nccl",
        init_method="env://",
        world_size=n_gpus,
        rank=rank,
    )

init_method="env://" -> init_method="env://?use_libuv=False"
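As a self-contained sketch of what the patched call does, reduced to a single local process on the gloo backend (the MASTER_ADDR/MASTER_PORT values here are arbitrary, and the `use_libuv` query parameter assumes a PyTorch version that recognizes it, i.e. roughly 2.4+):

```python
import os
import torch.distributed as dist

# Rendezvous settings for a single local process (any free port works).
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29599"

# The ?use_libuv=False query string tells the TCPStore rendezvous not to
# use the libuv backend, which some Windows wheels are built without.
dist.init_process_group(
    backend="gloo",
    init_method="env://?use_libuv=False",
    world_size=1,
    rank=0,
)
assert dist.is_initialized()
dist.destroy_process_group()
```

On a real multi-GPU run the `world_size` and `rank` would come from the spawn loop as in the snippet above; only the `init_method` string changes.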

TheChosenPerson commented 1 month ago

    Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
    Traceback (most recent call last):
      File "E:\ptojects\Models\GPT-SoVITS-main\GPT-SoVITS-main\GPT_SoVITS\s1_train.py", line 183, in <module>
        main(args)
      File "E:\ptojects\Models\GPT-SoVITS-main\GPT-SoVITS-main\GPT_SoVITS\s1_train.py", line 159, in main
        trainer.fit(model, data_module, ckpt_path=ckpt_path)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 538, in fit
        call._call_and_handle_interrupt(
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\pytorch_lightning\trainer\call.py", line 46, in _call_and_handle_interrupt
        return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\pytorch_lightning\strategies\launchers\subprocess_script.py", line 105, in launch
        return function(*args, **kwargs)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 574, in _fit_impl
        self._run(model, ckpt_path=ckpt_path)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 937, in _run
        self.strategy.setup_environment()
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 154, in setup_environment
        self.setup_distributed()
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\pytorch_lightning\strategies\ddp.py", line 203, in setup_distributed
        _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\lightning_fabric\utilities\distributed.py", line 297, in _init_dist_connection
        torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\torch\distributed\c10d_logger.py", line 79, in wrapper
        return func(*args, **kwargs)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\torch\distributed\c10d_logger.py", line 93, in wrapper
        func_return = func(*args, **kwargs)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\torch\distributed\distributed_c10d.py", line 1361, in init_process_group
        store, rank, world_size = next(rendezvous_iterator)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\torch\distributed\rendezvous.py", line 258, in _env_rendezvous_handler
        store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
      File "D:\ProgramData\anaconda3\envs\test1\lib\site-packages\torch\distributed\rendezvous.py", line 185, in _create_c10d_store
        return TCPStore(
    RuntimeError: use_libuv was requested but PyTorch was build without libuv support

Huiyicc commented 1 month ago


Just apply the change from my reply above.

TheChosenPerson commented 1 month ago

Your reply fixes one occurrence, but this is a different one, in trainer.fit.

NadekoShiro commented 1 month ago


The method above fixes SoVITS model training, but the GPT model still hits this error.

Huiyicc commented 1 month ago


You'd probably still need to post the exact error; it worked for me after the change, so I'm not sure what this second failure you're both seeing is.

Nmoumou commented 1 month ago

@NadekoShiro @Huiyicc For GPT training, s1_train.py also fails with use_libuv was requested but PyTorch was build without libuv support, so s1_train.py needs to be modified as well:

    logger = TensorBoardLogger(name=output_dir.stem, save_dir=output_dir)
    os.environ["MASTER_ADDR"]="localhost"
    os.environ["USE_LIBUV"] = "0"
    trainer: Trainer = Trainer(

Adding os.environ["USE_LIBUV"] = "0" after line 120 fixes it.
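Stripped of the surrounding trainer setup, the fix is just an environment variable that must be set before torch.distributed creates its TCPStore; a minimal sketch:

```python
import os

# "0" makes the TCPStore rendezvous skip the libuv backend.
# This must run before any torch.distributed initialization.
os.environ["USE_LIBUV"] = "0"

print(os.environ["USE_LIBUV"])  # → 0
```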

TheChosenPerson commented 1 month ago

Tested; it works.

mutified commented 1 week ago

To compile PyTorch with libuv support yourself, be prepared for 100% CPU usage during the build; Ninja does that...

Maybe it's better to downgrade instead, although in my experience that sometimes breaks the virtual environment.

I asked ChatGPT about the purpose of libuv here. It's wordy, so there may be some inaccuracies, but in short: when training a model on multiple GPUs, libuv helps coordinate the distributed processes.

In large-scale distributed training, a machine learning model and its training process are divided across multiple devices (GPUs, CPUs) and/or machines (nodes in a cluster). The goal is to speed up training, handle larger models, and process massive datasets that cannot fit in the memory of a single device or machine. Here's a breakdown of what exactly is distributed in these large-scale jobs with more than 1024 ranks:

Key Components Distributed in Large-Scale Training:

1. Model Parameters: replicated on every device in data parallelism, or split across devices in model parallelism.

2. Data: each rank reads a different shard of the training batches.

3. Computations: forward and backward passes run in parallel on each device's share of the work.

4. Communication: gradients and activations are exchanged between ranks (e.g. via all-reduce).

5. Optimizer States: momentum and Adam statistics can be sharded across ranks to save memory.

6. Randomness/Seeding: seeds are coordinated so shuffling and augmentation differ per rank but remain reproducible.

Types of Distribution in Large-Scale Training:

1. Data Parallelism: every device holds a full copy of the model and processes different data; gradients are averaged each step.

2. Model Parallelism: the layers or tensors of a single model are split across devices.

3. Pipeline Parallelism: consecutive groups of layers run on different devices, with micro-batches streamed through them.

4. Hybrid Parallelism: combinations of the above, used for the largest models.
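As a toy illustration of the data-parallel split (the `shard` helper is made up for this sketch, not part of any library): each rank trains on a strided slice of the dataset, and every sample lands on exactly one rank.

```python
def shard(dataset, rank, world_size):
    """Slice of `dataset` assigned to `rank` out of `world_size` ranks."""
    return dataset[rank::world_size]

samples = list(range(10))
shards = [shard(samples, r, 4) for r in range(4)]

# Shards are disjoint and together cover the whole dataset.
assert sorted(x for part in shards for x in part) == samples
print(shards)  # → [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```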

Why Large-Scale Distribution is Important:

  1. Handling Massive Models: Models with hundreds of billions of parameters (such as GPT-3 or large transformer models) cannot fit into a single device's memory. Model parallelism distributes the model across multiple devices to handle these huge architectures.

  2. Training with Massive Datasets: In real-world applications like training vision models on ImageNet or language models on massive corpora, the dataset is often too large to fit into the memory of a single device. Data parallelism helps process these datasets in parallel across many devices, allowing training to scale up.

  3. Faster Training: Distributed training, especially with thousands of ranks, reduces the time required to train models by dividing the workload across multiple devices. For large companies and research labs, speeding up training is critical for iteration and experimentation.

  4. Communication Overhead: Efficient communication between ranks becomes critical as the number of nodes increases. This is where libraries like libuv help optimize I/O and communication between nodes.
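The points above can be made concrete with a pure-Python sketch of the gradient all-reduce that data-parallel ranks perform each step (illustrative only; real training uses NCCL or Gloo collectives, and `all_reduce_mean` is a made-up name):

```python
def all_reduce_mean(grads_per_rank):
    """Average corresponding gradient entries across all ranks."""
    world_size = len(grads_per_rank)
    return [sum(g) / world_size for g in zip(*grads_per_rank)]

# Two ranks computed gradients on different data shards;
# after the all-reduce, every rank holds the same averaged gradient.
ranks = [[1.0, 2.0], [3.0, 4.0]]
print(all_reduce_mean(ranks))  # → [2.0, 3.0]
```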

Conclusion:

In large-scale distributed training, components such as model parameters, data, computations (forward and backward passes), gradients, optimizer states, and communication are distributed across thousands of ranks to enable efficient training of huge models on massive datasets. Libraries like libuv help PyTorch scale this to more than 1024 ranks by handling the communication challenges efficiently and making distributed training more scalable and robust.