Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

mixtral8x7b nccl timeout #147

Closed bao-xiaoyi closed 5 months ago

bao-xiaoyi commented 6 months ago

Printed output from the worker nodes:

[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=65265, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804787 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=65265, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1803489 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=65265, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1803376 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=65265, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1803124 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=65265, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804691 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=65265, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804787 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=65265, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1803489 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=65265, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1804691 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=65265, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1803124 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=65265, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1803376 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 98 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 99 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 101 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 97) of binary: /home/kas/.conda/envs/llama-accessory/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 303.1365716457367 seconds
Traceback (most recent call last):
File "/home/kas/.conda/envs/llama-accessory/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 920, in _exit_barrier
store_util.barrier(
File "/home/kas/.conda/envs/llama-accessory/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
File "/home/kas/.conda/envs/llama-accessory/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
agent_data = get_all(store, rank, key_prefix, world_size)
File "/home/kas/.conda/envs/llama-accessory/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
Traceback (most recent call last):
File "/home/kas/.conda/envs/llama-accessory/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/kas/.conda/envs/llama-accessory/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/kas/.conda/envs/llama-accessory/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/kas/.conda/envs/llama-accessory/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/kas/.conda/envs/llama-accessory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/kas/.conda/envs/llama-accessory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
LLaMA2-Accessory/accessory/main_finetune.py FAILED
------------------------------------------------------------

Specifically, this happens during the first model save; the master node prints:

[18:29:29.735454] Epoch: [0]  [4960/5328]  lr: 0.000004  closs: 0.0359 (0.3235)  load_balancing: 0.1030 (0.1028)  grad_norm: 2.8816 (3.2546)  time: 1.2789  data: 0.0001  max mem: 42773
[18:29:44.821415] Epoch: [0]  [4970/5328]  lr: 0.000004  closs: 0.0359 (0.3235)  load_balancing: 0.1030 (0.1028)  grad_norm: 2.8816 (3.2498)  time: 1.2792  data: 0.0001  max mem: 42773
[18:29:56.819078] Epoch: [0]  [4980/5328]  lr: 0.000004  closs: 0.0342 (0.3229)  load_balancing: 0.1030 (0.1028)  grad_norm: 2.8813 (3.2449)  time: 1.2792  data: 0.0001  max mem: 42773
[18:30:06.371144] Epoch: [0]  [4990/5328]  lr: 0.000004  closs: 0.0401 (0.3265)  load_balancing: 0.1030 (0.1028)  grad_norm: 2.8813 (3.2439)  time: 1.2746  data: 0.0001  max mem: 42773
/home/kas/.conda/envs/llama-accessory/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1110: UserWarning: ``FullyShardedDataParallel.full_optim_state_dict``is being deprecated and is replaced by ``FullyShardedDataParallel.optim_state_dict``. ``FullyShardedDataParallel.full_optim_state_dict`` may be removed after PyTorch 2.2.
  warnings.warn(
/home/kas/.conda/envs/llama-accessory/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1110: UserWarning: ``FullyShardedDataParallel.full_optim_state_dict``is being deprecated and is replaced by ``FullyShardedDataParallel.optim_state_dict``. ``FullyShardedDataParallel.full_optim_state_dict`` may be removed after PyTorch 2.2.
  warnings.warn(
[18:43:09.327646] model saved
/home/kas/.conda/envs/llama-accessory/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1110: UserWarning: ``FullyShardedDataParallel.full_optim_state_dict``is being deprecated and is replaced by ``FullyShardedDataParallel.optim_state_dict``. ``FullyShardedDataParallel.full_optim_state_dict`` may be removed after PyTorch 2.2.
  warnings.warn(
/home/kas/.conda/envs/llama-accessory/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1110: UserWarning: ``FullyShardedDataParallel.full_optim_state_dict``is being deprecated and is replaced by ``FullyShardedDataParallel.optim_state_dict``. ``FullyShardedDataParallel.full_optim_state_dict`` may be removed after PyTorch 2.2.
  warnings.warn(
/home/kas/.conda/envs/llama-accessory/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1110: UserWarning: ``FullyShardedDataParallel.full_optim_state_dict``is being deprecated and is replaced by ``FullyShardedDataParallel.optim_state_dict``. ``FullyShardedDataParallel.full_optim_state_dict`` may be removed after PyTorch 2.2.
  warnings.warn(
/home/kas/.conda/envs/llama-accessory/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1110: UserWarning: ``FullyShardedDataParallel.full_optim_state_dict``is being deprecated and is replaced by ``FullyShardedDataParallel.optim_state_dict``. ``FullyShardedDataParallel.full_optim_state_dict`` may be removed after PyTorch 2.2.
  warnings.warn(
/home/kas/.conda/envs/llama-accessory/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1110: UserWarning: ``FullyShardedDataParallel.full_optim_state_dict``is being deprecated and is replaced by ``FullyShardedDataParallel.optim_state_dict``. ``FullyShardedDataParallel.full_optim_state_dict`` may be removed after PyTorch 2.2.
  warnings.warn(
/home/kas/.conda/envs/llama-accessory/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:1110: UserWarning: ``FullyShardedDataParallel.full_optim_state_dict``is being deprecated and is replaced by ``FullyShardedDataParallel.optim_state_dict``. ``FullyShardedDataParallel.full_optim_state_dict`` may be removed after PyTorch 2.2.
  warnings.warn(
bao-xiaoyi commented 6 months ago

I looked at the saved results. The model weights and optimizer parameters were saved successfully, but the _save_other step failed. The saved model can be used for inference, but it generates a large amount of repetition.

bao-xiaoyi commented 5 months ago

In addition, I also found that the .model.pth file saved by fine-tuning (11G) is smaller than the initial .model.pth file (14G). Is this normal?

ChrisLiu6 commented 5 months ago

The saved model can be used for inference, but it generates a large amount of repetition.

Do you mean there are a lot of *of*.model.pth files (i.e., multiple model shards)? If yes, this is expected behavior because the model is split into n parts via model (i.e. tensor) parallelism, and each part is saved separately.

I looked at the saved results. The model weights and optimizer parameters were saved successfully, but the _save_other step failed.

Could you please check whether the _save_optimizer call actually completed on all processes? For example, you should see the same number of .optim.pth files as .model.pth files, and all the .optim.pth files should have equal size. I am afraid the problem you are seeing may not come from the _save_other call, because _save_other involves no NCCL communication.
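For instance, a quick check along these lines could be run in the checkpoint directory (the directory path is a placeholder; only the *.model.pth / *.optim.pth naming comes from this thread):

```python
from pathlib import Path

ckpt_dir = Path("output_dir/epoch0")  # placeholder: your checkpoint directory
model_files = sorted(ckpt_dir.glob("*.model.pth"))
optim_files = sorted(ckpt_dir.glob("*.optim.pth"))

print(f"{len(model_files)} model shards, {len(optim_files)} optim shards")
for f in model_files + optim_files:
    # All .optim.pth files should have (roughly) equal size.
    print(f"{f.name}: {f.stat().st_size / 2**30:.2f} GiB")
```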

The .model.pth file saved by fine-tuning (11G) is smaller than the initial .model.pth file (14G). Is this normal?

It is not normal. Is the number of .model.pth files (i.e. the model parallel size) the same before and after? If not, it is expected that the size of each individual file will differ; if yes, please check the load_state_dict result to see whether all params match perfectly. If they do, there should not be a big problem.
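One hedged way to run that check, assuming the fine-tuned shard is loaded into a model built exactly as for the original checkpoint (the path is a placeholder, and the file may nest the weights under a key such as 'model'); only the return value of load_state_dict matters here:

```python
import torch

# `model` is the (model-parallel) module for this rank, constructed the same
# way as when loading the original weights; the path below is a placeholder.
state_dict = torch.load("path/to/finetuned.model.pth", map_location="cpu")
result = model.load_state_dict(state_dict, strict=False)

# Empty lists mean every parameter matched despite the smaller file size.
print("missing keys:   ", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
```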

bao-xiaoyi commented 5 months ago

The saved model can be used for inference, but it generates a large amount of repetition.

Do you mean there are a lot of *of*.model.pth files (i.e., multiple model shards)? If yes, this is expected behavior because the model is split into n parts via model (i.e. tensor) parallelism, and each part is saved separately.

I looked at the saved results. The model weights and optimizer parameters were saved successfully, but the _save_other step failed.

Could you please check whether the _save_optimizer call actually completed on all processes? For example, you should see the same number of .optim.pth files as .model.pth files, and all the .optim.pth files should have equal size. I am afraid the problem you are seeing may not come from the _save_other call, because _save_other involves no NCCL communication.

The .model.pth file saved by fine-tuning (11G) is smaller than the initial .model.pth file (14G). Is this normal?

It is not normal. Is the number of .model.pth files (i.e. the model parallel size) the same before and after? If not, it is expected that the size of each individual file will differ; if yes, please check the load_state_dict result to see whether all params match perfectly. If they do, there should not be a big problem.

To answer the first question: the large amount of repetition I described means that inference blindly generates up to the maximum length (the training data is nowhere near that long) and never terminates early.

On the second question: the final print statement in _save_optimizer was never reached, but the files that should have been saved were saved. In other words, both the .model.pth and .optim.pth files exist and their counts are correct. [image] However, the rank-specific file of one node was not saved.

On the third question: for the initial model, the file sizes are as follows. [image] Surprisingly, inference works normally; apart from the generation length, no errors were found, and the reported total parameter count of the loaded model is correct.

The first figure shows the checkpoints saved by fine-tuning, while the second shows the initial model.

bao-xiaoyi commented 5 months ago

My thinking is that checkpoint saving should not involve any NCCL communication, yet the model did hit an NCCL timeout during or right after saving the checkpoint.
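For context, the Timeout(ms)=1800000 in the log is the default 30-minute NCCL collective timeout configured when the process group is initialized. A minimal sketch of raising it, assuming you can tolerate longer waits while rank 0 saves (the 2-hour value is arbitrary, and where LLaMA2-Accessory calls init_process_group is not shown in this thread):

```python
import datetime
import torch.distributed as dist

# Raise the NCCL collective timeout from the default 30 minutes so that ranks
# waiting on rank 0 during a long checkpoint save do not trip the watchdog.
# The 2-hour value is an assumption; tune it to your observed save time.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=2))
```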

ChrisLiu6 commented 5 months ago

and never terminates early.

If you use our default conversation template, each response should end with \n###. Do you see this in your responses? If it appears in the response, I guess you are using MetaModel.generate without passing the additional_stop_symbols argument; please set additional_stop_symbols=['\n###'].
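For illustration, a hedged sketch of passing that argument; everything besides additional_stop_symbols (how `model` was loaded, the prompt, the other keyword names) is an assumption rather than a confirmed part of the MetaModel API:

```python
# `model` is assumed to be an already-loaded MetaModel instance; the keyword
# names other than `additional_stop_symbols` are illustrative guesses.
prompts = ["### Instruction:\nBriefly introduce Mixtral-8x7B.\n### Response:"]

responses = model.generate(
    prompts,
    max_gen_len=512,                    # assumed parameter name
    additional_stop_symbols=['\n###'],  # stop at the conversation-template delimiter
)
print(responses[0])
```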

My thinking is that checkpoint saving should not involve any NCCL communication, yet the model did hit an NCCL timeout during or right after saving the checkpoint.

If the model.pth and optim.pth files are saved correctly, you may comment out the save_other and save_rank_specific calls until you figure out the cause of the NCCL error. The bug is weird.

bao-xiaoyi commented 5 months ago

optim.pth

Can I also comment out the saving of the optim.pth file? I don't think the inference process involves it.

ChrisLiu6 commented 5 months ago

The optim.pth files are only useful for resuming training. Without them your optimizer states will be lost. If you can bear this, it's okay to comment that out as well.

bao-xiaoyi commented 5 months ago

The optim.pth files are only useful for resuming training. Without them your optimizer states will be lost. If you can bear this, it's okay to comment that out as well.

Okay, I will keep trying when sufficient resources are available in the future. So the remaining question is why the model size changed before and after training. What do you think the reason is? Or do we even need to care about it?

ChrisLiu6 commented 5 months ago

The optim.pth files are only useful for resuming training. Without them your optimizer states will be lost. If you can bear this, it's okay to comment that out as well.

Okay, I will keep trying when sufficient resources are available in the future. So the remaining question is why the model size changed before and after training. What do you think the reason is? Or do we even need to care about it?

I checked a model we trained earlier, and the same phenomenon indeed shows up, so it should be fine. The reason is probably that the format-conversion script saves some extra data: for example, when a tensor b = a[:2] is saved as a view of a, the whole of a is actually saved. In any case, you should not need to worry about this.
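A small, hedged illustration of that effect (sizes and file names are arbitrary): torch.save serializes the whole underlying storage of a view, so cloning the slice before saving keeps the file at the expected size.

```python
import os
import torch

a = torch.zeros(14_000_000)   # ~56 MB of float32
b = a[:2_000_000]             # a view into the same storage as `a`

torch.save(b, "view.pth")            # serializes the full storage of `a`
torch.save(b.clone(), "clone.pth")   # serializes only the sliced data

print(os.path.getsize("view.pth"))   # roughly the size of `a`
print(os.path.getsize("clone.pth"))  # roughly 1/7 of that
```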