axolotl-ai-cloud / axolotl

https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Qwen MoE fine-tune error #1495

Open manishiitg opened 5 months ago

manishiitg commented 5 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

Fine-tuning Qwen/Qwen1.5-MoE-A2.7B with the config below should train without errors.

Current behaviour

Training stalls after a few steps and then crashes with NCCL collective-operation (ALLREDUCE) timeouts:

(en-hi-spot, pid=18593)   0%|          | 4/1920 [46:58<304:50:24, 572.77s/it][E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800221 milliseconds before timing out.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:475] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800233 milliseconds before timing out.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:475] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800275 milliseconds before timing out.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:475] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=10942336, NumelOut=10942336, Timeout(ms)=1800000) ran for 1800289 milliseconds before timing out.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:475] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800340 milliseconds before timing out.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800525 milliseconds before timing out.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800549 milliseconds before timing out.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:916] [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=10942336, NumelOut=10942336, Timeout(ms)=1800000) ran for 1800289 milliseconds before timing out.
(en-hi-spot, pid=18593) terminate called after throwing an instance of 'std::runtime_error'
(en-hi-spot, pid=18593)   what():  [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=10942336, NumelOut=10942336, Timeout(ms)=1800000) ran for 1800289 milliseconds before timing out.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:916] [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800275 milliseconds before timing out.
(en-hi-spot, pid=18593) terminate called after throwing an instance of 'std::runtime_error'
(en-hi-spot, pid=18593)   what():  [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800275 milliseconds before timing out.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:916] [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800340 milliseconds before timing out.
(en-hi-spot, pid=18593) terminate called after throwing an instance of 'std::runtime_error'
(en-hi-spot, pid=18593)   what():  [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800340 milliseconds before timing out.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:916] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800233 milliseconds before timing out.
(en-hi-spot, pid=18593) terminate called after throwing an instance of 'std::runtime_error'
(en-hi-spot, pid=18593)   what():  [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800233 milliseconds before timing out.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800549 milliseconds before timing out.
(en-hi-spot, pid=18593) terminate called after throwing an instance of 'std::runtime_error'
(en-hi-spot, pid=18593)   what():  [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800549 milliseconds before timing out.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800525 milliseconds before timing out.
(en-hi-spot, pid=18593) terminate called after throwing an instance of 'std::runtime_error'
(en-hi-spot, pid=18593)   what():  [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800525 milliseconds before timing out.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
(en-hi-spot, pid=18593) [E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800221 milliseconds before timing out.
(en-hi-spot, pid=18593) terminate called after throwing an instance of 'std::runtime_error'
(en-hi-spot, pid=18593)   what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=13807, OpType=ALLREDUCE, NumelIn=11274112, NumelOut=11274112, Timeout(ms)=1800000) ran for 1800221 milliseconds before timing out.
(en-hi-spot, pid=18593) [2024-04-08 07:17:53,186] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 92 closing signal SIGTERM
(en-hi-spot, pid=18593) [2024-04-08 07:17:53,187] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 93 closing signal SIGTERM
(en-hi-spot, pid=18593) [2024-04-08 07:17:53,187] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 94 closing signal SIGTERM
(en-hi-spot, pid=18593) [2024-04-08 07:17:53,187] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 96 closing signal SIGTERM
(en-hi-spot, pid=18593) [2024-04-08 07:17:53,187] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 97 closing signal SIGTERM
(en-hi-spot, pid=18593) [2024-04-08 07:17:53,187] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 98 closing signal SIGTERM
(en-hi-spot, pid=18593) [2024-04-08 07:17:53,187] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 99 closing signal SIGTERM
(en-hi-spot, pid=18593) [2024-04-08 07:18:23,188] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 93 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
(en-hi-spot, pid=18593) [2024-04-08 07:18:38,360] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 96 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
(en-hi-spot, pid=18593) [2024-04-08 07:18:48,002] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 3 (pid: 95) of binary: /root/miniconda3/envs/py3.10/bin/python3
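
(For context: the 1800000 ms in the watchdog messages is the default 30-minute torch.distributed collective timeout, so the ALLREDUCE at step 4 is hanging until the watchdog kills the run rather than failing outright.) A minimal sketch of raising that timeout from the training YAML, assuming the ddp_timeout key (in seconds) is forwarded to the underlying transformers TrainingArguments; treat the key name as an assumption rather than a confirmed axolotl option:

# Sketch only (assumption): lengthen the collective timeout so a slow step
# does not trip the 30-minute NCCL watchdog. Assumes `ddp_timeout` (seconds)
# is passed through to transformers.TrainingArguments; 7200 s = 2 hours.
ddp_timeout: 7200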

Steps to reproduce

docker run --gpus all \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v ~/sky_workdir:/sky_workdir \
  -v /root/.cache:/root/.cache \
  -v /sky-notebook:/sky-notebook \
  -e "WANDB_API_KEY=1d2a6c1df7576a38308685e2d1a26dbb5cdb53ac" \
  winglian/axolotl:main-20240408-py3.10-cu118-2.1.2 \
  accelerate launch -m axolotl.cli.train /sky_workdir/hi-qwen-moe.yaml --deepspeed /sky_workdir/zero2.json

Config yaml

base_model: Qwen/Qwen1.5-MoE-A2.7B
trust_remote_code: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: manishiitg/aditi-syn-train-small-v3
    type: completion

# 25 has only synthetic data, and has judge-removed data
hub_model_id: manishiitg/open-aditi-chat-hi-1.25-moe
hf_use_auth_token: true

wandb_project: open-aditi-chat-hi-1.25--moe

dataset_prepared_path: manishiitg
push_dataset_to_hub: manishiitg
val_set_size: .1
output_dir: /sky-notebook/manishiitg/open-aditi-chat-hi-1.25--moe

sequence_len: 2048  # supports up to 32k
sample_packing: false
pad_to_sequence_len: false

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 3
num_epochs: 2
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
auto_resume_from_checkpoints: true  ## manage checkpoint resume from here
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 2
eval_table_size:
eval_table_max_new_tokens: 128
save_steps: 20 ## increase based on your dataset
save_strategy: steps
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
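
One thing worth flagging against the reproduce command above: deepspeed: is left empty in this YAML while the launcher passes --deepspeed /sky_workdir/zero2.json, so ZeRO-2 is only being selected on the command line. If you want the config to be self-contained, a sketch (path copied from the reproduce command; adjust to wherever zero2.json actually lives):

# Sketch only: reference the same ZeRO-2 JSON from the YAML so the run is
# reproducible from the config alone (equivalent to the --deepspeed flag).
deepspeed: /sky_workdir/zero2.json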

Possible solution

No response

Which Operating Systems are you using?

Linux (running the Docker image above)
Python Version

3.10

axolotl branch-commit

main


winglian commented 5 months ago

@manishiitg Is that the correct config/YAML you submitted? It says Mistral, but the title of this issue says Qwen MoE.

manishiitg commented 5 months ago

Sorry, this is the correct config:

base_model: Qwen/Qwen1.5-MoE-A2.7B
trust_remote_code: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: manishiitg/aditi-syn-train-small-v3
    type: completion

# 25 has only synthetic data, and has judge-removed data
hub_model_id: manishiitg/open-aditi-chat-hi-1.25-moe
hf_use_auth_token: true

wandb_project: open-aditi-chat-hi-1.25--moe

dataset_prepared_path: manishiitg
push_dataset_to_hub: manishiitg
val_set_size: .1
output_dir: /sky-notebook/manishiitg/open-aditi-chat-hi-1.25--moe

sequence_len: 2048  # supports up to 32k
sample_packing: false
pad_to_sequence_len: false

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 3
num_epochs: 2
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
auto_resume_from_checkpoints: true  ## manage checkpoint resume from here
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
evals_per_epoch: 2
eval_table_size:
eval_table_max_new_tokens: 128
save_steps: 20 ## increase based on your dataset
save_strategy: steps
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:

NanoCode012 commented 5 months ago

Would you be able to update the original issue with the correct config? Secondly, have you tried the docs/nccl guide to see if it helps?

winglian commented 5 months ago

The issue I'm seeing is during the backward step in the accelerator.

0-hero commented 4 months ago

+1, this keeps happening for Mixtral-8x22B as well

manishiitg commented 4 months ago

@NanoCode012 I've updated the original issue with the correct YAML.