ContextualAI / gritlm

Generative Representational Instruction Tuning
https://arxiv.org/abs/2402.09906
MIT License

When I was preparing to train the qwen2 model with the gritlm project, I encountered this error: AssertionError: Some of models are not wrapped in DistributedDataParallel. Make sure you are running DDP with proper initializations. #43

Closed YuvalCheung closed 3 days ago

YuvalCheung commented 6 days ago

When I executed bash train_gritlm_7b.sh, I encountered the following error:

[default2]:[rank2]: AssertionError: Some of models are not wrapped in DistributedDataParallel. Make sure you are running DDP with proper initializations.
[default0]:[rank0]: Traceback (most recent call last):
[default0]:[rank0]:   File "/usr/local/miniconda3/envs/gritlm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[default0]:[rank0]:     return _run_code(code, main_globals, None,
[default0]:[rank0]:   File "/usr/local/miniconda3/envs/gritlm/lib/python3.10/runpy.py", line 86, in _run_code
[default0]:[rank0]:     exec(code, run_globals)
[default0]:[rank0]:   File "/mnt/home/zhangyu/gritlm-main/gritlm/training/run.py", line 440, in <module>
[default0]:[rank0]:     main()
[default0]:[rank0]:   File "/mnt/home/zhangyu/gritlm-main/gritlm/training/run.py", line 422, in main
[default0]:[rank0]:     trainer.train()
[default0]:[rank0]:   File "/usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
[default0]:[rank0]:     return inner_training_loop(
[default0]:[rank0]:   File "/mnt/home/zhangyu/gritlm-main/gritlm/training/gradcache_trainer.py", line 691, in _inner_training_loop
[default0]:[rank0]:     loss_emb = gc(inputs["query"], inputs["passage"], no_sync_except_last=no_sync_except_last)
[default0]:[rank0]:   File "/usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/grad_cache/grad_cache.py", line 70, in __call__
[default0]:[rank0]:     return self.cache_step(*args, **kwargs)
[default0]:[rank0]:   File "/usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/grad_cache/grad_cache.py", line 262, in cache_step
[default0]:[rank0]:     assert all(map(lambda m: isinstance(m, nn.parallel.DistributedDataParallel), self.models)), \
[default0]:[rank0]: AssertionError: Some of models are not wrapped in DistributedDataParallel. Make sure you are running DDP with proper initializations.
[default0]:wandb: - 0.000 MB of 0.000 MB uploaded
[default0]:wandb: You can sync this run to the cloud by running:
[default0]:wandb: wandb sync /mnt/home/zhangyu/gritlm-main/gritlm/wandb/offline-run-20240630_002800-8j46ny1y
[default0]:wandb: Find logs at: ./wandb/offline-run-20240630_002800-8j46ny1y/logs
[default0]:wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.

To train the qwen2 model, I modified the train_gritlm_7b.sh file:

#!/bin/bash
#SBATCH --job-name=gritlm
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1          # crucial - only 1 task per dist per node!
#SBATCH --hint=nomultithread         # we get physical cores not logical
#SBATCH --partition=a3
#SBATCH --gres=gpu:8                 # number of gpus
#SBATCH --time 999:00:00             # maximum execution time (HH:MM:SS)
#SBATCH --output=/data/niklas/jobs/%x-%j.out           # output file name
#SBATCH --exclusive

######################
### Set environment ###
######################
cd /mnt/home/zhangyu/gritlm-main/gritlm
export WANDB_PROJECT="gritlm"
#NCCL_ASYNC_ERROR_HANDLING=1
# export WANDB_PROJECT="gritlm"
export WANDB_MODE="offline"
# so processes know who to talk to
GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_PORT=6050
MASTER_ADDR=localhost
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
######################

# OUT_DIR="/mnt/home/zhangyu/output/test_llama3_8b_6"
OUT_DIR="/mnt/home/zhangyu/output/test_qwen2_8"
MODEL="/mnt/tenant-home_speed/AIM/model/qwen2_7B_chat"
# MODEL="/mnt/tenant-home_speed/AIM/model/llama3-8b-Instruct"
DATA_DIR="training/toy_data_instruct"

LAUNCHER="accelerate launch \
    --config_file /mnt/home/zhangyu/gritlm-main/scripts/configs/config_8gpusfsdp_m7_qwen.yml \
    --num_machines $NNODES \
    --num_processes $WORLD_SIZE \
    --main_process_ip "$MASTER_ADDR" \
    --main_process_port $MASTER_PORT \
    --machine_rank $NODE_RANK \
    --rdzv_conf rdzv_backend=c10d \
    --max_restarts 0 \
    --tee 3 \
    "

export CMD=" \
    -m training.run \
    --output_dir $OUT_DIR \
    --model_name_or_path $MODEL \
    --train_data $DATA_DIR \
    --learning_rate 2e-5 \
    --lr_scheduler_type linear \
    --warmup_ratio 0.03 \
    --max_steps 1253 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --per_device_generative_bs 1 \
    --dataloader_drop_last \
    --normalized \
    --temperature 0.02 \
    --train_group_size 2 \
    --negatives_cross_device \
    --query_max_len 256 \
    --passage_max_len 2048 \
    --mode unified \
    --logging_steps 1 \
    --bf16 \
    --pooling_method mean \
    --use_unique_indices \
    --loss_gen_type mixed \
    --attn bbcc \
    --attn_implementation sdpa \
    --no_gen_gas \
    --gradient_checkpointing \
    --report_to "tensorboard" \
    --save_strategy "epoch" \
    --num_train_epochs 1 \
    --save_steps 5000 
    "

SRUN_ARGS=" \
    --wait=60 \
    --kill-on-bad-exit=1 \
    "

# clear; srun $SRUN_ARGS --jobid $SLURM_JOB_ID bash -c "$LAUNCHER $CMD" 2>&1

bash -c "$LAUNCHER $CMD"

And in the config_8gpusfsdp_m7_qwen.yml file, I set fsdp_transformer_layer_cls_to_wrap to Qwen2DecoderLayer.
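
As a quick sanity check that the change took effect (path as in the script above):

grep fsdp_transformer_layer_cls_to_wrap \
    /mnt/home/zhangyu/gritlm-main/scripts/configs/config_8gpusfsdp_m7_qwen.yml
# expected to show: fsdp_transformer_layer_cls_to_wrap: Qwen2DecoderLayer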

At the same time, I ported /lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py to a new modeling_qwen2_gritlm.py, applying the same changes that modeling_mistral_gritlm.py makes relative to modeling_mistral.py, so that the model meets the requirements of the gritlm project.

Here is my modified file: modeling_qwen2_gritlm.py.zip
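
For anyone reproducing this, the full set of required changes can be listed by diffing the stock Mistral modeling file against the GritLM variant (the second path below is a placeholder; point it at wherever your copy of the GritLM variant lives):

# The stock transformers file sits under site-packages of the gritlm env;
# the GritLM variant's location is environment-specific.
diff -u \
    /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py \
    path/to/modeling_mistral_gritlm.py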

Finally, when I executed bash train_gritlm_7b.sh, the run failed right at the start; here is the log:

log.txt

Now I am curious whether there is an issue with my .sh script settings or with the modifications made to the qwen2 model. Thank you very much for your assistance.

Muennighoff commented 6 days ago

You need to cd into gritlm/training/GradCache & run pip install -e . in order to get this change https://github.com/ContextualAI/gritlm/tree/main/gritlm/training/GradCache#preface-muennighoff

It seems like you installed GradCache from the upstream repo, but the version bundled in this repo needs to be installed.
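
For example, from the repo root (uninstalling first just ensures the upstream version doesn't shadow the bundled one):

# Replace the upstream grad-cache install with the patched copy bundled in this repo.
pip uninstall -y grad-cache   # only needed if a copy from the upstream repo is installed
cd gritlm/training/GradCache
pip install -e .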

Muennighoff commented 6 days ago

Adjusted the README a bit, lmk if this is better: https://github.com/ContextualAI/gritlm/commit/58ccad88e34e133f6a9680d9ce44854ad26a8c7c

YuvalCheung commented 5 days ago

Thank you for your help. After installing GradCache the correct way, the previous error no longer occurs, but I have run into another one. Have you seen anything similar before?

[default0]:{'loss': 3.384, 'learning_rate': 1.142857142857143e-06, 'epoch': 0.95}
[default0]: 95%|█████████▍| 35/37 [25:31<01:22, 41.20s/it][default0]:
[default0]: 97%|█████████▋| 36/37 [26:12<00:41, 41.16s/it][default0]:
[default0]:100%|██████████| 37/37 [26:58<00:00, 42.63s/it][default0]:07/01/2024 01:54:14 - INFO - accelerate.utils.fsdp_utils -   Saving model to /mnt/home/zhangyu/output/test_mistral_2/tmp-checkpoint-37/pytorch_model_fsdp.bin
[default0]:07/01/2024 02:00:14 - INFO - accelerate.utils.fsdp_utils -   Model saved to /mnt/home/zhangyu/output/test_mistral_2/tmp-checkpoint-37/pytorch_model_fsdp.bin
[default0]:07/01/2024 02:03:23 - INFO - accelerate.utils.fsdp_utils -   Saving Optimizer state to /mnt/home/zhangyu/output/test_mistral_2/tmp-checkpoint-37/optimizer.bin
[default0]:07/01/2024 02:16:40 - INFO - accelerate.utils.fsdp_utils -   Optimizer state saved in /mnt/home/zhangyu/output/test_mistral_2/tmp-checkpoint-37/optimizer.bin
[default0]:[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800033 milliseconds before timing out.
[default7]:[rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800003 milliseconds before timing out.
[default2]:[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default1]:[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800103 milliseconds before timing out.
[default6]:[rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800059 milliseconds before timing out.
[default5]:[rank5]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800114 milliseconds before timing out.
[default3]:[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default4]:[rank4]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default6]:[rank6]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 6] Timeout at NCCL work: 28044, last enqueued NCCL work: 28044, last completed NCCL work: 28043.
[default6]:[rank6]:[E ProcessGroupNCCL.cpp:577] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default6]:[rank6]:[E ProcessGroupNCCL.cpp:583] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[default6]:[rank6]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800059 milliseconds before timing out.
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2633db8897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f26350931b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2635097fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f263509931c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #4: <unknown function> + 0xd6de4 (0x7f2689d76de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default6]:frame #5: <unknown function> + 0x8609 (0x7f2692adc609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default6]:frame #6: clone + 0x43 (0x7f26928a7133 in /lib/x86_64-linux-gnu/libc.so.6)
[default6]:
[default6]:terminate called after throwing an instance of 'c10::DistBackendError'
[default6]:  what():  [PG 0 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800059 milliseconds before timing out.
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2633db8897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f26350931b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2635097fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f263509931c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #4: <unknown function> + 0xd6de4 (0x7f2689d76de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default6]:frame #5: <unknown function> + 0x8609 (0x7f2692adc609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default6]:frame #6: clone + 0x43 (0x7f26928a7133 in /lib/x86_64-linux-gnu/libc.so.6)
[default6]:
[default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2633db8897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default6]:frame #1: <unknown function> + 0xe32e33 (0x7f2634d1be33 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #2: <unknown function> + 0xd6de4 (0x7f2689d76de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default6]:frame #3: <unknown function> + 0x8609 (0x7f2692adc609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default6]:frame #4: clone + 0x43 (0x7f26928a7133 in /lib/x86_64-linux-gnu/libc.so.6)
[default6]:
[default7]:07/01/2024 02:46:51 - WARNING - training.gradcache_trainer -   Checkpoint destination directory /mnt/home/zhangyu/output/test_mistral_2/checkpoint-37 already exists and is non-empty.Saving will proceed but saved results may be invalid.
[default4]:07/01/2024 02:46:51 - WARNING - training.gradcache_trainer -   Checkpoint destination directory /mnt/home/zhangyu/output/test_mistral_2/checkpoint-37 already exists and is non-empty.Saving will proceed but saved results may be invalid.
[default2]:07/01/2024 02:46:51 - WARNING - training.gradcache_trainer -   Checkpoint destination directory /mnt/home/zhangyu/output/test_mistral_2/checkpoint-37 already exists and is non-empty.Saving will proceed but saved results may be invalid.
[default1]:07/01/2024 02:46:51 - WARNING - training.gradcache_trainer -   Checkpoint destination directory /mnt/home/zhangyu/output/test_mistral_2/checkpoint-37 already exists and is non-empty.Saving will proceed but saved results may be invalid.
[default5]:07/01/2024 02:46:51 - WARNING - training.gradcache_trainer -   Checkpoint destination directory /mnt/home/zhangyu/output/test_mistral_2/checkpoint-37 already exists and is non-empty.Saving will proceed but saved results may be invalid.
[default3]:07/01/2024 02:46:51 - WARNING - training.gradcache_trainer -   Checkpoint destination directory /mnt/home/zhangyu/output/test_mistral_2/checkpoint-37 already exists and is non-empty.Saving will proceed but saved results may be invalid.
[default0]:[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 0] Timeout at NCCL work: 28044, last enqueued NCCL work: 28044, last completed NCCL work: 28043.
[default0]:[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default0]:[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[default0]:[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800033 milliseconds before timing out.
[default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbdafed3897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fbdb11ae1b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbdb11b2fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbdb11b431c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #4: <unknown function> + 0xd6de4 (0x7fbe05e95de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default0]:frame #5: <unknown function> + 0x8609 (0x7fbe0ebfb609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default0]:frame #6: clone + 0x43 (0x7fbe0e9c6133 in /lib/x86_64-linux-gnu/libc.so.6)
[default0]:
[default0]:terminate called after throwing an instance of 'c10::DistBackendError'
[default0]:  what():  [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800033 milliseconds before timing out.
[default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbdafed3897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fbdb11ae1b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbdb11b2fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbdb11b431c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #4: <unknown function> + 0xd6de4 (0x7fbe05e95de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default0]:frame #5: <unknown function> + 0x8609 (0x7fbe0ebfb609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default0]:frame #6: clone + 0x43 (0x7fbe0e9c6133 in /lib/x86_64-linux-gnu/libc.so.6)
[default0]:
[default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbdafed3897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default0]:frame #1: <unknown function> + 0xe32e33 (0x7fbdb0e36e33 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #2: <unknown function> + 0xd6de4 (0x7fbe05e95de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default0]:frame #3: <unknown function> + 0x8609 (0x7fbe0ebfb609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default0]:frame #4: clone + 0x43 (0x7fbe0e9c6133 in /lib/x86_64-linux-gnu/libc.so.6)
[default0]:
[default1]:[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 28044, last enqueued NCCL work: 28044, last completed NCCL work: 28043.
[default1]:[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default1]:[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[default1]:[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800103 milliseconds before timing out.
[default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa5705e5897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fa5718c01b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fa5718c4fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa5718c631c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #4: <unknown function> + 0xd6de4 (0x7fa5c65a6de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default1]:frame #5: <unknown function> + 0x8609 (0x7fa5cf30c609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default1]:frame #6: clone + 0x43 (0x7fa5cf0d7133 in /lib/x86_64-linux-gnu/libc.so.6)
[default1]:
[default1]:terminate called after throwing an instance of 'c10::DistBackendError'
[default1]:  what():  [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800103 milliseconds before timing out.
[default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa5705e5897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fa5718c01b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fa5718c4fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa5718c631c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #4: <unknown function> + 0xd6de4 (0x7fa5c65a6de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default1]:frame #5: <unknown function> + 0x8609 (0x7fa5cf30c609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default1]:frame #6: clone + 0x43 (0x7fa5cf0d7133 in /lib/x86_64-linux-gnu/libc.so.6)
[default1]:
[default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa5705e5897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default1]:frame #1: <unknown function> + 0xe32e33 (0x7fa571548e33 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #2: <unknown function> + 0xd6de4 (0x7fa5c65a6de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default1]:frame #3: <unknown function> + 0x8609 (0x7fa5cf30c609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default1]:frame #4: clone + 0x43 (0x7fa5cf0d7133 in /lib/x86_64-linux-gnu/libc.so.6)
[default1]:
[default7]:[rank7]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 7] Timeout at NCCL work: 28044, last enqueued NCCL work: 28044, last completed NCCL work: 28043.
[default7]:[rank7]:[E ProcessGroupNCCL.cpp:577] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default7]:[rank7]:[E ProcessGroupNCCL.cpp:583] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[default7]:[rank7]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800003 milliseconds before timing out.
[default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76b97ca897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f76baaa51b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f76baaa9fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f76baaab31c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #4: <unknown function> + 0xd6de4 (0x7f770f78ede4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default7]:frame #5: <unknown function> + 0x8609 (0x7f77184f4609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default7]:frame #6: clone + 0x43 (0x7f77182bf133 in /lib/x86_64-linux-gnu/libc.so.6)
[default7]:
[default7]:terminate called after throwing an instance of 'c10::DistBackendError'
[default7]:  what():  [PG 0 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800003 milliseconds before timing out.
[default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76b97ca897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f76baaa51b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f76baaa9fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f76baaab31c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #4: <unknown function> + 0xd6de4 (0x7f770f78ede4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default7]:frame #5: <unknown function> + 0x8609 (0x7f77184f4609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default7]:frame #6: clone + 0x43 (0x7f77182bf133 in /lib/x86_64-linux-gnu/libc.so.6)
[default7]:
[default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76b97ca897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default7]:frame #1: <unknown function> + 0xe32e33 (0x7f76ba72de33 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #2: <unknown function> + 0xd6de4 (0x7f770f78ede4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default7]:frame #3: <unknown function> + 0x8609 (0x7f77184f4609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default7]:frame #4: clone + 0x43 (0x7f77182bf133 in /lib/x86_64-linux-gnu/libc.so.6)
[default7]:
[default5]:[rank5]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 5] Timeout at NCCL work: 28044, last enqueued NCCL work: 28044, last completed NCCL work: 28043.
[default5]:[rank5]:[E ProcessGroupNCCL.cpp:577] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default5]:[rank5]:[E ProcessGroupNCCL.cpp:583] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
[default5]:[rank5]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800114 milliseconds before timing out.
[default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fad4e413897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fad4f6ee1b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fad4f6f2fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fad4f6f431c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #4: <unknown function> + 0xd6de4 (0x7fada43d9de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default5]:frame #5: <unknown function> + 0x8609 (0x7fadad13f609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default5]:frame #6: clone + 0x43 (0x7fadacf0a133 in /lib/x86_64-linux-gnu/libc.so.6)
[default5]:
[default5]:terminate called after throwing an instance of 'c10::DistBackendError'
[default5]:  what():  [PG 0 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800114 milliseconds before timing out.
[default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fad4e413897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fad4f6ee1b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fad4f6f2fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fad4f6f431c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #4: <unknown function> + 0xd6de4 (0x7fada43d9de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default5]:frame #5: <unknown function> + 0x8609 (0x7fadad13f609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default5]:frame #6: clone + 0x43 (0x7fadacf0a133 in /lib/x86_64-linux-gnu/libc.so.6)
[default5]:
[default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fad4e413897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default5]:frame #1: <unknown function> + 0xe32e33 (0x7fad4f376e33 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #2: <unknown function> + 0xd6de4 (0x7fada43d9de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default5]:frame #3: <unknown function> + 0x8609 (0x7fadad13f609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default5]:frame #4: clone + 0x43 (0x7fadacf0a133 in /lib/x86_64-linux-gnu/libc.so.6)
[default5]:
[default4]:[rank4]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 4] Timeout at NCCL work: 28044, last enqueued NCCL work: 28044, last completed NCCL work: 28043.
[default4]:[rank4]:[E ProcessGroupNCCL.cpp:577] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default4]:[rank4]:[E ProcessGroupNCCL.cpp:583] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[default4]:[rank4]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f05ffc95897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f0600f701b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0600f74fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0600f7631c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #4: <unknown function> + 0xd6de4 (0x7f0655c58de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default4]:frame #5: <unknown function> + 0x8609 (0x7f065e9be609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default4]:frame #6: clone + 0x43 (0x7f065e789133 in /lib/x86_64-linux-gnu/libc.so.6)
[default4]:
[default4]:terminate called after throwing an instance of 'c10::DistBackendError'
[default4]:  what():  [PG 0 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f05ffc95897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f0600f701b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0600f74fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0600f7631c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #4: <unknown function> + 0xd6de4 (0x7f0655c58de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default4]:frame #5: <unknown function> + 0x8609 (0x7f065e9be609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default4]:frame #6: clone + 0x43 (0x7f065e789133 in /lib/x86_64-linux-gnu/libc.so.6)
[default4]:
[default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f05ffc95897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default4]:frame #1: <unknown function> + 0xe32e33 (0x7f0600bf8e33 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #2: <unknown function> + 0xd6de4 (0x7f0655c58de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default4]:frame #3: <unknown function> + 0x8609 (0x7f065e9be609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default4]:frame #4: clone + 0x43 (0x7f065e789133 in /lib/x86_64-linux-gnu/libc.so.6)
[default4]:
[default3]:[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 28044, last enqueued NCCL work: 28044, last completed NCCL work: 28043.
[default3]:[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default3]:[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[default3]:[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8f32344897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8f3361f1b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8f33623fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8f3362531c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #4: <unknown function> + 0xd6de4 (0x7f8f88347de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default3]:frame #5: <unknown function> + 0x8609 (0x7f8f910ad609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default3]:frame #6: clone + 0x43 (0x7f8f90e78133 in /lib/x86_64-linux-gnu/libc.so.6)
[default3]:
[default3]:terminate called after throwing an instance of 'c10::DistBackendError'
[default3]:  what():  [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8f32344897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8f3361f1b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8f33623fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8f3362531c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #4: <unknown function> + 0xd6de4 (0x7f8f88347de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default3]:frame #5: <unknown function> + 0x8609 (0x7f8f910ad609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default3]:frame #6: clone + 0x43 (0x7f8f90e78133 in /lib/x86_64-linux-gnu/libc.so.6)
[default3]:
[default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8f32344897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default3]:frame #1: <unknown function> + 0xe32e33 (0x7f8f332a7e33 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #2: <unknown function> + 0xd6de4 (0x7f8f88347de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default3]:frame #3: <unknown function> + 0x8609 (0x7f8f910ad609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default3]:frame #4: clone + 0x43 (0x7f8f90e78133 in /lib/x86_64-linux-gnu/libc.so.6)
[default3]:
[default2]:[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 28044, last enqueued NCCL work: 28044, last completed NCCL work: 28043.
[default2]:[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default2]:[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[default2]:[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8a75ed0897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8a771ab1b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8a771affd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8a771b131c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default2]:frame #4: <unknown function> + 0xd6de4 (0x7f8acbe8fde4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default2]:frame #5: <unknown function> + 0x8609 (0x7f8ad4bf5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default2]:frame #6: clone + 0x43 (0x7f8ad49c0133 in /lib/x86_64-linux-gnu/libc.so.6)
[default2]:
[default2]:terminate called after throwing an instance of 'c10::DistBackendError'
[default2]:  what():  [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8a75ed0897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8a771ab1b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8a771affd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8a771b131c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default2]:frame #4: <unknown function> + 0xd6de4 (0x7f8acbe8fde4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default2]:frame #5: <unknown function> + 0x8609 (0x7f8ad4bf5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default2]:frame #6: clone + 0x43 (0x7f8ad49c0133 in /lib/x86_64-linux-gnu/libc.so.6)
[default2]:
[default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8a75ed0897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default2]:frame #1: <unknown function> + 0xe32e33 (0x7f8a76e33e33 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default2]:frame #2: <unknown function> + 0xd6de4 (0x7f8acbe8fde4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default2]:frame #3: <unknown function> + 0x8609 (0x7f8ad4bf5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default2]:frame #4: clone + 0x43 (0x7f8ad49c0133 in /lib/x86_64-linux-gnu/libc.so.6)
[default2]:
W0701 02:46:52.647000 139689317217472 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 771494 closing signal SIGTERM
W0701 02:46:52.647000 139689317217472 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 771495 closing signal SIGTERM
W0701 02:46:52.647000 139689317217472 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 771496 closing signal SIGTERM
W0701 02:46:52.647000 139689317217472 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 771497 closing signal SIGTERM
W0701 02:46:52.647000 139689317217472 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 771498 closing signal SIGTERM
W0701 02:46:52.648000 139689317217472 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 771500 closing signal SIGTERM
W0701 02:46:52.648000 139689317217472 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 771501 closing signal SIGTERM
E0701 02:46:54.685000 139689317217472 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 5 (pid: 771499) of binary: /usr/local/miniconda3/envs/gritlm/bin/python
Traceback (most recent call last):
  File "/usr/local/miniconda3/envs/gritlm/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1084, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
training.run FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-01_02:46:52
  host      : ctmt240625013845lar-558799cd4d-sdndz
  rank      : 5 (local_rank: 5)
  exitcode  : -6 (pid: 771499)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 771499
=======================================================
Muennighoff commented 5 days ago

I think this is the timeout issue listed at the top here: https://github.com/ContextualAI/gritlm?tab=readme-ov-file#known-issues

YuvalCheung commented 5 days ago

The situation described in the Known issues section is identical to what I'm encountering. I'm currently running on a single node with 8 GPUs (1*8). Could you please advise how to make sure the saving process isn't terminated?
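
One mitigation I'm considering (my own guess, not something taken from the Known issues section) is to raise the distributed process-group timeout above the default 30 minutes, so the NCCL watchdog doesn't fire while the FSDP checkpoint is still being written, e.g. by adding one flag to the $CMD block in train_gritlm_7b.sh:

# --ddp_timeout is a standard transformers TrainingArguments option (seconds,
# default 1800); whether training.run accepts it depends on how its argument
# classes are defined.
    --ddp_timeout 7200 \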

Also, I have a question about the file structure of the model produced by training, which looks like this:

(gritlm) root@ctmt240625013845lar-558799cd4d-sdndz:/mnt/home/zhangyu/output/test_mistral_2# tree
.
├── checkpoint-37
│   ├── optimizer.bin
│   ├── pytorch_model.bin
│   ├── pytorch_model_fsdp.bin
│   ├── rng_state_0.pth
│   ├── rng_state_1.pth
│   ├── rng_state_2.pth
│   ├── rng_state_3.pth
│   ├── rng_state_4.pth
│   ├── rng_state_5.pth
│   ├── rng_state_6.pth
│   ├── rng_state_7.pth
│   ├── scheduler.pt
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   ├── tokenizer.model
│   ├── tokenizer_config.json
│   ├── trainer_state.json
│   └── training_args.bin
├── dataset_num_samples.json
└── runs
    └── Jul01_01-10-06_ctmt240625013845lar-558799cd4d-sdndz
        └── events.out.tfevents.1719767875.ctmt240625013845lar-558799cd4d-sdndz.771494.0

3 directories, 20 files

It looks quite different from the file structure at https://huggingface.co/GritLM/GritLM-7B/tree/main:

# tree
.
├── README.md
├── config.json
├── dataset_num_samples.json
├── generation_config.json
├── model-00001-of-00003.safetensors
├── model-00002-of-00003.safetensors
├── model-00003-of-00003.safetensors
├── model.safetensors.index.json
├── modeling_gritlm7b.py
├── pytorch_model-00001-of-00003.bin
├── pytorch_model-00002-of-00003.bin
├── pytorch_model-00003-of-00003.bin
├── pytorch_model.bin.index.json
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
├── tokenizer_config.json
└── training_args.bin

0 directories, 18 files

Thank you for your help.

Muennighoff commented 4 days ago

1) Sorry, I don't know how to solve it beyond what is mentioned in the Known issues section. 2) We shard the checkpoint via scripts/shard.py (on main) for easier usage; added this to the README.
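
Roughly, the sharding amounts to reloading the merged weights and re-saving them with a shard size limit. A minimal sketch of that idea (not scripts/shard.py itself; it assumes config.json and the tokenizer files sit next to the weights):

python - <<'EOF'
import torch
from transformers import AutoModelForCausalLM

# Path taken from this thread; adjust to your own output directory. Assumes
# config.json and the tokenizer files are present alongside the weights.
ckpt = "/mnt/home/zhangyu/output/test_mistral_2/checkpoint-37"
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16)
# Re-save in ~5GB shards, giving a layout similar to GritLM/GritLM-7B on the Hub.
model.save_pretrained(ckpt + "-sharded", max_shard_size="5GB")
EOF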

YuvalCheung commented 4 days ago

Thank you for your help.

When I don't set no_emb_gas and no_gen_gas, the NCCL timeout issue disappears. Am I right that these two options should have no effect on the model's capabilities?

Also, after training the model, I obtained these files:

(gritlm) root@ctmt240625013845lar-558799cd4d-sdndz:/mnt/home/zhangyu/output/test_mistral_13# tree .
.
├── checkpoint-1
│   ├── model.safetensors
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   ├── tokenizer.model
│   ├── tokenizer_config.json
│   ├── trainer_state.json
│   └── training_args.bin
├── checkpoint-2
│   ├── model.safetensors
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   ├── tokenizer.model
│   ├── tokenizer_config.json
│   ├── trainer_state.json
│   └── training_args.bin
├── config.json
├── dataset_num_samples.json
├── full_state_dict
│   ├── model.safetensors
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   ├── tokenizer.model
│   ├── tokenizer_config.json
│   └── training_args.bin
├── model.safetensors
├── runs
│   └── Jul02_00-46-47_ctmt240625013845lar-558799cd4d-sdndz
│       └── events.out.tfevents.1719852887.ctmt240625013845lar-558799cd4d-sdndz.1341295.0

Then I copied the config.json file to the checkpoint-1 directory.

When deploying the model with vLLM, an error occurs.

vLLM deployment script:

CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server \
    --served-model-name ZTEAIM-Gritm-Base \
    --model "/mnt/home/zhangyu/output/test_mistral_13/checkpoint-1" \
    --port 6000 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.95 \
    --dtype bfloat16 \
    --max-model-len 4096 \
    --api-key 10344626

I received the following error message:

(vllm) root@ctmt240625013845lar-558799cd4d-sdndz:/mnt/home/zhangyu/model_deployment# bash launch_gritlm.sh 
INFO 07-02 10:03:12 api_server.py:209] args: Namespace(host=None, port=6000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='10344626', served_model_name='Gritm-Base', chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='/mnt/home/zhangyu/output/test_mistral_13/checkpoint-1', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='bfloat16', kv_cache_dtype='auto', max_model_len=4096, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.95, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='cuda', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-02 10:03:13 llm_engine.py:79] Initializing an LLM engine with config: model='/mnt/home/zhangyu/output/test_mistral_13/checkpoint-1', tokenizer='/mnt/home/zhangyu/output/test_mistral_13/checkpoint-1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Traceback (most recent call last):
  File "/usr/local/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 217, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 625, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 321, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 366, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 120, in __init__
    self._init_workers()
  File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 164, in _init_workers
    self._run_workers("load_model")
  File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1012, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 102, in load_model
    self.model_runner.load_model()
  File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 84, in load_model
    self.model = get_model(self.model_config, self.device_config,
  File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader.py", line 86, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/mistral.py", line 374, in load_weights
    param = params_dict[name]
KeyError: 'model.lm_head.weight'

If I change the model being deployed to GritLM-7B, the deployment succeeds:

(vllm) root@ctmt240625013845lar-558799cd4d-sdndz:/mnt/home/zhangyu/model_deployment# bash launch_gritlm.sh 
INFO 07-02 10:04:54 api_server.py:209] args: Namespace(host=None, port=6000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='10344626', served_model_name='Gritm-Base', chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='/mnt/tenant-home_speed/AIM/model/GritLM-7B', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='bfloat16', kv_cache_dtype='auto', max_model_len=4096, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.95, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='cuda', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-02 10:04:54 llm_engine.py:79] Initializing an LLM engine with config: model='/mnt/tenant-home_speed/AIM/model/GritLM-7B', tokenizer='/mnt/tenant-home_speed/AIM/model/GritLM-7B', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 07-02 10:05:01 llm_engine.py:337] # GPU blocks: 31287, # CPU blocks: 2048
INFO 07-02 10:05:06 model_runner.py:666] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-02 10:05:06 model_runner.py:670] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-02 10:05:09 model_runner.py:738] Graph capturing finished in 4 secs.
INFO 07-02 10:05:10 serving_chat.py:260] Using default chat template:
INFO 07-02 10:05:10 serving_chat.py:260] {{ bos_token }}{% for message in messages %}
INFO 07-02 10:05:10 serving_chat.py:260] {% if message['role'] == 'user' %}
INFO 07-02 10:05:10 serving_chat.py:260] {{ '<|user|>
INFO 07-02 10:05:10 serving_chat.py:260] ' + message['content'] }}
INFO 07-02 10:05:10 serving_chat.py:260] {% elif message['role'] == 'assistant' %}
INFO 07-02 10:05:10 serving_chat.py:260] {{ '<|assistant|>
INFO 07-02 10:05:10 serving_chat.py:260] '  + message['content'] + eos_token }}
INFO 07-02 10:05:10 serving_chat.py:260] {% endif %}
INFO 07-02 10:05:10 serving_chat.py:260] {% if loop.last and add_generation_prompt %}
INFO 07-02 10:05:10 serving_chat.py:260] {{ '<|assistant|>' }}
INFO 07-02 10:05:10 serving_chat.py:260] {% endif %}
INFO 07-02 10:05:10 serving_chat.py:260] {% endfor %}
INFO:     Started server process [1554267]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:6000 (Press CTRL+C to quit)
INFO 07-02 10:05:20 metrics.py:161] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 07-02 10:05:30 metrics.py:161] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
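
For reference, once the server is up like this it can be queried through vllm's OpenAI-compatible endpoint. Below is a minimal sketch in Python; the prompt text and max_tokens are only illustrative, while the served name, port, and API key are taken from the launch script and log above.

import requests

# Sketch only: query the running vllm server via its OpenAI-compatible
# /v1/completions endpoint. The prompt follows the chat template printed
# in the startup log and is purely illustrative.
response = requests.post(
    "http://localhost:6000/v1/completions",
    headers={"Authorization": "Bearer 10344626"},
    json={
        "model": "Gritm-Base",
        "prompt": "<|user|>\nWhat is generative representational instruction tuning?\n<|assistant|>\n",
        "max_tokens": 64,
    },
)
print(response.json())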

Is there something missing from the model I trained? Here is my training script:

#!/bin/bash
#SBATCH --job-name=gritlm
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1          # crucial - only 1 task per dist per node!
#SBATCH --hint=nomultithread         # we get physical cores not logical
#SBATCH --partition=a3
#SBATCH --gres=gpu:8                 # number of gpus
#SBATCH --time 999:00:00             # maximum execution time (HH:MM:SS)
#SBATCH --output=/data/niklas/jobs/%x-%j.out           # output file name
#SBATCH --exclusive

######################
### Set environment ###
######################
cd /mnt/home/zhangyu/gritlm-main/gritlm
export WANDB_PROJECT="gritlm"
#NCCL_ASYNC_ERROR_HANDLING=1
export WANDB_MODE="offline"
# so processes know who to talk to
GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_PORT=6050
MASTER_ADDR=localhost
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
######################

# OUT_DIR="/mnt/home/zhangyu/output/test_llama3_8b_7"
# OUT_DIR="/mnt/home/zhangyu/output/test_qwen2_10"
OUT_DIR="/mnt/home/zhangyu/output/test_mistral_14"

# MODEL="/mnt/tenant-home_speed/AIM/model/qwen2_7B_chat"
# MODEL="/mnt/tenant-home_speed/AIM/model/llama3-8b-Instruct"
MODEL="/mnt/tenant-home_speed/AIM/model/Mistral-7B-Instruct-v0.1"

DATA_DIR="training/toy_data_instruct"

# YMLPATH="/mnt/home/zhangyu/gritlm-main/scripts/configs/config_8gpusfsdp_m7_qwen.yml"
# YMLPATH="/mnt/home/zhangyu/gritlm-main/scripts/configs/config_8gpusfsdp_m7_llama.yml"
YMLPATH="/mnt/home/zhangyu/gritlm-main/scripts/configs/config_8gpusfsdp_m7.yml"
# YMLPATH="/mnt/home/zhangyu/gritlm-main/scripts/configs/config_8gpusddp_m7.yml"

LAUNCHER="accelerate launch \
    --config_file $YMLPATH \
    --num_machines $NNODES \
    --num_processes $WORLD_SIZE \
    --main_process_ip "$MASTER_ADDR" \
    --main_process_port $MASTER_PORT \
    --machine_rank $NODE_RANK \
    --role $SLURMD_NODENAME: \
    --rdzv_conf rdzv_backend=c10d \
    --max_restarts 0 \
    --tee 3 \
    "

export CMD=" \
    -m training.run \
    --output_dir $OUT_DIR \
    --model_name_or_path $MODEL \
    --train_data $DATA_DIR \
    --learning_rate 2e-5 \
    --lr_scheduler_type linear \
    --warmup_ratio 0.03 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --dataloader_drop_last \
    --normalized \
    --temperature 0.02 \
    --train_group_size 2 \
    --negatives_cross_device \
    --query_max_len 256 \
    --passage_max_len 2048 \
    --mode unified \
    --logging_steps 1 \
    --bf16 \
    --pooling_method mean \
    --use_unique_indices \
    --loss_gen_factor 0.003 \
    --loss_gen_type token \
    --attn bbcc \
    --attn_implementation sdpa \
    --gradient_checkpointing \
    --report_to "tensorboard" \
    --save_strategy "epoch" \
    --save_steps 1 \
    --save_only_model \
    --save_safetensors \
    --max_steps 1500 \
    --ddp_backend gloo \
    --num_train_epochs 1
    "

SRUN_ARGS=" \
    --wait=60 \
    --kill-on-bad-exit=1 \
    "

    # --no_gen_gas \
    # --no_emb_gas \
#     --no_gen_gas \
#       --split_emb \
    # --split_emb_full \
# clear; srun $SRUN_ARGS --jobid $SLURM_JOB_ID bash -c "$LAUNCHER $CMD" 2>&1

#     --max_steps 1253 \

    # --save_strategy "epoch" \

bash -c "$LAUNCHER $CMD"

Thank you again for answering my questions.

Muennighoff commented 4 days ago

When I don't set no_emb_gas and no_gen_gas to True, the nccl timeout issue disappears. Should these two options have no effect on the model's capabilities?

It should not impact capabilities. It may have something to do with the trainer, since setting those flags switches to a different trainer; see the explanation here: https://github.com/ContextualAI/gritlm/blob/0cc9aeab83b90f2e22bcdd2b084d51507c624d95/gritlm/training/arguments.py#L128C27-L129C48

Is there something missing from the model I trained? Here is my training script:

I don't notice a major problem. I would check the safetensors file and compare its keys with the keys of the GritLM-7B model files. Just load them each in memory and check that they have the exact same keys.
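
For example, here is a minimal sketch of such a comparison. The paths are placeholders for your checkpoint and the downloaded GritLM-7B weights, and it assumes each model is stored in a single model.safetensors file.

from safetensors.torch import safe_open

def keys_of(path):
    # Only the key names are needed, so the tensors are never loaded.
    with safe_open(path, framework="pt") as f:
        return set(f.keys())

mine = keys_of("/path/to/your/checkpoint-1/model.safetensors")   # placeholder path
reference = keys_of("/path/to/GritLM-7B/model.safetensors")      # placeholder path

print("only in your checkpoint:", sorted(mine - reference)[:10])
print("only in GritLM-7B:", sorted(reference - mine)[:10])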

Also, FYI, GritLM-7B was trained from the base Mistral 7B, not the Instruct version you are using, but I don't think it matters much.

YuvalCheung commented 3 days ago

I think I know why the model loading failed. I used the following code to inspect the generated .safetensors file:

from safetensors.torch import safe_open

st_file = "/mnt/home/zhangyu/output/test_mistral_16/checkpoint-1/model.safetensors"
with safe_open(st_file, framework="pt") as f:
    # Only the key names matter here, so the tensors are not loaded.
    for name in f.keys():
        print(name)

I found that the parameter names in my model look like this:

......
model.model.layers.8.post_attention_layernorm.weight
model.model.layers.8.self_attn.k_proj.weight
model.model.layers.8.self_attn.o_proj.weight
model.model.layers.8.self_attn.q_proj.weight
model.model.layers.8.self_attn.v_proj.weight
model.model.layers.9.input_layernorm.weight
model.model.layers.9.mlp.down_proj.weight
model.model.layers.9.mlp.gate_proj.weight
model.model.layers.9.mlp.up_proj.weight
model.model.layers.9.post_attention_layernorm.weight
model.model.layers.9.self_attn.k_proj.weight
model.model.layers.9.self_attn.o_proj.weight
model.model.layers.9.self_attn.q_proj.weight
model.model.layers.9.self_attn.v_proj.weight
model.model.norm.weight
......

But when I used the same method to inspect the GritLM-7B and Qwen2 models, the output looked like this:

model.layers.8.self_attn.q_proj.bias
model.layers.8.self_attn.q_proj.weight
model.layers.8.self_attn.v_proj.bias
model.layers.8.self_attn.v_proj.weight
model.layers.9.mlp.gate_proj.weight
model.layers.9.mlp.up_proj.weight
model.layers.9.self_attn.k_proj.bias
model.layers.9.self_attn.k_proj.weight
model.layers.9.self_attn.o_proj.weight
model.layers.9.self_attn.q_proj.bias
model.layers.9.self_attn.q_proj.weight
model.layers.9.self_attn.v_proj.bias
model.layers.9.self_attn.v_proj.weight

Obviously, the model I trained has an extra "model." prefix on every parameter name.

I'm not sure which setting causes this; it could be an environment issue or a library version problem. The only workaround I can think of right now is to strip the prefix from the names after training, with code like the following. It's a crude approach, but it should work.

from safetensors.torch import safe_open, save_file
import sys

# st_file = "/mnt/home/zhangyu/output/test_mistral_17/checkpoint-1/model.safetensors"
st_file = sys.argv[1]
new_file = "/mnt/home/zhangyu/output/test/model.safetensors"

save_dict = {}
with safe_open(st_file, framework="pt") as f:
    for name in f.keys():
        print(name)
        # Strip the extra leading "model." so the keys match what vllm expects.
        new_name = name[len("model."):] if name.startswith("model.") else name
        save_dict[new_name] = f.get_tensor(name)
save_file(save_dict, new_file)
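
If the snippet above is saved as, say, strip_prefix.py (a name chosen here just for illustration), it can be run as python strip_prefix.py /path/to/checkpoint-1/model.safetensors, with new_file changed to wherever the fixed checkpoint should be written.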

I don't know if others will encounter this problem, but I will investigate the cause of this issue later. For now, I'll focus on getting the project to run. Thank you for patiently answering my questions.

Muennighoff commented 3 days ago

Oh yes I think that is expected & there is a script for that here: https://github.com/ContextualAI/gritlm/blob/main/scripts/reformat_statedict.py

I've added it to the README, sorry!

YuvalCheung commented 3 days ago

Thank you so much for your answer!

I have one more question: is it possible to move the step that saves config.json in https://github.com/ContextualAI/gritlm/blob/main/gritlm/training/run.py to before training starts?

The modified code looks like this:

    # Save tokenizer & config for easy usage afterwards
    if trainer.is_world_process_zero(): 
        tokenizer.save_pretrained(training_args.output_dir)
        config.to_json_file(training_args.output_dir + "/config.json")

    # Training
    logger.info("Starting training")
    trainer.train()

    # The below does not save if state dict type is `SHARDED_STATE_DICT`
    trainer.save_model()

    # To be safe do another FS save
    if (trainer.is_fsdp_enabled) and (trainer.accelerator.state.fsdp_plugin.state_dict_type != "FULL_STATE_DICT"):
        trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
        fsd_path = os.path.join(training_args.output_dir, "full_state_dict")
        os.makedirs(fsd_path, exist_ok=True)
        trainer.save_model(fsd_path)

I might take a checkpoint out for deployment while training is still running, and deployment requires the config.json file. Will this change have any negative impact?

Muennighoff commented 3 days ago

I think that should be fine