YuvalCheung closed this 3 days ago
You need to cd into gritlm/training/GradCache and run pip install -e . to get this change: https://github.com/ContextualAI/gritlm/tree/main/gritlm/training/GradCache#preface-muennighoff
It looks like you installed GradCache from the upstream repo, but the version bundled in this repo is the one that needs to be installed.
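A quick way to check which copy of GradCache Python actually picks up is to look at where the module resolves from. This is only a minimal sketch: the helper name `installed_from` is hypothetical, and it assumes the package is importable as `grad_cache` and that the script is run from the root of the gritlm checkout.

```python
import importlib.util
from pathlib import Path


def installed_from(module_name, expected_dir):
    """Return True if module_name resolves to a file located under expected_dir."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        return False
    # Built-in, namespace, or missing modules have no file location to compare.
    if spec is None or not spec.has_location or spec.origin is None:
        return False
    origin = Path(spec.origin).resolve()
    return Path(expected_dir).resolve() in origin.parents


if __name__ == "__main__":
    # Assumption: the import name is `grad_cache` and the current directory
    # is the root of the gritlm repository.
    print(installed_from("grad_cache", "gritlm/training/GradCache"))
```

If this prints False after an editable install, the environment is probably still resolving a copy installed from elsewhere, and uninstalling that copy before reinstalling from the directory above may help.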
Adjusted the README a bit, lmk if this is better: https://github.com/ContextualAI/gritlm/commit/58ccad88e34e133f6a9680d9ce44854ad26a8c7c
Thank you for your help. After installing GradCache with the correct method, the previous error is no longer occurring, but I have encountered another error. Have you encountered similar errors before?
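To make the log below easier to read, the watchdog lines can be grouped by sequence number and collective type to see which ranks were waiting on which operation. This is a rough sketch, not part of gritlm or PyTorch: the function name `ranks_by_collective` is made up for illustration, and the regular expression only assumes the line format visible in this log.

```python
import re
from collections import defaultdict

# Matches the "Watchdog caught collective operation timeout" lines in this log.
WATCHDOG = re.compile(
    r"\[Rank (\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(\d+), OpType=(\w+)"
)


def ranks_by_collective(log_text):
    """Map each (SeqNum, OpType) pair to the sorted list of ranks that reported it."""
    groups = defaultdict(set)
    for match in WATCHDOG.finditer(log_text):
        rank, seq, op = match.groups()
        groups[(int(seq), op)].add(int(rank))
    return {key: sorted(ranks) for key, ranks in groups.items()}


# Two lines copied from the log, shortened after the OpType field.
sample = (
    "[default0]:[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught "
    "collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLREDUCE, NumelIn=1\n"
    "[default7]:[rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught "
    "collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1\n"
)

if __name__ == "__main__":
    for (seq, op), ranks in ranks_by_collective(sample).items():
        print(seq, op, ranks)
```

In this log, rank 0 reports ALLREDUCE while the other ranks shown report ALLGATHER for the same SeqNum. That suggests the ranks were not executing the same collective at that point, although the log alone does not say why.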
[default0]:{'loss': 3.384, 'learning_rate': 1.142857142857143e-06, 'epoch': 0.95}
[default0]: 95%|█████████▍| 35/37 [25:31<01:22, 41.20s/it][default0]:
[default0]: 97%|█████████▋| 36/37 [26:12<00:41, 41.16s/it][default0]:
[default0]:100%|██████████| 37/37 [26:58<00:00, 42.63s/it][default0]:07/01/2024 01:54:14 - INFO - accelerate.utils.fsdp_utils - Saving model to /mnt/home/zhangyu/output/test_mistral_2/tmp-checkpoint-37/pytorch_model_fsdp.bin
[default0]:07/01/2024 02:00:14 - INFO - accelerate.utils.fsdp_utils - Model saved to /mnt/home/zhangyu/output/test_mistral_2/tmp-checkpoint-37/pytorch_model_fsdp.bin
[default0]:07/01/2024 02:03:23 - INFO - accelerate.utils.fsdp_utils - Saving Optimizer state to /mnt/home/zhangyu/output/test_mistral_2/tmp-checkpoint-37/optimizer.bin
[default0]:07/01/2024 02:16:40 - INFO - accelerate.utils.fsdp_utils - Optimizer state saved in /mnt/home/zhangyu/output/test_mistral_2/tmp-checkpoint-37/optimizer.bin
[default0]:[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800033 milliseconds before timing out.
[default7]:[rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800003 milliseconds before timing out.
[default2]:[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default1]:[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800103 milliseconds before timing out.
[default6]:[rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800059 milliseconds before timing out.
[default5]:[rank5]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800114 milliseconds before timing out.
[default3]:[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default4]:[rank4]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default6]:[rank6]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 6] Timeout at NCCL work: 28044, last enqueued NCCL work: 28044, last completed NCCL work: 28043.
[default6]:[rank6]:[E ProcessGroupNCCL.cpp:577] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default6]:[rank6]:[E ProcessGroupNCCL.cpp:583] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[default6]:[rank6]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800059 milliseconds before timing out.
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2633db8897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f26350931b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2635097fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f263509931c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #4: <unknown function> + 0xd6de4 (0x7f2689d76de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default6]:frame #5: <unknown function> + 0x8609 (0x7f2692adc609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default6]:frame #6: clone + 0x43 (0x7f26928a7133 in /lib/x86_64-linux-gnu/libc.so.6)
[default6]:
[default6]:terminate called after throwing an instance of 'c10::DistBackendError'
[default6]: what(): [PG 0 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800059 milliseconds before timing out.
[default6]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2633db8897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default6]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f26350931b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2635097fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f263509931c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #4: <unknown function> + 0xd6de4 (0x7f2689d76de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default6]:frame #5: <unknown function> + 0x8609 (0x7f2692adc609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default6]:frame #6: clone + 0x43 (0x7f26928a7133 in /lib/x86_64-linux-gnu/libc.so.6)
[default6]:
[default6]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default6]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2633db8897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default6]:frame #1: <unknown function> + 0xe32e33 (0x7f2634d1be33 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default6]:frame #2: <unknown function> + 0xd6de4 (0x7f2689d76de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default6]:frame #3: <unknown function> + 0x8609 (0x7f2692adc609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default6]:frame #4: clone + 0x43 (0x7f26928a7133 in /lib/x86_64-linux-gnu/libc.so.6)
[default6]:
[default7]:07/01/2024 02:46:51 - WARNING - training.gradcache_trainer - Checkpoint destination directory /mnt/home/zhangyu/output/test_mistral_2/checkpoint-37 already exists and is non-empty.Saving will proceed but saved results may be invalid.
[default4]:07/01/2024 02:46:51 - WARNING - training.gradcache_trainer - Checkpoint destination directory /mnt/home/zhangyu/output/test_mistral_2/checkpoint-37 already exists and is non-empty.Saving will proceed but saved results may be invalid.
[default2]:07/01/2024 02:46:51 - WARNING - training.gradcache_trainer - Checkpoint destination directory /mnt/home/zhangyu/output/test_mistral_2/checkpoint-37 already exists and is non-empty.Saving will proceed but saved results may be invalid.
[default1]:07/01/2024 02:46:51 - WARNING - training.gradcache_trainer - Checkpoint destination directory /mnt/home/zhangyu/output/test_mistral_2/checkpoint-37 already exists and is non-empty.Saving will proceed but saved results may be invalid.
[default5]:07/01/2024 02:46:51 - WARNING - training.gradcache_trainer - Checkpoint destination directory /mnt/home/zhangyu/output/test_mistral_2/checkpoint-37 already exists and is non-empty.Saving will proceed but saved results may be invalid.
[default3]:07/01/2024 02:46:51 - WARNING - training.gradcache_trainer - Checkpoint destination directory /mnt/home/zhangyu/output/test_mistral_2/checkpoint-37 already exists and is non-empty.Saving will proceed but saved results may be invalid.
[default0]:[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 0] Timeout at NCCL work: 28044, last enqueued NCCL work: 28044, last completed NCCL work: 28043.
[default0]:[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default0]:[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[default0]:[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800033 milliseconds before timing out.
[default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbdafed3897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fbdb11ae1b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbdb11b2fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbdb11b431c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #4: <unknown function> + 0xd6de4 (0x7fbe05e95de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default0]:frame #5: <unknown function> + 0x8609 (0x7fbe0ebfb609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default0]:frame #6: clone + 0x43 (0x7fbe0e9c6133 in /lib/x86_64-linux-gnu/libc.so.6)
[default0]:
[default0]:terminate called after throwing an instance of 'c10::DistBackendError'
[default0]: what(): [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800033 milliseconds before timing out.
[default0]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbdafed3897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default0]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fbdb11ae1b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbdb11b2fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbdb11b431c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #4: <unknown function> + 0xd6de4 (0x7fbe05e95de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default0]:frame #5: <unknown function> + 0x8609 (0x7fbe0ebfb609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default0]:frame #6: clone + 0x43 (0x7fbe0e9c6133 in /lib/x86_64-linux-gnu/libc.so.6)
[default0]:
[default0]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default0]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbdafed3897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default0]:frame #1: <unknown function> + 0xe32e33 (0x7fbdb0e36e33 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default0]:frame #2: <unknown function> + 0xd6de4 (0x7fbe05e95de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default0]:frame #3: <unknown function> + 0x8609 (0x7fbe0ebfb609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default0]:frame #4: clone + 0x43 (0x7fbe0e9c6133 in /lib/x86_64-linux-gnu/libc.so.6)
[default0]:
[default1]:[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 28044, last enqueued NCCL work: 28044, last completed NCCL work: 28043.
[default1]:[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default1]:[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[default1]:[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800103 milliseconds before timing out.
[default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa5705e5897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fa5718c01b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fa5718c4fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa5718c631c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #4: <unknown function> + 0xd6de4 (0x7fa5c65a6de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default1]:frame #5: <unknown function> + 0x8609 (0x7fa5cf30c609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default1]:frame #6: clone + 0x43 (0x7fa5cf0d7133 in /lib/x86_64-linux-gnu/libc.so.6)
[default1]:
[default1]:terminate called after throwing an instance of 'c10::DistBackendError'
[default1]: what(): [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800103 milliseconds before timing out.
[default1]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa5705e5897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default1]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fa5718c01b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fa5718c4fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa5718c631c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #4: <unknown function> + 0xd6de4 (0x7fa5c65a6de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default1]:frame #5: <unknown function> + 0x8609 (0x7fa5cf30c609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default1]:frame #6: clone + 0x43 (0x7fa5cf0d7133 in /lib/x86_64-linux-gnu/libc.so.6)
[default1]:
[default1]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default1]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa5705e5897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default1]:frame #1: <unknown function> + 0xe32e33 (0x7fa571548e33 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default1]:frame #2: <unknown function> + 0xd6de4 (0x7fa5c65a6de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default1]:frame #3: <unknown function> + 0x8609 (0x7fa5cf30c609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default1]:frame #4: clone + 0x43 (0x7fa5cf0d7133 in /lib/x86_64-linux-gnu/libc.so.6)
[default1]:
[default7]:[rank7]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 7] Timeout at NCCL work: 28044, last enqueued NCCL work: 28044, last completed NCCL work: 28043.
[default7]:[rank7]:[E ProcessGroupNCCL.cpp:577] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default7]:[rank7]:[E ProcessGroupNCCL.cpp:583] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[default7]:[rank7]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800003 milliseconds before timing out.
[default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76b97ca897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f76baaa51b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f76baaa9fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f76baaab31c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #4: <unknown function> + 0xd6de4 (0x7f770f78ede4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default7]:frame #5: <unknown function> + 0x8609 (0x7f77184f4609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default7]:frame #6: clone + 0x43 (0x7f77182bf133 in /lib/x86_64-linux-gnu/libc.so.6)
[default7]:
[default7]:terminate called after throwing an instance of 'c10::DistBackendError'
[default7]: what(): [PG 0 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800003 milliseconds before timing out.
[default7]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76b97ca897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default7]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f76baaa51b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f76baaa9fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f76baaab31c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #4: <unknown function> + 0xd6de4 (0x7f770f78ede4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default7]:frame #5: <unknown function> + 0x8609 (0x7f77184f4609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default7]:frame #6: clone + 0x43 (0x7f77182bf133 in /lib/x86_64-linux-gnu/libc.so.6)
[default7]:
[default7]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default7]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f76b97ca897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default7]:frame #1: <unknown function> + 0xe32e33 (0x7f76ba72de33 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default7]:frame #2: <unknown function> + 0xd6de4 (0x7f770f78ede4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default7]:frame #3: <unknown function> + 0x8609 (0x7f77184f4609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default7]:frame #4: clone + 0x43 (0x7f77182bf133 in /lib/x86_64-linux-gnu/libc.so.6)
[default7]:
[default5]:[rank5]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 5] Timeout at NCCL work: 28044, last enqueued NCCL work: 28044, last completed NCCL work: 28043.
[default5]:[rank5]:[E ProcessGroupNCCL.cpp:577] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default5]:[rank5]:[E ProcessGroupNCCL.cpp:583] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
[default5]:[rank5]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800114 milliseconds before timing out.
[default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fad4e413897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fad4f6ee1b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fad4f6f2fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fad4f6f431c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #4: <unknown function> + 0xd6de4 (0x7fada43d9de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default5]:frame #5: <unknown function> + 0x8609 (0x7fadad13f609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default5]:frame #6: clone + 0x43 (0x7fadacf0a133 in /lib/x86_64-linux-gnu/libc.so.6)
[default5]:
[default5]:terminate called after throwing an instance of 'c10::DistBackendError'
[default5]: what(): [PG 0 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800114 milliseconds before timing out.
[default5]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fad4e413897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default5]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fad4f6ee1b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fad4f6f2fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fad4f6f431c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #4: <unknown function> + 0xd6de4 (0x7fada43d9de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default5]:frame #5: <unknown function> + 0x8609 (0x7fadad13f609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default5]:frame #6: clone + 0x43 (0x7fadacf0a133 in /lib/x86_64-linux-gnu/libc.so.6)
[default5]:
[default5]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default5]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fad4e413897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default5]:frame #1: <unknown function> + 0xe32e33 (0x7fad4f376e33 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default5]:frame #2: <unknown function> + 0xd6de4 (0x7fada43d9de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default5]:frame #3: <unknown function> + 0x8609 (0x7fadad13f609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default5]:frame #4: clone + 0x43 (0x7fadacf0a133 in /lib/x86_64-linux-gnu/libc.so.6)
[default5]:
[default4]:[rank4]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 4] Timeout at NCCL work: 28044, last enqueued NCCL work: 28044, last completed NCCL work: 28043.
[default4]:[rank4]:[E ProcessGroupNCCL.cpp:577] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default4]:[rank4]:[E ProcessGroupNCCL.cpp:583] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[default4]:[rank4]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f05ffc95897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f0600f701b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0600f74fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0600f7631c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #4: <unknown function> + 0xd6de4 (0x7f0655c58de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default4]:frame #5: <unknown function> + 0x8609 (0x7f065e9be609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default4]:frame #6: clone + 0x43 (0x7f065e789133 in /lib/x86_64-linux-gnu/libc.so.6)
[default4]:
[default4]:terminate called after throwing an instance of 'c10::DistBackendError'
[default4]: what(): [PG 0 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default4]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f05ffc95897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default4]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f0600f701b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f0600f74fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0600f7631c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #4: <unknown function> + 0xd6de4 (0x7f0655c58de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default4]:frame #5: <unknown function> + 0x8609 (0x7f065e9be609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default4]:frame #6: clone + 0x43 (0x7f065e789133 in /lib/x86_64-linux-gnu/libc.so.6)
[default4]:
[default4]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default4]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f05ffc95897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default4]:frame #1: <unknown function> + 0xe32e33 (0x7f0600bf8e33 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default4]:frame #2: <unknown function> + 0xd6de4 (0x7f0655c58de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default4]:frame #3: <unknown function> + 0x8609 (0x7f065e9be609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default4]:frame #4: clone + 0x43 (0x7f065e789133 in /lib/x86_64-linux-gnu/libc.so.6)
[default4]:
[default3]:[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 3] Timeout at NCCL work: 28044, last enqueued NCCL work: 28044, last completed NCCL work: 28043.
[default3]:[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default3]:[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[default3]:[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8f32344897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8f3361f1b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8f33623fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8f3362531c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #4: <unknown function> + 0xd6de4 (0x7f8f88347de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default3]:frame #5: <unknown function> + 0x8609 (0x7f8f910ad609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default3]:frame #6: clone + 0x43 (0x7f8f90e78133 in /lib/x86_64-linux-gnu/libc.so.6)
[default3]:
[default3]:terminate called after throwing an instance of 'c10::DistBackendError'
[default3]: what(): [PG 0 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default3]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8f32344897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default3]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8f3361f1b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8f33623fd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8f3362531c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #4: <unknown function> + 0xd6de4 (0x7f8f88347de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default3]:frame #5: <unknown function> + 0x8609 (0x7f8f910ad609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default3]:frame #6: clone + 0x43 (0x7f8f90e78133 in /lib/x86_64-linux-gnu/libc.so.6)
[default3]:
[default3]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default3]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8f32344897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default3]:frame #1: <unknown function> + 0xe32e33 (0x7f8f332a7e33 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default3]:frame #2: <unknown function> + 0xd6de4 (0x7f8f88347de4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default3]:frame #3: <unknown function> + 0x8609 (0x7f8f910ad609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default3]:frame #4: clone + 0x43 (0x7f8f90e78133 in /lib/x86_64-linux-gnu/libc.so.6)
[default3]:
[default2]:[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 28044, last enqueued NCCL work: 28044, last completed NCCL work: 28043.
[default2]:[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[default2]:[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[default2]:[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8a75ed0897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8a771ab1b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8a771affd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8a771b131c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default2]:frame #4: <unknown function> + 0xd6de4 (0x7f8acbe8fde4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default2]:frame #5: <unknown function> + 0x8609 (0x7f8ad4bf5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default2]:frame #6: clone + 0x43 (0x7f8ad49c0133 in /lib/x86_64-linux-gnu/libc.so.6)
[default2]:
[default2]:terminate called after throwing an instance of 'c10::DistBackendError'
[default2]: what(): [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800092 milliseconds before timing out.
[default2]:Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
[default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8a75ed0897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default2]:frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f8a771ab1b2 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default2]:frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f8a771affd0 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default2]:frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f8a771b131c in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default2]:frame #4: <unknown function> + 0xd6de4 (0x7f8acbe8fde4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default2]:frame #5: <unknown function> + 0x8609 (0x7f8ad4bf5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default2]:frame #6: clone + 0x43 (0x7f8ad49c0133 in /lib/x86_64-linux-gnu/libc.so.6)
[default2]:
[default2]:Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
[default2]:frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8a75ed0897 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libc10.so)
[default2]:frame #1: <unknown function> + 0xe32e33 (0x7f8a76e33e33 in /usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
[default2]:frame #2: <unknown function> + 0xd6de4 (0x7f8acbe8fde4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
[default2]:frame #3: <unknown function> + 0x8609 (0x7f8ad4bf5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
[default2]:frame #4: clone + 0x43 (0x7f8ad49c0133 in /lib/x86_64-linux-gnu/libc.so.6)
[default2]:
W0701 02:46:52.647000 139689317217472 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 771494 closing signal SIGTERM
W0701 02:46:52.647000 139689317217472 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 771495 closing signal SIGTERM
W0701 02:46:52.647000 139689317217472 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 771496 closing signal SIGTERM
W0701 02:46:52.647000 139689317217472 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 771497 closing signal SIGTERM
W0701 02:46:52.647000 139689317217472 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 771498 closing signal SIGTERM
W0701 02:46:52.648000 139689317217472 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 771500 closing signal SIGTERM
W0701 02:46:52.648000 139689317217472 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 771501 closing signal SIGTERM
E0701 02:46:54.685000 139689317217472 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -6) local_rank: 5 (pid: 771499) of binary: /usr/local/miniconda3/envs/gritlm/bin/python
Traceback (most recent call last):
File "/usr/local/miniconda3/envs/gritlm/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1084, in launch_command
multi_gpu_launcher(args)
File "/usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/miniconda3/envs/gritlm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
training.run FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-01_02:46:52
host : ctmt240625013845lar-558799cd4d-sdndz
rank : 5 (local_rank: 5)
exitcode : -6 (pid: 771499)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 771499
=======================================================
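For anyone triaging a log like the one above, here is a minimal sketch that pulls the rank, sequence number and collective type out of each watchdog line and groups them by sequence number. The names `WATCHDOG`, `summarize_timeouts` and `mismatched` are made up for this illustration, and the regular expression assumes the exact message wording shown in this thread, which may differ between PyTorch versions.

```python
import re
from collections import defaultdict

# Assumes the "Watchdog caught collective operation timeout" wording seen in this log.
WATCHDOG = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+)"
)

def summarize_timeouts(lines):
    """Group timed-out collectives as {sequence number: {rank: op type}}."""
    by_seq = defaultdict(dict)
    for line in lines:
        match = WATCHDOG.search(line)
        if match:
            by_seq[int(match["seq"])][int(match["rank"])] = match["op"]
    return dict(by_seq)

def mismatched(by_seq):
    """Sequence numbers at which ranks were waiting on different collective types."""
    return [seq for seq, ops in by_seq.items() if len(set(ops.values())) > 1]

# Two lines copied from the log in this thread (rank 0 and rank 3).
sample = [
    "[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=28044, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) "
    "ran for 1800033 milliseconds before timing out.",
    "[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 3] Process group watchdog thread terminated with exception: "
    "[Rank 3] Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=28044, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) "
    "ran for 1800092 milliseconds before timing out.",
]

summary = summarize_timeouts(sample)
print(summary)
print(mismatched(summary))
```

On the two quoted lines, rank 0 reports ALLREDUCE while rank 3 reports ALLGATHER for SeqNum 28044. Ranks waiting on different collectives at the same sequence number can be a hint that they took different code paths around the checkpoint save, but that is only a diagnostic lead, not a confirmed cause for this run.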
I think this is the timeout issue listed at the top here: https://github.com/ContextualAI/gritlm?tab=readme-ov-file#known-issues
The situation described in the Known issues section is identical to what I'm encountering. I'm currently using a single node with 8 GPUs (1*8). Could you please advise me on how to make sure the saving process isn't terminated?
Also, I have a question about the file structure of the model I generated after training. The file structure is as follows:
(gritlm) root@ctmt240625013845lar-558799cd4d-sdndz:/mnt/home/zhangyu/output/test_mistral_2# tree
.
├── checkpoint-37
│ ├── optimizer.bin
│ ├── pytorch_model.bin
│ ├── pytorch_model_fsdp.bin
│ ├── rng_state_0.pth
│ ├── rng_state_1.pth
│ ├── rng_state_2.pth
│ ├── rng_state_3.pth
│ ├── rng_state_4.pth
│ ├── rng_state_5.pth
│ ├── rng_state_6.pth
│ ├── rng_state_7.pth
│ ├── scheduler.pt
│ ├── special_tokens_map.json
│ ├── tokenizer.json
│ ├── tokenizer.model
│ ├── tokenizer_config.json
│ ├── trainer_state.json
│ └── training_args.bin
├── dataset_num_samples.json
└── runs
└── Jul01_01-10-06_ctmt240625013845lar-558799cd4d-sdndz
└── events.out.tfevents.1719767875.ctmt240625013845lar-558799cd4d-sdndz.771494.0
3 directories, 20 files
It seems quite different from the file structure in the link https://huggingface.co/GritLM/GritLM-7B/tree/main:
# tree
.
├── README.md
├── config.json
├── dataset_num_samples.json
├── generation_config.json
├── model-00001-of-00003.safetensors
├── model-00002-of-00003.safetensors
├── model-00003-of-00003.safetensors
├── model.safetensors.index.json
├── modeling_gritlm7b.py
├── pytorch_model-00001-of-00003.bin
├── pytorch_model-00002-of-00003.bin
├── pytorch_model-00003-of-00003.bin
├── pytorch_model.bin.index.json
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
├── tokenizer_config.json
└── training_args.bin
0 directories, 18 files
Thank you for your help.
1) Sorry I don't know how to solve it besides what is mentioned in the Known issues section.
2) We shard the ckpt via main/scripts/shard.py for easier usage. Added this to the README.
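As a rough illustration of what sharding a checkpoint involves (this is not the contents of scripts/shard.py, which I have not reproduced here), the sketch below greedily groups tensors into shards under a size limit and builds an index dict in the same shape as a `model.safetensors.index.json` file. The function names and the byte sizes are hypothetical. In practice, `save_pretrained(..., max_shard_size=...)` in transformers does the grouping and writing for you.

```python
def plan_shards(tensor_sizes, max_shard_bytes):
    """Greedily assign tensors (name -> size in bytes) to shards, keeping their order."""
    shards, current, current_bytes = [], [], 0
    for name, size in tensor_sizes.items():
        # Start a new shard when adding this tensor would exceed the limit.
        if current and current_bytes + size > max_shard_bytes:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        shards.append(current)
    return shards

def build_index(shards, tensor_sizes):
    """Build an index dict mapping every tensor name to the shard file that holds it."""
    total = len(shards)
    weight_map = {}
    for i, names in enumerate(shards, start=1):
        filename = f"model-{i:05d}-of-{total:05d}.safetensors"
        for name in names:
            weight_map[name] = filename
    return {
        "metadata": {"total_size": sum(tensor_sizes.values())},
        "weight_map": weight_map,
    }

# Hypothetical sizes, chosen only to keep the example small.
sizes = {
    "model.embed_tokens.weight": 4,
    "model.layers.0.mlp.up_proj.weight": 4,
    "model.norm.weight": 4,
    "lm_head.weight": 2,
}
shards = plan_shards(sizes, max_shard_bytes=8)
index = build_index(shards, sizes)
print(index["weight_map"])
```

The file names follow the `model-00001-of-00003.safetensors` pattern visible in the GritLM-7B listing earlier in the thread. A tensor larger than the limit simply ends up in a shard of its own.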
Thank you for your help.
When I don't set no_emb_gas and no_gen_gas to True, the nccl timeout issue disappears. Should these two options have no effect on the model's capabilities?
Also, after training the model, I obtained these files:
(gritlm) root@ctmt240625013845lar-558799cd4d-sdndz:/mnt/home/zhangyu/output/test_mistral_13# tree .
.
├── checkpoint-1
│ ├── model.safetensors
│ ├── special_tokens_map.json
│ ├── tokenizer.json
│ ├── tokenizer.model
│ ├── tokenizer_config.json
│ ├── trainer_state.json
│ └── training_args.bin
├── checkpoint-2
│ ├── model.safetensors
│ ├── special_tokens_map.json
│ ├── tokenizer.json
│ ├── tokenizer.model
│ ├── tokenizer_config.json
│ ├── trainer_state.json
│ └── training_args.bin
├── config.json
├── dataset_num_samples.json
├── full_state_dict
│ ├── model.safetensors
│ ├── special_tokens_map.json
│ ├── tokenizer.json
│ ├── tokenizer.model
│ ├── tokenizer_config.json
│ └── training_args.bin
├── model.safetensors
├── runs
│ └── Jul02_00-46-47_ctmt240625013845lar-558799cd4d-sdndz
│ └── events.out.tfevents.1719852887.ctmt240625013845lar-558799cd4d-sdndz.1341295.0
Then I copied the config.json file to the checkpoint-1 directory.
When deploying the model using vllm, an error occurs:
vllm deployment script:
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server \
--served-model-name ZTEAIM-Gritm-Base \
--model "/mnt/home/zhangyu/output/test_mistral_13/checkpoint-1" \
--port 6000 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.95 \
--dtype bfloat16 \
--max-model-len 4096 \
--api-key 10344626
I received the following error message:
(vllm) root@ctmt240625013845lar-558799cd4d-sdndz:/mnt/home/zhangyu/model_deployment# bash launch_gritlm.sh
INFO 07-02 10:03:12 api_server.py:209] args: Namespace(host=None, port=6000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='10344626', served_model_name='Gritm-Base', chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='/mnt/home/zhangyu/output/test_mistral_13/checkpoint-1', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='bfloat16', kv_cache_dtype='auto', max_model_len=4096, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.95, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='cuda', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-02 10:03:13 llm_engine.py:79] Initializing an LLM engine with config: model='/mnt/home/zhangyu/output/test_mistral_13/checkpoint-1', tokenizer='/mnt/home/zhangyu/output/test_mistral_13/checkpoint-1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Traceback (most recent call last):
File "/usr/local/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 217, in <module>
engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 625, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 321, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 366, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 120, in __init__
self._init_workers()
File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 164, in _init_workers
self._run_workers("load_model")
File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1012, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 102, in load_model
self.model_runner.load_model()
File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 84, in load_model
self.model = get_model(self.model_config, self.device_config,
File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader.py", line 86, in get_model
model.load_weights(model_config.model, model_config.download_dir,
File "/usr/local/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/mistral.py", line 374, in load_weights
param = params_dict[name]
KeyError: 'model.lm_head.weight'
If I change the model I'm deploying to GritLM-7B, then the deployment is successful:
(vllm) root@ctmt240625013845lar-558799cd4d-sdndz:/mnt/home/zhangyu/model_deployment# bash launch_gritlm.sh
INFO 07-02 10:04:54 api_server.py:209] args: Namespace(host=None, port=6000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='10344626', served_model_name='Gritm-Base', chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='/mnt/tenant-home_speed/AIM/model/GritLM-7B', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='bfloat16', kv_cache_dtype='auto', max_model_len=4096, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.95, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='cuda', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 07-02 10:04:54 llm_engine.py:79] Initializing an LLM engine with config: model='/mnt/tenant-home_speed/AIM/model/GritLM-7B', tokenizer='/mnt/tenant-home_speed/AIM/model/GritLM-7B', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 07-02 10:05:01 llm_engine.py:337] # GPU blocks: 31287, # CPU blocks: 2048
INFO 07-02 10:05:06 model_runner.py:666] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-02 10:05:06 model_runner.py:670] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-02 10:05:09 model_runner.py:738] Graph capturing finished in 4 secs.
INFO 07-02 10:05:10 serving_chat.py:260] Using default chat template:
INFO 07-02 10:05:10 serving_chat.py:260] {{ bos_token }}{% for message in messages %}
INFO 07-02 10:05:10 serving_chat.py:260] {% if message['role'] == 'user' %}
INFO 07-02 10:05:10 serving_chat.py:260] {{ '<|user|>
INFO 07-02 10:05:10 serving_chat.py:260] ' + message['content'] }}
INFO 07-02 10:05:10 serving_chat.py:260] {% elif message['role'] == 'assistant' %}
INFO 07-02 10:05:10 serving_chat.py:260] {{ '<|assistant|>
INFO 07-02 10:05:10 serving_chat.py:260] ' + message['content'] + eos_token }}
INFO 07-02 10:05:10 serving_chat.py:260] {% endif %}
INFO 07-02 10:05:10 serving_chat.py:260] {% if loop.last and add_generation_prompt %}
INFO 07-02 10:05:10 serving_chat.py:260] {{ '<|assistant|>' }}
INFO 07-02 10:05:10 serving_chat.py:260] {% endif %}
INFO 07-02 10:05:10 serving_chat.py:260] {% endfor %}
INFO: Started server process [1554267]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:6000 (Press CTRL+C to quit)
INFO 07-02 10:05:20 metrics.py:161] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 07-02 10:05:30 metrics.py:161] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
Is there something missing from the model I trained? Here is my training script:
#!/bin/bash
#SBATCH --job-name=gritlm
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node!
#SBATCH --hint=nomultithread # we get physical cores not logical
#SBATCH --partition=a3
#SBATCH --gres=gpu:8 # number of gpus
#SBATCH --time 999:00:00 # maximum execution time (HH:MM:SS)
#SBATCH --output=/data/niklas/jobs/%x-%j.out # output file name
#SBATCH --exclusive
######################
### Set environment ###
######################
cd /mnt/home/zhangyu/gritlm-main/gritlm
export WANDB_PROJECT="gritlm"
#NCCL_ASYNC_ERROR_HANDLING=1
export WANDB_MODE="offline"
# so processes know who to talk to
GPUS_PER_NODE=8
NNODES=1
NODE_RANK=0
MASTER_PORT=6050
MASTER_ADDR=localhost
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
######################
# OUT_DIR="/mnt/home/zhangyu/output/test_llama3_8b_7"
# OUT_DIR="/mnt/home/zhangyu/output/test_qwen2_10"
OUT_DIR="/mnt/home/zhangyu/output/test_mistral_14"
# MODEL="/mnt/tenant-home_speed/AIM/model/qwen2_7B_chat"
# MODEL="/mnt/tenant-home_speed/AIM/model/llama3-8b-Instruct"
MODEL="/mnt/tenant-home_speed/AIM/model/Mistral-7B-Instruct-v0.1"
DATA_DIR="training/toy_data_instruct"
# YMLPATH="/mnt/home/zhangyu/gritlm-main/scripts/configs/config_8gpusfsdp_m7_qwen.yml"
# YMLPATH="/mnt/home/zhangyu/gritlm-main/scripts/configs/config_8gpusfsdp_m7_llama.yml"
YMLPATH="/mnt/home/zhangyu/gritlm-main/scripts/configs/config_8gpusfsdp_m7.yml"
# YMLPATH="/mnt/home/zhangyu/gritlm-main/scripts/configs/config_8gpusddp_m7.yml"
LAUNCHER="accelerate launch \
--config_file $YMLPATH \
--num_machines $NNODES \
--num_processes $WORLD_SIZE \
--main_process_ip "$MASTER_ADDR" \
--main_process_port $MASTER_PORT \
--machine_rank $NODE_RANK \
--role $SLURMD_NODENAME: \
--rdzv_conf rdzv_backend=c10d \
--max_restarts 0 \
--tee 3 \
"
export CMD=" \
-m training.run \
--output_dir $OUT_DIR \
--model_name_or_path $MODEL \
--train_data $DATA_DIR \
--learning_rate 2e-5 \
--lr_scheduler_type linear \
--warmup_ratio 0.03 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--dataloader_drop_last \
--normalized \
--temperature 0.02 \
--train_group_size 2 \
--negatives_cross_device \
--query_max_len 256 \
--passage_max_len 2048 \
--mode unified \
--logging_steps 1 \
--bf16 \
--pooling_method mean \
--use_unique_indices \
--loss_gen_factor 0.003 \
--loss_gen_type token \
--attn bbcc \
--attn_implementation sdpa \
--gradient_checkpointing \
--report_to "tensorboard" \
--save_strategy "epoch" \
--save_steps 1 \
--save_only_model \
--save_safetensors \
--max_steps 1500 \
--ddp_backend gloo \
--num_train_epochs 1
"
SRUN_ARGS=" \
--wait=60 \
--kill-on-bad-exit=1 \
"
# --no_gen_gas \
# --no_emb_gas \
# --no_gen_gas \
# --split_emb \
# --split_emb_full \
# clear; srun $SRUN_ARGS --jobid $SLURM_JOB_ID bash -c "$LAUNCHER $CMD" 2>&1
# --max_steps 1253 \
# --save_strategy "epoch" \
bash -c "$LAUNCHER $CMD"
Thank you again for answering my questions.
When I don't set no_emb_gas and no_gen_gas to True, the nccl timeout issue disappears. Should these two options have no effect on the model's capabilities?
It should not impact capabilities. It may have something to do with the trainer, since those options cause a different trainer to be used, as explained here: https://github.com/ContextualAI/gritlm/blob/0cc9aeab83b90f2e22bcdd2b084d51507c624d95/gritlm/training/arguments.py#L128C27-L129C48
Is there something missing from the model I trained? Here is my training script:
I don't notice a major problem. I would check the safetensors file and compare its keys with the keys of the GritLM-7B model files. Just load them each in memory and check that they have the exact same keys.
Also FYI GritLM-7B was trained from the base mistral 7b not the instruct version like you are doing, but I don't think it matters a lot.
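The key comparison suggested above could be sketched roughly as follows. `diff_keys`, `extra_prefix` and `load_keys` are hypothetical helper names. `load_keys` assumes the `safetensors` package is installed and is only defined here, not called. The example names are taken from parameter names that appear in this thread.

```python
def diff_keys(mine, reference):
    """Report parameter names that appear on only one side."""
    mine, reference = set(mine), set(reference)
    return {
        "only_in_mine": sorted(mine - reference),
        "only_in_reference": sorted(reference - mine),
    }

def extra_prefix(mine, reference):
    """Return a prefix whose removal from every name in `mine` yields `reference`, else None."""
    mine, reference = set(mine), set(reference)
    if not mine:
        return None
    probe = min(mine)  # any name works as a starting point
    for ref in reference:
        if probe != ref and probe.endswith(ref):
            prefix = probe[: len(probe) - len(ref)]
            stripped = {name[len(prefix):] for name in mine}
            if all(name.startswith(prefix) for name in mine) and stripped == reference:
                return prefix
    return None

def load_keys(path):
    """Read tensor names from a .safetensors file (requires the safetensors package)."""
    from safetensors import safe_open  # imported lazily so the helpers above need no extras
    with safe_open(path, framework="pt") as f:
        return list(f.keys())

# Small in-memory example instead of real checkpoint files.
trained = ["model.model.norm.weight", "model.lm_head.weight"]
expected = ["model.norm.weight", "lm_head.weight"]
print(diff_keys(trained, expected))
print(extra_prefix(trained, expected))
```

When the two checkpoints differ only by a leading prefix, `extra_prefix` names it directly, which is quicker to read than two long lists of mismatched keys.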
I think I know why the model loading failed. I used the following code to inspect the generated .safetensors file:
from safetensors import safe_open

st_file = "/mnt/home/zhangyu/output/test_mistral_16/checkpoint-1/model.safetensors"
# List every tensor name stored in the checkpoint.
with safe_open(st_file, framework="pt") as f:
    for name in f.keys():
        param = f.get_tensor(name)
        print(name)
I found that the format of these names in my model is like this:
......
model.model.layers.8.post_attention_layernorm.weight
model.model.layers.8.self_attn.k_proj.weight
model.model.layers.8.self_attn.o_proj.weight
model.model.layers.8.self_attn.q_proj.weight
model.model.layers.8.self_attn.v_proj.weight
model.model.layers.9.input_layernorm.weight
model.model.layers.9.mlp.down_proj.weight
model.model.layers.9.mlp.gate_proj.weight
model.model.layers.9.mlp.up_proj.weight
model.model.layers.9.post_attention_layernorm.weight
model.model.layers.9.self_attn.k_proj.weight
model.model.layers.9.self_attn.o_proj.weight
model.model.layers.9.self_attn.q_proj.weight
model.model.layers.9.self_attn.v_proj.weight
model.model.norm.weight
......
But when I used the same method to inspect the gritlm7b model, and the qwen2 model, the output looked like this:
model.layers.8.self_attn.q_proj.bias
model.layers.8.self_attn.q_proj.weight
model.layers.8.self_attn.v_proj.bias
model.layers.8.self_attn.v_proj.weight
model.layers.9.mlp.gate_proj.weight
model.layers.9.mlp.up_proj.weight
model.layers.9.self_attn.k_proj.bias
model.layers.9.self_attn.k_proj.weight
model.layers.9.self_attn.o_proj.weight
model.layers.9.self_attn.q_proj.bias
model.layers.9.self_attn.q_proj.weight
model.layers.9.self_attn.v_proj.bias
model.layers.9.self_attn.v_proj.weight
Obviously, the model I trained has an extra "model." prefix in the names.
I'm not sure what parameter is causing this phenomenon; it could be an environment issue or a library version problem. The only solution I can think of right now is to remove the prefix from the names after training the model with code like this. It's a silly method, but it should work.
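import sys
from safetensors import safe_open
from safetensors.torch import save_file

# st_file = "/mnt/home/zhangyu/output/test_mistral_17/checkpoint-1/model.safetensors"
st_file = sys.argv[1]
new_file = "/mnt/home/zhangyu/output/test/model.safetensors"
prefix = "model."  # the extra leading prefix to remove (6 characters)
save_dict = {}
with safe_open(st_file, framework="pt") as f:
    for name in f.keys():
        param = f.get_tensor(name)
        print(name)
        # Strip the prefix only when it is present, instead of always cutting 6 characters.
        new_name = name[len(prefix):] if name.startswith(prefix) else name
        save_dict[new_name] = param
# The metadata mirrors what transformers writes when it saves safetensors files.
save_file(save_dict, new_file, metadata={"format": "pt"})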
I don't know if others will encounter this problem, but I will investigate the cause of this issue later. For now, I'll focus on getting the project to run. Thank you for patiently answering my questions.
Oh yes I think that is expected & there is a script for that here: https://github.com/ContextualAI/gritlm/blob/main/scripts/reformat_statedict.py
I've added it to the README, sorry!
Thank you so much for your answer!
I have one more question: Is it possible for me to move the operation that saves the config.json in https://github.com/ContextualAI/gritlm/blob/main/gritlm/training/run.py to before the training starts?
The modified code looks like this:
# Save tokenizer & config for easy usage afterwards
if trainer.is_world_process_zero():
    tokenizer.save_pretrained(training_args.output_dir)
    config.to_json_file(training_args.output_dir + "/config.json")
# Training
logger.info("Starting training")
trainer.train()
# The below does not save if state dict type is `SHARDED_STATE_DICT`
trainer.save_model()
# To be safe do another FS save
if (trainer.is_fsdp_enabled) and (trainer.accelerator.state.fsdp_plugin.state_dict_type != "FULL_STATE_DICT"):
    trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
    fsd_path = os.path.join(training_args.output_dir, "full_state_dict")
    os.makedirs(fsd_path, exist_ok=True)
    trainer.save_model(fsd_path)
I might take a checkpoint from the middle of training for deployment, and the deployment requires the config.json file. Will there be any impact if I make this change?
I think that should be fine
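For a run that has already finished, the same goal can be reached after the fact by copying config.json from the output directory into each checkpoint directory, as was done by hand for checkpoint-1 earlier in this thread. A minimal sketch, assuming the `checkpoint-*` layout shown above; the function name `copy_config_to_checkpoints` is hypothetical and not part of the gritlm repo.

```python
import shutil
from pathlib import Path

def copy_config_to_checkpoints(output_dir, filename="config.json"):
    """Copy output_dir/config.json into every checkpoint-* directory that lacks one."""
    output_dir = Path(output_dir)
    source = output_dir / filename
    if not source.is_file():
        raise FileNotFoundError(f"{source} does not exist")
    copied = []
    for ckpt in sorted(output_dir.glob("checkpoint-*")):
        target = ckpt / filename
        # Skip plain files named checkpoint-* and never overwrite an existing config.
        if ckpt.is_dir() and not target.exists():
            shutil.copy2(source, target)
            copied.append(target)
    return copied
```

Calling `copy_config_to_checkpoints("/path/to/output_dir")` returns the list of files it created, so it is safe to run repeatedly while training is still producing new checkpoints.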
When I executed bash train_gritlm_7b.sh, I encountered the following error:
In order to train the model with qwen2, I modified the train_gritlm_7b.sh file:
And in the config_8gpusfsdp_m7_qwen.yml file, I set fsdp_transformer_layer_cls_to_wrap to Qwen2DecoderLayer.
At the same time, I modified the /lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py file into modeling_qwen2_gritlm.py according to the differences between modeling_mistral_gritlm.py and modeling_mistral.py, and overrode the former according to the requirements of the gritlm project.
Here is my modeling_qwen2.py file: modeling_qwen2_gritlm.py.zip
Finally, when I executed the command bash train_gritlm_7b.sh, I encountered an error at the beginning, and here is the log record: log.txt
Now I am curious whether there is an issue with my .sh script settings or with the modifications made to the qwen2 model. Thank you very much for your assistance.