RuntimeError: NCCL Error: invalid argument
This error is most likely caused by a mismatch between the CUDA version PyTorch was built with and the CUDA version installed on the machine.
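For reference, one quick way to compare the two (a minimal sketch, not from the repo; the reported versions depend entirely on your environment):

```python
# Minimal environment check: does the CUDA version PyTorch was built with
# match the toolkit/driver installed on the machine?
import subprocess
import torch

print("PyTorch version        :", torch.__version__)
print("PyTorch built with CUDA:", torch.version.cuda)
print("CUDA available         :", torch.cuda.is_available())
print("Visible GPUs           :", torch.cuda.device_count())

# System-side view; `nvidia-smi` additionally reports the highest CUDA
# version the installed driver supports.
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```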
That was indeed the problem. After switching versions it now runs (on 7 GPUs). One question, though: at https://github.com/OpenBMB/CPM-Live/blob/cpm-ant-plus/cpm-live/examples/tune.py#L136 I added the line bmt.print_rank("{}:{}".format(global_step, bmt.rank())). My understanding is that bmt.rank() is the GPU index, but the output only ever shows rank 0 instead of the expected 0, 1, 2, 3, 4, 5, 6, 7, even though all 7 GPUs are clearly in use. How should I interpret this? Thanks. Output below:
bmtrain.print_rank only prints on rank 0. If you want output from every rank, use the plain print instead.
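A minimal sketch of the difference (assuming BMTrain is initialized the same way as in tune.py):

```python
import bmtrain as bmt

bmt.init_distributed(seed=0)  # one process per GPU, started by the torch launcher

# bmt.print_rank: executed by every process, but only rank 0 actually prints,
# which is why the output always shows rank 0.
bmt.print_rank("global_step {} on rank {}".format(0, bmt.rank()))

# plain print: every process prints, so you get one line per GPU (0, 1, 2, ...).
print("global_step {} on rank {}".format(0, bmt.rank()))
```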
Got it, thanks. Looking at the fine-tuning code, it trains with LoRA, i.e. most layers are frozen and only the newly added delta layers are updated. If I want to update all model parameters instead, what should I change? Is it enough to comment out the delta_model part (cpm_ant_plus/CPM-Live/cpm-live/examples/tune_cpm_ant.py)?
I tried this on KdConv_film. Training with LoRA gives decent results, but updating all parameters gives very poor results. Is there something wrong with that change (just commenting out the delta_model part)?
That change itself is fine. Could you check whether the LoRA part in infer_cpm_ant.py was left in (i.e. not commented out)?
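For reference, the switch looks roughly like this (a sketch rather than the exact tune_cpm_ant.py code; get_model is a hypothetical stand-in for however the script builds the CPM-Ant model, and the LoraModel arguments follow OpenDelta's interface):

```python
from opendelta import LoraModel  # OpenDelta

model = get_model(args)  # hypothetical: the CPM-Ant(+) model built earlier in the script

# --- LoRA tuning (the default in the example) ---
# Inject LoRA modules next to the attention projections and freeze everything
# except the newly added deltas.
delta_model = LoraModel(backbone_model=model, modified_modules=["project_q", "project_v"])
delta_model.freeze_module(exclude=["deltas"], set_state_dict=True)

# --- Full-parameter tuning ---
# Comment out the two delta_model lines above: with no freeze_module call,
# every backbone parameter keeps requires_grad=True and the optimizer updates
# the whole model. Full tuning generally needs a much smaller learning rate
# than the 5e-3 used for LoRA in these scripts.
```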
It is commented out, as follows:
Line 26 should not be commented out; without it the fine-tuned checkpoint (best.pt) is never loaded.
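In other words, the inference script still needs the checkpoint-loading step, roughly as sketched here (the class names follow cpm_live.models, the paths are placeholders, and bmt.load assumes the checkpoint was saved with bmt.save):

```python
import bmtrain as bmt
from cpm_live.models import CPMAnt, CPMAntConfig

bmt.init_distributed(seed=0)

config = CPMAntConfig.from_json_file("path/to/cpm-ant-plus-10b.json")  # placeholder path
model = CPMAnt(config)

# Load the checkpoint written by the tuning run. If this line is commented
# out, inference silently runs on the original pre-trained weights and the
# fine-tuned parameters are never used.
bmt.load(model, "path/to/output/best.pt")  # placeholder path
```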
Running inference with cpm_ant_plus (i.e. text_generation.py) works fine, but running scripts/CCPM_ddp.sh fails with the error below.
python3 -m torch.distributed.launch --master_addr localhost --master_port 1234 --nproc_per_node 2 --nnodes 1 tune_cpm_ant.py --dataset-name CCPM --dataset-path cpm_ant_plus/CPM-Live/cpm-live/examples/data/oss_cuge/CCPM --output-path cpm_ant_plus/CPM-Live/cpm-live/examples/fintune_model/CCPM --model-path cpm_ant_plus/CPM-Live/cpm-live/model/cpm-ant-plus-10b.pt --config-path cpm_ant_plus/CPM-Live/cpm-live/model/cpm-ant-plus-10b.json --batch-size 32 --early-stop-patience 10 --eval-interval 50 --tune-maxlen 256 --lr 5e-3 --warmup-iters 50 --epochs 20 --infer-maxlen 1

/root/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torch.distributed.run. Note that --use_env is set by default in torch.distributed.run. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run:***** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
/root/anaconda3/lib/python3.8/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.14) or chardet (2.1.1)/charset_normalizer (2.1.1) doesn't match a supported version!
  warnings.warn(
/root/anaconda3/lib/python3.8/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.14) or chardet (2.1.1)/charset_normalizer (2.1.1) doesn't match a supported version!
  warnings.warn(
====================== Initialization ======================
rank       : 0
local_rank : 0
world_size : 2
local_size : 2
master     : localhost:1234
device     : 0
cpus       : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]
====================== Initialization ======================
rank       : 1
local_rank : 1
world_size : 2
local_size : 2
master     : localhost:1234
device     : 1
cpus       : [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95]
root
├── encoder (Encoder)
│   ├── layers (TransformerBlockList)
│   │   └── 0-47(CheckpointBlock)
│   │       ├── self_att (SelfAttentionBlock)
│   │       │   ├── layernorm_before_attention (LayerNorm) weight:[0]
│   │       │   └── self_attention (Attention)
│   │       │       ├── project_q,project_v(Linear) weight:[0]
│   │       │       │   └── lora (LowRankLinear) lora_A:[8, 4096] lora_B:[4096, 8]
│   │       │       └── project_k,attention_out(Linear) weight:[0]
│   │       └── ffn (FFNBlock)
│   │           ├── layernorm_before_ffn (LayerNorm) weight:[0]
│   │           └── ffn (FeedForward)
│   │               ├── w_in (DenseGatedACT)
│   │               │   ├── w_0 (Linear) weight:[12587008]
│   │               │   └── w_1 (Linear) weight:[41943040]
│   │               └── w_out (Linear) weight:[41943040]
│   └── output_layernorm (LayerNorm) weight:[2048]
├── segment_embedding (Embedding) weight:[65536]
├── input_embedding (Embedding) weight:[179410944]
└── position_bias (SegmentPositionEmbedding) relative_attention_bias:[24576]
[INFO|(OpenDelta)basemodel:696]2023-02-17 19:26:12,415 >> Trainable Ratio: 6291456/4816502784=0.130623%
[INFO|(OpenDelta)basemodel:698]2023-02-17 19:26:12,415 >> Delta Parameter Ratio: 6291456/4816502784=0.130623%
[INFO|(OpenDelta)basemodel:700]2023-02-17 19:26:12,415 >> Static Memory 8.97 GB, Max Memory 9.63 GB
root
├── encoder (Encoder)
│   ├── layers (TransformerBlockList)
│   │   └── 0-47(CheckpointBlock)
│   │       ├── self_att (SelfAttentionBlock)
│   │       │   ├── layernorm_before_attention (LayerNorm) weight:[4096]
│   │       │   └── self_attention (Attention)
│   │       │       ├── project_q,project_v(Linear) weight:[16777216]
│   │       │       │   └── lora (LowRankLinear) lora_A:[8, 4096] lora_B:[4096, 8]
│   │       │       └── project_k,attention_out(Linear) weight:[16777216]
│   │       └── ffn (FFNBlock)
│   │           ├── layernorm_before_ffn (LayerNorm) weight:[4096]
│   │           └── ffn (FeedForward)
│   │               ├── w_in (DenseGatedACT)
│   │               │   ├── w_0 (Linear) weight:[29356032]
│   │               │   └── w_1 (Linear) weight:[0]
│   │               └── w_out (Linear) weight:[0]
│   └── output_layernorm (LayerNorm) weight:[2048]
├── segment_embedding (Embedding) weight:[65536]
├── input_embedding (Embedding) weight:[179410944]
└── position_bias (SegmentPositionEmbedding) relative_attention_bias:[24576]
[INFO|(OpenDelta)basemodel:696]2023-02-17 19:26:12,508 >> Trainable Ratio: 6291456/4816502784=0.130623%
[INFO|(OpenDelta)basemodel:698]2023-02-17 19:26:12,509 >> Delta Parameter Ratio: 6291456/4816502784=0.130623%
[INFO|(OpenDelta)basemodel:700]2023-02-17 19:26:12,509 >> Static Memory 8.97 GB, Max Memory 10.30 GB
[INFO] Tuning begins...
Traceback (most recent call last):
File "tune_cpm_ant.py", line 47, in <module>
tune.run(data)
File "/search/ai/kaitongyang/cpm_ant_plus/CPM-Live/cpm-live/examples/tune.py", line 220, in run
self.forward(train_dataloader, eval_dataloader, cls_num=self.cls_num)
File "/search/ai/kaitongyang/cpm_ant_plus/CPM-Live/cpm-live/examples/tune.py", line 121, in forward
global_loss = bmt.sum_loss(loss).item()
File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/synchronize.py", line 34, in sum_loss
return distributed.all_reduce(loss, "avg")
File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/distributed/ops.py", line 92, in all_reduce
return OpAllReduce.apply(x, op)
File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/distributed/ops.py", line 50, in forward
ncclAllReduce(
File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/nccl/init.py", line 118, in allReduce
C.ncclAllReduce(
RuntimeError: NCCL Error: invalid argument
Traceback (most recent call last):
File "tune_cpm_ant.py", line 47, in
tune.run(data)
File "/search/ai/kaitongyang/cpm_ant_plus/CPM-Live/cpm-live/examples/tune.py", line 220, in run
self.forward(train_dataloader, eval_dataloader, cls_num=self.cls_num)
File "/search/ai/kaitongyang/cpm_ant_plus/CPM-Live/cpm-live/examples/tune.py", line 121, in forward
global_loss = bmt.sum_loss(loss).item()
File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/synchronize.py", line 34, in sum_loss
return distributed.all_reduce(loss, "avg")
File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/distributed/ops.py", line 92, in all_reduce
return OpAllReduce.apply(x, op)
File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/distributed/ops.py", line 50, in forward
ncclAllReduce(
File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/nccl/init.py", line 118, in allReduce
C.ncclAllReduce(
RuntimeError: NCCL Error: invalid argument
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 308683) of binary: /root/anaconda3/bin/python3
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================
Root Cause:
[0]:
  time: 2023-02-17_19:26:16
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 308683)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
Other Failures:
[1]:
  time: 2023-02-17_19:26:16
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 308684)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
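Since the failure is again inside ncclAllReduce, it may be worth first confirming that a bare BMTrain all-reduce works in this environment at all (a minimal sketch, launched with the same torch.distributed launcher; if this already fails with "invalid argument", the cause is the environment, typically the PyTorch/CUDA/NCCL combination, rather than the tuning script):

```python
# save as nccl_check.py and run with, e.g.:
#   python3 -m torch.distributed.launch --nproc_per_node 2 nccl_check.py
import torch
import bmtrain as bmt

bmt.init_distributed(seed=0)

# A dummy per-rank loss; bmt.sum_loss performs the same NCCL all-reduce
# that raises "invalid argument" inside tune.py's forward().
loss = torch.tensor(float(bmt.rank()), device="cuda")
print("rank", bmt.rank(), "averaged loss:", bmt.sum_loss(loss).item())
```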