RuntimeError: NCCL Error: invalid argument
This error is most likely caused by a mismatch between the CUDA version PyTorch was built with and the CUDA version installed on the machine.
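For reference, one quick way to compare the two (a minimal sketch, not from the repo; the reported versions depend entirely on your environment):

```python
# Minimal environment check: does the CUDA version PyTorch was built with
# match the toolkit/driver installed on the machine?
import subprocess
import torch

print("PyTorch version        :", torch.__version__)
print("PyTorch built with CUDA:", torch.version.cuda)
print("CUDA available         :", torch.cuda.is_available())
print("Visible GPUs           :", torch.cuda.device_count())

# System-side view; `nvidia-smi` additionally reports the highest CUDA
# version the installed driver supports.
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```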
That was indeed the problem. After switching versions it now runs (on 7 GPUs). One question, though: at https://github.com/OpenBMB/CPM-Live/blob/cpm-ant-plus/cpm-live/examples/tune.py#L136 I added the line bmt.print_rank("{}:{}".format(global_step, bmt.rank())). My understanding is that bmt.rank() is the GPU index, but the output only ever shows rank 0 instead of the expected 0, 1, 2, 3, 4, 5, 6, 7, even though all 7 GPUs are clearly in use. How should I interpret this? Thanks. Output below:
bmtrain.print_rank only prints on rank 0. If you want output from every rank, use the plain print instead.
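A minimal sketch of the difference (assuming BMTrain is initialized the same way as in tune.py):

```python
import bmtrain as bmt

bmt.init_distributed(seed=0)  # one process per GPU, started by the torch launcher

# bmt.print_rank: executed by every process, but only rank 0 actually prints,
# which is why the output always shows rank 0.
bmt.print_rank("global_step {} on rank {}".format(0, bmt.rank()))

# plain print: every process prints, so you get one line per GPU (0, 1, 2, ...).
print("global_step {} on rank {}".format(0, bmt.rank()))
```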
Got it, thanks. Looking at the fine-tuning code, it trains with LoRA, i.e. most layers are frozen and only the newly added delta layers are updated. If I want to update all model parameters instead, what should I change? Is it enough to comment out the delta_model part (cpm_ant_plus/CPM-Live/cpm-live/examples/tune_cpm_ant.py)?
I tried this on KdConv_film. Training with LoRA gives decent results, but updating all parameters gives very poor results. Is there something wrong with that change (just commenting out the delta_model part)?
That change itself is fine. Could you check whether the LoRA part in infer_cpm_ant.py was left in (i.e. not commented out)?
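For reference, the switch looks roughly like this (a sketch rather than the exact tune_cpm_ant.py code; get_model is a hypothetical stand-in for however the script builds the CPM-Ant model, and the LoraModel arguments follow OpenDelta's interface):

```python
from opendelta import LoraModel  # OpenDelta

model = get_model(args)  # hypothetical: the CPM-Ant(+) model built earlier in the script

# --- LoRA tuning (the default in the example) ---
# Inject LoRA modules next to the attention projections and freeze everything
# except the newly added deltas.
delta_model = LoraModel(backbone_model=model, modified_modules=["project_q", "project_v"])
delta_model.freeze_module(exclude=["deltas"], set_state_dict=True)

# --- Full-parameter tuning ---
# Comment out the two delta_model lines above: with no freeze_module call,
# every backbone parameter keeps requires_grad=True and the optimizer updates
# the whole model. Full tuning generally needs a much smaller learning rate
# than the 5e-3 used for LoRA in these scripts.
```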
It is commented out, as follows:
Line 26 should not be commented out; without it the fine-tuned checkpoint (best.pt) is never loaded.
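In other words, the inference script still needs the checkpoint-loading step, roughly as sketched here (the class names follow cpm_live.models, the paths are placeholders, and bmt.load assumes the checkpoint was saved with bmt.save):

```python
import bmtrain as bmt
from cpm_live.models import CPMAnt, CPMAntConfig

bmt.init_distributed(seed=0)

config = CPMAntConfig.from_json_file("path/to/cpm-ant-plus-10b.json")  # placeholder path
model = CPMAnt(config)

# Load the checkpoint written by the tuning run. If this line is commented
# out, inference silently runs on the original pre-trained weights and the
# fine-tuned parameters are never used.
bmt.load(model, "path/to/output/best.pt")  # placeholder path
```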
Running inference with cpm_ant_plus (i.e. text_generation.py) works fine, but running scripts/CCPM_ddp.sh fails with the error below.
python3 -m torch.distributed.launch --master_addr localhost --master_port 1234 --nproc_per_node 2 --nnodes 1 tune_cpm_ant.py --dataset-name CCPM --dataset-path cpm_ant_plus/CPM-Live/cpm-live/examples/data/oss_cuge/CCPM --output-path cpm_ant_plus/CPM-Live/cpm-live/examples/fintune_model/CCPM --model-path cpm_ant_plus/CPM-Live/cpm-live/model/cpm-ant-plus-10b.pt --config-path cpm_ant_plus/CPM-Live/cpm-live/model/cpm-ant-plus-10b.json --batch-size 32 --early-stop-patience 10 --eval-interval 50 --tune-maxlen 256 --lr 5e-3 --warmup-iters 50 --epochs 20 --infer-maxlen 1

/root/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torch.distributed.run. Note that --use_env is set by default in torch.distributed.run. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run:***** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
/root/anaconda3/lib/python3.8/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.14) or chardet (2.1.1)/charset_normalizer (2.1.1) doesn't match a supported version!
  warnings.warn(
/root/anaconda3/lib/python3.8/site-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (1.26.14) or chardet (2.1.1)/charset_normalizer (2.1.1) doesn't match a supported version!
  warnings.warn(
====================== Initialization ======================
rank       : 0
local_rank : 0
world_size : 2
local_size : 2
master     : localhost:1234
device     : 0
cpus       : [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]
====================== Initialization ======================
rank       : 1
local_rank : 1
world_size : 2
local_size : 2
master     : localhost:1234
device     : 1
cpus       : [48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95]
root
├── encoder (Encoder)
│   ├── layers (TransformerBlockList)
│   │   └── 0-47(CheckpointBlock)
│   │       ├── self_att (SelfAttentionBlock)
│   │       │   ├── layernorm_before_attention (LayerNorm) weight:[0]
│   │       │   └── self_attention (Attention)
│   │       │       ├── project_q,project_v(Linear) weight:[0]
│   │       │       │   └── lora (LowRankLinear) lora_A:[8, 4096] lora_B:[4096, 8]
│   │       │       └── project_k,attention_out(Linear) weight:[0]
│   │       └── ffn (FFNBlock)
│   │           ├── layernorm_before_ffn (LayerNorm) weight:[0]
│   │           └── ffn (FeedForward)
│   │               ├── w_in (DenseGatedACT)
│   │               │   ├── w_0 (Linear) weight:[12587008]
│   │               │   └── w_1 (Linear) weight:[41943040]
│   │               └── w_out (Linear) weight:[41943040]
│   └── output_layernorm (LayerNorm) weight:[2048]
├── segment_embedding (Embedding) weight:[65536]
├── input_embedding (Embedding) weight:[179410944]
└── position_bias (SegmentPositionEmbedding) relative_attention_bias:[24576]
[INFO|(OpenDelta)basemodel:696]2023-02-17 19:26:12,415 >> Trainable Ratio: 6291456/4816502784=0.130623%
[INFO|(OpenDelta)basemodel:698]2023-02-17 19:26:12,415 >> Delta Parameter Ratio: 6291456/4816502784=0.130623%
[INFO|(OpenDelta)basemodel:700]2023-02-17 19:26:12,415 >> Static Memory 8.97 GB, Max Memory 9.63 GB
root
├── encoder (Encoder)
│   ├── layers (TransformerBlockList)
│   │   └── 0-47(CheckpointBlock)
│   │       ├── self_att (SelfAttentionBlock)
│   │       │   ├── layernorm_before_attention (LayerNorm) weight:[4096]
│   │       │   └── self_attention (Attention)
│   │       │       ├── project_q,project_v(Linear) weight:[16777216]
│   │       │       │   └── lora (LowRankLinear) lora_A:[8, 4096] lora_B:[4096, 8]
│   │       │       └── project_k,attention_out(Linear) weight:[16777216]
│   │       └── ffn (FFNBlock)
│   │           ├── layernorm_before_ffn (LayerNorm) weight:[4096]
│   │           └── ffn (FeedForward)
│   │               ├── w_in (DenseGatedACT)
│   │               │   ├── w_0 (Linear) weight:[29356032]
│   │               │   └── w_1 (Linear) weight:[0]
│   │               └── w_out (Linear) weight:[0]
│   └── output_layernorm (LayerNorm) weight:[2048]
├── segment_embedding (Embedding) weight:[65536]
├── input_embedding (Embedding) weight:[179410944]
└── position_bias (SegmentPositionEmbedding) relative_attention_bias:[24576]
[INFO|(OpenDelta)basemodel:696]2023-02-17 19:26:12,508 >> Trainable Ratio: 6291456/4816502784=0.130623%
[INFO|(OpenDelta)basemodel:698]2023-02-17 19:26:12,509 >> Delta Parameter Ratio: 6291456/4816502784=0.130623%
[INFO|(OpenDelta)basemodel:700]2023-02-17 19:26:12,509 >> Static Memory 8.97 GB, Max Memory 10.30 GB
[INFO] Tuning begins...
Traceback (most recent call last):
File "tune_cpm_ant.py", line 47, in <module>
tune.run(data)
File "/search/ai/kaitongyang/cpm_ant_plus/CPM-Live/cpm-live/examples/tune.py", line 220, in run
self.forward(train_dataloader, eval_dataloader, cls_num=self.cls_num)
File "/search/ai/kaitongyang/cpm_ant_plus/CPM-Live/cpm-live/examples/tune.py", line 121, in forward
global_loss = bmt.sum_loss(loss).item()
File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/synchronize.py", line 34, in sum_loss
return distributed.all_reduce(loss, "avg")
File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/distributed/ops.py", line 92, in all_reduce
return OpAllReduce.apply(x, op)
File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/distributed/ops.py", line 50, in forward
ncclAllReduce(
File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/nccl/init.py", line 118, in allReduce
C.ncclAllReduce(
RuntimeError: NCCL Error: invalid argument
Traceback (most recent call last):
File "tune_cpm_ant.py", line 47, in
tune.run(data)
File "/search/ai/kaitongyang/cpm_ant_plus/CPM-Live/cpm-live/examples/tune.py", line 220, in run
self.forward(train_dataloader, eval_dataloader, cls_num=self.cls_num)
File "/search/ai/kaitongyang/cpm_ant_plus/CPM-Live/cpm-live/examples/tune.py", line 121, in forward
global_loss = bmt.sum_loss(loss).item()
File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/synchronize.py", line 34, in sum_loss
return distributed.all_reduce(loss, "avg")
File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/distributed/ops.py", line 92, in all_reduce
return OpAllReduce.apply(x, op)
File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/distributed/ops.py", line 50, in forward
ncclAllReduce(
File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/nccl/init.py", line 118, in allReduce
C.ncclAllReduce(
RuntimeError: NCCL Error: invalid argument
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 308683) of binary: /root/anaconda3/bin/python3
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================
Root Cause:
[0]:
  time: 2023-02-17_19:26:16
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 308683)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
Other Failures:
[1]:
  time: 2023-02-17_19:26:16
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 308684)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
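Since the failure is again inside ncclAllReduce, it may be worth first confirming that a bare BMTrain all-reduce works in this environment at all (a minimal sketch, launched with the same torch.distributed launcher; if this already fails with "invalid argument", the cause is the environment, typically the PyTorch/CUDA/NCCL combination, rather than the tuning script):

```python
# save as nccl_check.py and run with, e.g.:
#   python3 -m torch.distributed.launch --nproc_per_node 2 nccl_check.py
import torch
import bmtrain as bmt

bmt.init_distributed(seed=0)

# A dummy per-rank loss; bmt.sum_loss performs the same NCCL all-reduce
# that raises "invalid argument" inside tune.py's forward().
loss = torch.tensor(float(bmt.rank()), device="cuda")
print("rank", bmt.rank(), "averaged loss:", bmt.sum_loss(loss).item())
```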