THUDM / GLM

GLM (General Language Model)

Training the glm-10B-chinese model on 4 V100 GPUs exits without printing any error logs #60

Closed. Ant0082 closed this issue 1 year ago

Ant0082 commented 1 year ago

bash scripts/ds_finetune_seq2seq.sh config_tasks/model_blocklm_10B_chinese.sh config_tasks/seq_customization.sh

.......
using world size: 4 and model-parallel size: 1 
 > using dynamic loss scaling
> initializing model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
{'pad': 50000, 'eos': 50000, 'sep': 50001, 'ENC': 50002, 'MASK': 50003, 'unk': 50004, 'sop': 50006, 'eop': 50007, 'gMASK': 50007, 'sMASK': 50008}
> padded vocab (size: 50009) with 39 dummy tokens (new size: 50048)
> found end-of-document token: 50000
VM-0-71-centos:17750:17750 [3] NCCL INFO Bootstrap : Using eth0:10.0.0.71<0>
VM-0-71-centos:17750:17750 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

VM-0-71-centos:17750:17750 [3] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
VM-0-71-centos:17750:17750 [3] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.71<0> [1]vethfd92a29:fe80::70d3:43ff:fee3:a5d7%vethfd92a29<0>
VM-0-71-centos:17750:17750 [3] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 00/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 01/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 02/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 03/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 04/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 05/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 06/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 07/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 08/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 09/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 10/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 11/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 12/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 13/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 14/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 15/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 16/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 17/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 18/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 19/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 20/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 21/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 22/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 23/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 24/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 25/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 26/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 27/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 28/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 29/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 30/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Channel 31/32 :    0
VM-0-71-centos:17750:17811 [3] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
VM-0-71-centos:17748:17748 [1] NCCL INFO Bootstrap : Using eth0:10.0.0.71<0>
VM-0-71-centos:17748:17748 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

VM-0-71-centos:17748:17748 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
VM-0-71-centos:17748:17748 [1] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.71<0> [1]vethfd92a29:fe80::70d3:43ff:fee3:a5d7%vethfd92a29<0>
VM-0-71-centos:17748:17748 [1] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
VM-0-71-centos:17750:17811 [3] NCCL INFO Connected all rings
VM-0-71-centos:17750:17811 [3] NCCL INFO Connected all trees
VM-0-71-centos:17750:17811 [3] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
VM-0-71-centos:17750:17811 [3] NCCL INFO comm 0x7f77a0002010 rank 0 nranks 1 cudaDev 3 busId c0 - Init COMPLETE
VM-0-71-centos:17747:17747 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.71<0>
VM-0-71-centos:17747:17747 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

VM-0-71-centos:17747:17747 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
VM-0-71-centos:17747:17747 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.71<0> [1]vethfd92a29:fe80::70d3:43ff:fee3:a5d7%vethfd92a29<0>
VM-0-71-centos:17747:17747 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 00/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 01/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 02/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 03/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 04/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 05/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 06/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 07/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 08/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 09/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 10/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 11/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 12/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 13/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 14/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 15/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 16/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 17/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 18/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 19/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 20/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 21/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 22/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 23/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 24/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 25/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 26/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 27/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 28/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 29/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 30/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Channel 31/32 :    0
VM-0-71-centos:17748:17815 [1] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
VM-0-71-centos:17749:17749 [2] NCCL INFO Bootstrap : Using eth0:10.0.0.71<0>
VM-0-71-centos:17749:17749 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

VM-0-71-centos:17749:17749 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
VM-0-71-centos:17749:17749 [2] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.71<0> [1]vethfd92a29:fe80::70d3:43ff:fee3:a5d7%vethfd92a29<0>
VM-0-71-centos:17749:17749 [2] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
VM-0-71-centos:17748:17815 [1] NCCL INFO Connected all rings
VM-0-71-centos:17748:17815 [1] NCCL INFO Connected all trees
VM-0-71-centos:17748:17815 [1] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
VM-0-71-centos:17748:17815 [1] NCCL INFO comm 0x7fd9c8002010 rank 0 nranks 1 cudaDev 1 busId a0 - Init COMPLETE
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 00/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 01/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 02/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 03/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 04/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 05/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 06/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 07/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 08/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 09/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 10/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 11/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 12/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 13/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 14/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 15/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 16/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 17/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 18/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 19/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 20/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 21/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 22/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 23/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 24/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 25/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 26/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 27/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 28/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 29/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 30/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Channel 31/32 :    0
VM-0-71-centos:17747:17818 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 00/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 01/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 02/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 03/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 04/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 05/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 06/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 07/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 08/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 09/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 10/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 11/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 12/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 13/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 14/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 15/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 16/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 17/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 18/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 19/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 20/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 21/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 22/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 23/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 24/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 25/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 26/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 27/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 28/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 29/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 30/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Channel 31/32 :    0
VM-0-71-centos:17749:17821 [2] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
VM-0-71-centos:17747:17818 [0] NCCL INFO Connected all rings
VM-0-71-centos:17747:17818 [0] NCCL INFO Connected all trees
VM-0-71-centos:17747:17818 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
VM-0-71-centos:17747:17818 [0] NCCL INFO comm 0x7f7058002010 rank 0 nranks 1 cudaDev 0 busId 90 - Init COMPLETE
Creating customization-train dataset from data/customization
Return 423 train examples
building train and validation dataloaders ...
Creating customization-dev dataset from data/customization
Return 423 dev examples
Creating customization-test dataset from data/customization
Return 423 test examples
building GPT2 model ...
VM-0-71-centos:17749:17821 [2] NCCL INFO Connected all rings
VM-0-71-centos:17749:17821 [2] NCCL INFO Connected all trees
VM-0-71-centos:17749:17821 [2] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
VM-0-71-centos:17749:17821 [2] NCCL INFO comm 0x7f6310002010 rank 0 nranks 1 cudaDev 2 busId b0 - Init COMPLETE
 > number of parameters on model parallel rank 0: 9879633920
[2023-01-04 18:56:26,028] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.4, git-hash=unknown, git-branch=unknown
[2023-01-04 18:56:26,046] [WARNING] [config_utils.py:64:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-01-04 18:56:26,398] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.4, git-hash=unknown, git-branch=unknown
[2023-01-04 18:56:26,403] [WARNING] [config_utils.py:64:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2023-01-04 18:56:26,502] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.4, git-hash=unknown, git-branch=unknown
[2023-01-04 18:56:26,507] [WARNING] [config_utils.py:64:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
DeepSpeed is enabled.
[2023-01-04 18:56:27,189] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.7.4, git-hash=unknown, git-branch=unknown
[2023-01-04 18:56:27,194] [WARNING] [config_utils.py:64:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 00/08 :    0   2   3   1
VM-0-71-centos:17748:18265 [1] NCCL INFO Trees [0] -1/-1/-1->1->3 [1] 3/-1/-1->1->-1 [2] -1/-1/-1->1->3 [3] 3/-1/-1->1->-1 [4] -1/-1/-1->1->3 [5] 3/-1/-1->1->-1 [6] -1/-1/-1->1->3 [7] 3/-1/-1->1->-1
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 01/08 :    0   2   1   3
VM-0-71-centos:17749:18266 [2] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 0/-1/-1->2->3 [2] 3/-1/-1->2->0 [3] 0/-1/-1->2->3 [4] 3/-1/-1->2->0 [5] 0/-1/-1->2->3 [6] 3/-1/-1->2->0 [7] 0/-1/-1->2->3
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 02/08 :    0   1   3   2
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 03/08 :    0   3   1   2
VM-0-71-centos:17750:18267 [3] NCCL INFO Trees [0] 1/-1/-1->3->2 [1] 2/-1/-1->3->1 [2] 1/-1/-1->3->2 [3] 2/-1/-1->3->1 [4] 1/-1/-1->3->2 [5] 2/-1/-1->3->1 [6] 1/-1/-1->3->2 [7] 2/-1/-1->3->1
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 04/08 :    0   2   3   1
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 05/08 :    0   2   1   3
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 06/08 :    0   1   3   2
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 07/08 :    0   3   1   2
VM-0-71-centos:17747:18264 [0] NCCL INFO Trees [0] 2/-1/-1->0->-1 [1] -1/-1/-1->0->2 [2] 2/-1/-1->0->-1 [3] -1/-1/-1->0->2 [4] 2/-1/-1->0->-1 [5] -1/-1/-1->0->2 [6] 2/-1/-1->0->-1 [7] -1/-1/-1->0->2
VM-0-71-centos:17748:18265 [1] NCCL INFO Channel 03 : 1[a0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 01 : 3[c0] -> 0[90] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 00 : 2[b0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 02 : 0[90] -> 1[a0] via P2P/IPC
VM-0-71-centos:17748:18265 [1] NCCL INFO Channel 07 : 1[a0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 05 : 3[c0] -> 0[90] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 04 : 2[b0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 06 : 0[90] -> 1[a0] via P2P/IPC
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 00 : 3[c0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 02 : 2[b0] -> 0[90] via P2P/IPC
VM-0-71-centos:17748:18265 [1] NCCL INFO Channel 01 : 1[a0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 00 : 0[90] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 03 : 3[c0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 03 : 2[b0] -> 0[90] via P2P/IPC
VM-0-71-centos:17748:18265 [1] NCCL INFO Channel 02 : 1[a0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 01 : 0[90] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 04 : 3[c0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 06 : 2[b0] -> 0[90] via P2P/IPC
VM-0-71-centos:17748:18265 [1] NCCL INFO Channel 05 : 1[a0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 04 : 0[90] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 07 : 3[c0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 07 : 2[b0] -> 0[90] via P2P/IPC
VM-0-71-centos:17748:18265 [1] NCCL INFO Channel 06 : 1[a0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 05 : 0[90] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 02 : 3[c0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17748:18265 [1] NCCL INFO Channel 00 : 1[a0] -> 0[90] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 01 : 2[b0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 03 : 0[90] -> 3[c0] via P2P/IPC
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 06 : 3[c0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17748:18265 [1] NCCL INFO Channel 04 : 1[a0] -> 0[90] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 05 : 2[b0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 07 : 0[90] -> 3[c0] via P2P/IPC
VM-0-71-centos:17748:18265 [1] NCCL INFO Connected all rings
VM-0-71-centos:17749:18266 [2] NCCL INFO Connected all rings
VM-0-71-centos:17750:18267 [3] NCCL INFO Connected all rings
VM-0-71-centos:17747:18264 [0] NCCL INFO Connected all rings
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 01 : 2[b0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 02 : 2[b0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 03 : 2[b0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 05 : 2[b0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17748:18265 [1] NCCL INFO Channel 00 : 1[a0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 06 : 2[b0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 02 : 0[90] -> 2[b0] via P2P/IPC
VM-0-71-centos:17748:18265 [1] NCCL INFO Channel 03 : 1[a0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 07 : 2[b0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 03 : 0[90] -> 2[b0] via P2P/IPC
VM-0-71-centos:17748:18265 [1] NCCL INFO Channel 04 : 1[a0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 06 : 0[90] -> 2[b0] via P2P/IPC
VM-0-71-centos:17748:18265 [1] NCCL INFO Channel 07 : 1[a0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17747:18264 [0] NCCL INFO Channel 07 : 0[90] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 01 : 3[c0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 00 : 2[b0] -> 0[90] via P2P/IPC
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 02 : 3[c0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 01 : 2[b0] -> 0[90] via P2P/IPC
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 05 : 3[c0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 04 : 2[b0] -> 0[90] via P2P/IPC
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 06 : 3[c0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Channel 05 : 2[b0] -> 0[90] via P2P/IPC
VM-0-71-centos:17748:18265 [1] NCCL INFO Connected all trees
VM-0-71-centos:17748:18265 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
VM-0-71-centos:17748:18265 [1] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
VM-0-71-centos:17747:18264 [0] NCCL INFO Connected all trees
VM-0-71-centos:17747:18264 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
VM-0-71-centos:17747:18264 [0] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 00 : 3[c0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 01 : 3[c0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 03 : 3[c0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 04 : 3[c0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 05 : 3[c0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18267 [3] NCCL INFO Channel 07 : 3[c0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17749:18266 [2] NCCL INFO Connected all trees
VM-0-71-centos:17749:18266 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
VM-0-71-centos:17749:18266 [2] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
VM-0-71-centos:17750:18267 [3] NCCL INFO Connected all trees
VM-0-71-centos:17750:18267 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
VM-0-71-centos:17750:18267 [3] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
VM-0-71-centos:17749:18266 [2] NCCL INFO comm 0x7f5e98002010 rank 2 nranks 4 cudaDev 2 busId b0 - Init COMPLETE
VM-0-71-centos:17750:18267 [3] NCCL INFO comm 0x7f7320002010 rank 3 nranks 4 cudaDev 3 busId c0 - Init COMPLETE
VM-0-71-centos:17747:18264 [0] NCCL INFO comm 0x7f6bd0002010 rank 0 nranks 4 cudaDev 0 busId 90 - Init COMPLETE
VM-0-71-centos:17748:18265 [1] NCCL INFO comm 0x7fd550002010 rank 1 nranks 4 cudaDev 1 busId a0 - Init COMPLETE
VM-0-71-centos:17747:17747 [0] NCCL INFO Launch mode Parallel
[2023-01-04 18:56:27,405] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Installed CUDA version 11.4 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.4 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.4 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.4 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination

Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
/data/qin/miniconda3/envs/bmb_env/lib/python3.7/site-packages/torch/utils/cpp_extension.py:353: UserWarning: 

                               !! WARNING !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++ 4.8.5) may be ABI-incompatible with PyTorch!
Please use a compiler that is ABI-compatible with GCC 5.0 and above.
See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.

See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6
for instructions on how to install GCC 5 or higher.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                              !! WARNING !!

  warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py37_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.5814266204833984 seconds
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.639704704284668 seconds
Time to load cpu_adam op: 0.6396811008453369 seconds
Time to load cpu_adam op: 0.6400058269500732 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.010000, adam_w=1
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.010000, adam_w=1
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.010000, adam_w=1
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.010000, adam_w=1
[2023-01-04 18:56:30,135] [INFO] [logging.py:68:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adam as basic optimizer
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
[2023-01-04 18:56:30,174] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-01-04 18:56:30,174] [INFO] [utils.py:53:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-01-04 18:56:30,174] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2023-01-04 18:56:30,174] [INFO] [stage_1_and_2.py:140:__init__] Reduce bucket size 50000000
[2023-01-04 18:56:30,174] [INFO] [stage_1_and_2.py:141:__init__] Allgather bucket size 50000000
[2023-01-04 18:56:30,174] [INFO] [stage_1_and_2.py:142:__init__] CPU Offload: True
[2023-01-04 18:56:30,174] [INFO] [stage_1_and_2.py:143:__init__] Round robin gradient partitioning: False
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
/data/qin/miniconda3/envs/bmb_env/lib/python3.7/site-packages/torch/utils/cpp_extension.py:353: UserWarning: 

                               !! WARNING !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++ 4.8.5) may be ABI-incompatible with PyTorch!
Please use a compiler that is ABI-compatible with GCC 5.0 and above.
See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.

See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6
for instructions on how to install GCC 5 or higher.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                              !! WARNING !!

  warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
Emitting ninja build file /root/.cache/torch_extensions/py37_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.26166439056396484 seconds
Loading extension module utils...
Time to load utils op: 0.3021583557128906 seconds
Loading extension module utils...
Time to load utils op: 0.30205678939819336 seconds
Loading extension module utils...
Time to load utils op: 0.30208468437194824 seconds
Rank: 2 partition count [4, 4] and sizes[(2469267456, False), (641024, False)] 
Rank: 1 partition count [4, 4] and sizes[(2469267456, False), (641024, False)] 
Rank: 3 partition count [4, 4] and sizes[(2469267456, False), (641024, False)] 
Rank: 0 partition count [4, 4] and sizes[(2469267456, False), (641024, False)] 
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 00/08 :    0   2   3   1
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 01/08 :    0   2   1   3
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 02/08 :    0   1   3   2
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 03/08 :    0   3   1   2
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 04/08 :    0   2   3   1
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 05/08 :    0   2   1   3
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 06/08 :    0   1   3   2
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 07/08 :    0   3   1   2
VM-0-71-centos:17748:18575 [1] NCCL INFO Trees [0] -1/-1/-1->1->3 [1] 3/-1/-1->1->-1 [2] -1/-1/-1->1->3 [3] 3/-1/-1->1->-1 [4] -1/-1/-1->1->3 [5] 3/-1/-1->1->-1 [6] -1/-1/-1->1->3 [7] 3/-1/-1->1->-1
VM-0-71-centos:17747:18573 [0] NCCL INFO Trees [0] 2/-1/-1->0->-1 [1] -1/-1/-1->0->2 [2] 2/-1/-1->0->-1 [3] -1/-1/-1->0->2 [4] 2/-1/-1->0->-1 [5] -1/-1/-1->0->2 [6] 2/-1/-1->0->-1 [7] -1/-1/-1->0->2
VM-0-71-centos:17749:18574 [2] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 0/-1/-1->2->3 [2] 3/-1/-1->2->0 [3] 0/-1/-1->2->3 [4] 3/-1/-1->2->0 [5] 0/-1/-1->2->3 [6] 3/-1/-1->2->0 [7] 0/-1/-1->2->3
VM-0-71-centos:17750:18576 [3] NCCL INFO Trees [0] 1/-1/-1->3->2 [1] 2/-1/-1->3->1 [2] 1/-1/-1->3->2 [3] 2/-1/-1->3->1 [4] 1/-1/-1->3->2 [5] 2/-1/-1->3->1 [6] 1/-1/-1->3->2 [7] 2/-1/-1->3->1
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 00 : 2[b0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17748:18575 [1] NCCL INFO Channel 03 : 1[a0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 04 : 2[b0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17748:18575 [1] NCCL INFO Channel 07 : 1[a0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 02 : 0[90] -> 1[a0] via P2P/IPC
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 06 : 0[90] -> 1[a0] via P2P/IPC
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 01 : 3[c0] -> 0[90] via P2P/IPC
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 05 : 3[c0] -> 0[90] via P2P/IPC
VM-0-71-centos:17748:18575 [1] NCCL INFO Channel 01 : 1[a0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17748:18575 [1] NCCL INFO Channel 02 : 1[a0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17748:18575 [1] NCCL INFO Channel 05 : 1[a0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17748:18575 [1] NCCL INFO Channel 06 : 1[a0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 02 : 2[b0] -> 0[90] via P2P/IPC
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 00 : 0[90] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 00 : 3[c0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 03 : 2[b0] -> 0[90] via P2P/IPC
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 01 : 0[90] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 03 : 3[c0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 06 : 2[b0] -> 0[90] via P2P/IPC
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 04 : 0[90] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 04 : 3[c0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 07 : 2[b0] -> 0[90] via P2P/IPC
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 05 : 0[90] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 07 : 3[c0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 01 : 2[b0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 03 : 0[90] -> 3[c0] via P2P/IPC
VM-0-71-centos:17748:18575 [1] NCCL INFO Channel 00 : 1[a0] -> 0[90] via P2P/IPC
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 02 : 3[c0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 05 : 2[b0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 07 : 0[90] -> 3[c0] via P2P/IPC
VM-0-71-centos:17748:18575 [1] NCCL INFO Channel 04 : 1[a0] -> 0[90] via P2P/IPC
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 06 : 3[c0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17748:18575 [1] NCCL INFO Connected all rings
VM-0-71-centos:17747:18573 [0] NCCL INFO Connected all rings
VM-0-71-centos:17749:18574 [2] NCCL INFO Connected all rings
VM-0-71-centos:17750:18576 [3] NCCL INFO Connected all rings
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 01 : 2[b0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 02 : 2[b0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 03 : 2[b0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 05 : 2[b0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17748:18575 [1] NCCL INFO Channel 00 : 1[a0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 02 : 0[90] -> 2[b0] via P2P/IPC
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 06 : 2[b0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17748:18575 [1] NCCL INFO Channel 03 : 1[a0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 03 : 0[90] -> 2[b0] via P2P/IPC
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 07 : 2[b0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17748:18575 [1] NCCL INFO Channel 04 : 1[a0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 06 : 0[90] -> 2[b0] via P2P/IPC
VM-0-71-centos:17748:18575 [1] NCCL INFO Channel 07 : 1[a0] -> 3[c0] via P2P/IPC
VM-0-71-centos:17747:18573 [0] NCCL INFO Channel 07 : 0[90] -> 2[b0] via P2P/IPC
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 00 : 2[b0] -> 0[90] via P2P/IPC
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 01 : 3[c0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 01 : 2[b0] -> 0[90] via P2P/IPC
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 02 : 3[c0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 04 : 2[b0] -> 0[90] via P2P/IPC
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 05 : 3[c0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17749:18574 [2] NCCL INFO Channel 05 : 2[b0] -> 0[90] via P2P/IPC
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 06 : 3[c0] -> 1[a0] via P2P/IPC
VM-0-71-centos:17747:18573 [0] NCCL INFO Connected all trees
VM-0-71-centos:17747:18573 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
VM-0-71-centos:17747:18573 [0] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
VM-0-71-centos:17748:18575 [1] NCCL INFO Connected all trees
VM-0-71-centos:17748:18575 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
VM-0-71-centos:17748:18575 [1] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 00 : 3[c0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 01 : 3[c0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 03 : 3[c0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 04 : 3[c0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 05 : 3[c0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17750:18576 [3] NCCL INFO Channel 07 : 3[c0] -> 2[b0] via P2P/IPC
VM-0-71-centos:17749:18574 [2] NCCL INFO Connected all trees
VM-0-71-centos:17749:18574 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
VM-0-71-centos:17749:18574 [2] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
VM-0-71-centos:17750:18576 [3] NCCL INFO Connected all trees
VM-0-71-centos:17750:18576 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
VM-0-71-centos:17750:18576 [3] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
VM-0-71-centos:17748:18575 [1] NCCL INFO comm 0x7fd5500f2ac0 rank 1 nranks 4 cudaDev 1 busId a0 - Init COMPLETE
VM-0-71-centos:17747:18573 [0] NCCL INFO comm 0x7f704c002010 rank 0 nranks 4 cudaDev 0 busId 90 - Init COMPLETE
VM-0-71-centos:17750:18576 [3] NCCL INFO comm 0x7f73201033c0 rank 3 nranks 4 cudaDev 3 busId c0 - Init COMPLETE
VM-0-71-centos:17749:18574 [2] NCCL INFO comm 0x7f5e98102ec0 rank 2 nranks 4 cudaDev 2 busId b0 - Init COMPLETE
VM-0-71-centos:17747:17747 [0] NCCL INFO Launch mode Parallel
[2023-01-04 18:57:18,198] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2023-01-04 18:57:18,201] [INFO] [utils.py:832:see_memory_usage] MA 18.79 GB         Max_MA 18.79 GB         CA 18.81 GB         Max_CA 19 GB 
[2023-01-04 18:57:18,201] [INFO] [utils.py:837:see_memory_usage] CPU Virtual Memory:  used = 82.07 GB, percent = 52.4%
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.24144673347473145 seconds
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
[2023-01-04 19:19:30,067] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 17747
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 4.039730548858643 seconds
[2023-01-04 19:19:47,124] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 17748
[2023-01-04 19:19:47,170] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 17749
[2023-01-04 19:19:51,171] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 17750
[2023-01-04 19:19:54,866] [ERROR] [launch.py:292:sigkill_handler] ['/data/qin/miniconda3/envs/bmb_env/bin/python', '-u', 'finetune_glm.py', '--local_rank=3', '--deepspeed', '--deepspeed_config', 'config_tasks/config_blocklm_10B_cnndm.json', '--finetune', '--experiment-name', 'GLM-10B-chinese-customization_01-04-18-54', '--task', 'customization', '--data-dir', 'data/customization', '--save', 'ckpt/debug_/finetune_checkpoints', '--checkpoint-activations', '--num-workers', '1', '--no-load-lr-scheduler', '--block-lm', '--cloze-eval', '--task-mask', '--num-layers', '48', '--hidden-size', '4096', '--num-attention-heads', '64', '--max-position-embeddings', '1024', '--tokenizer-type', 'ChineseSPTokenizer', '--load-pretrained', '/data/qst/code/GLM/ckpt/glm-10b-chinese', '--epochs', '10', '--lr', '1e-5', '--lr-decay-style', 'linear', '--warmup', '0.06', '--label-smoothing', '0.1', '--save-interval', '10000', '--log-interval', '50', '--eval-interval', '1000', '--eval-iters', '100', '--eval-epoch', '2', '--src-seq-length', '512', '--tgt-seq-length', '128', '--min-tgt-length', '55', '--length-penalty', '0.7', '--no-repeat-ngram-size', '3', '--num-beams', '5', '--select-topk', '--eval-batch-size', '1', '--fp16', '--model-parallel-size', '1', '--overwrite'] exits with return code = -9
duzx16 commented 1 year ago

The process was killed by the operating system. Your CPU memory (not GPU memory) is probably not enough to hold the optimizer states when CPU offload is enabled.
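A rough sanity check with numbers from the log above (this is only a sketch, assuming Adam under ZeRO stage 2 with CPU offload keeps an fp32 master copy plus two fp32 moments per parameter in host RAM):

# Back-of-the-envelope estimate, not the repo's code; 12 bytes/param assumes
# fp32 master weights + exp_avg + exp_avg_sq for Adam with CPU offload.
params = 9_879_633_920            # "number of parameters on model parallel rank 0" from the log
bytes_per_param = 4 * 3           # fp32 master copy + two Adam moments
print(f"~{params * bytes_per_param / 1024**3:.0f} GB of host RAM for optimizer states")
# -> roughly 110 GB; the log already reports 82.07 GB (52.4%) of CPU memory in use,
#    so a SIGKILL from the OOM killer (return code -9) is consistent with running out of RAM.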

Ant0082 commented 1 year ago

The process was killed by the operating system. Your CPU memory (not GPU memory) is probably not enough to hold the optimizer states when CPU offload is enabled.

It works with stage 3, but the following error is reported:

size mismatch for transformer.layers.44.post_attention_layernorm.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.44.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([16384, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.44.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([16384]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.44.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([4096, 16384]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.44.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.45.input_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.45.input_layernorm.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.45.attention.query_key_value.weight: copying a param with shape torch.Size([12288, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.45.attention.query_key_value.bias: copying a param with shape torch.Size([12288]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.45.attention.dense.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.45.attention.dense.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.45.post_attention_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.45.post_attention_layernorm.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.45.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([16384, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.45.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([16384]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.45.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([4096, 16384]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.45.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.46.input_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.46.input_layernorm.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.46.attention.query_key_value.weight: copying a param with shape torch.Size([12288, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.46.attention.query_key_value.bias: copying a param with shape torch.Size([12288]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.46.attention.dense.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.46.attention.dense.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.46.post_attention_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.46.post_attention_layernorm.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.46.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([16384, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.46.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([16384]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.46.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([4096, 16384]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.46.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.47.input_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.47.input_layernorm.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.47.attention.query_key_value.weight: copying a param with shape torch.Size([12288, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.47.attention.query_key_value.bias: copying a param with shape torch.Size([12288]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.47.attention.dense.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.47.attention.dense.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.47.post_attention_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.47.post_attention_layernorm.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.47.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([16384, 4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.47.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([16384]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.47.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([4096, 16384]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.layers.47.mlp.dense_4h_to_h.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.final_layernorm.weight: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
        size mismatch for transformer.final_layernorm.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([0]).
[2023-01-05 11:31:11,109] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 36256
[2023-01-05 11:31:15,802] [ERROR] [launch.py:292:sigkill_handler] ['/data/qin/miniconda3/envs/bmb_env/bin/python', '-u', 'finetune_glm.py', '--local_rank=3', '--deepspeed', '--deepspeed_config', 'config_tasks/config_blocklm_10B_cnndm.json', '--finetune', '--experiment-name', 'GLM-10B-chinese-customization_01-05-11-18', '--task', 'customization', '--data-dir', 'data/customization', '--save', 'ckpt/debug_/finetune_checkpoints', '--checkpoint-activations', '--num-workers', '1', '--no-load-lr-scheduler', '--block-lm', '--cloze-eval', '--task-mask', '--num-layers', '48', '--hidden-size', '4096', '--num-attention-heads', '64', '--max-position-embeddings', '1024', '--tokenizer-type', 'ChineseSPTokenizer', '--load-pretrained', '/data/qst/code/GLM/ckpt/glm-10b-chinese', '--epochs', '10', '--lr', '1e-5', '--lr-decay-style', 'linear', '--warmup', '0.06', '--label-smoothing', '0.1', '--save-interval', '10000', '--log-interval', '50', '--eval-interval', '1000', '--eval-iters', '100', '--eval-epoch', '2', '--src-seq-length', '512', '--tgt-seq-length', '128', '--min-tgt-length', '55', '--length-penalty', '0.7', '--no-repeat-ngram-size', '3', '--num-beams', '5', '--select-topk', '--eval-batch-size', '1', '--fp16', '--model-parallel-size', '1', '--overwrite'] exits with return code = 1
duzx16 commented 1 year ago
the shape in current model is torch.Size([0]).

This is because with stage 3 the model weights are partitioned across multiple GPUs and the state_dict contains only placeholders. I cannot find any instructions on how to load checkpoints that were saved without stage 3. Maybe you can ask the question in the DeepSpeed repo.
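For reference, one known DeepSpeed pattern for copying a non-partitioned checkpoint into a ZeRO stage 3 model is to gather the full parameters first. A minimal sketch, not the repo's code (the checkpoint key layout and helper name are assumptions):

import torch
import torch.distributed as dist
import deepspeed

def load_full_checkpoint_into_zero3(model, ckpt_path):
    # Checkpoint saved without stage 3, i.e. it contains full-sized tensors.
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("module", ckpt)  # key name is an assumption about the checkpoint layout
    # Under ZeRO-3 each parameter is a [0]-shaped placeholder until gathered,
    # which is exactly what the size-mismatch errors above complain about.
    # GatheredParameters materializes the full parameters, lets rank 0 write
    # them, and re-partitions (broadcasting the update) on exit.
    with deepspeed.zero.GatheredParameters(list(model.parameters()), modifier_rank=0):
        if dist.get_rank() == 0:
            model.load_state_dict(state_dict, strict=False)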

ouyangliqi commented 1 year ago

I ran into the same problem. @Ant0082 May I ask how you solved it?

Ant0082 commented 1 year ago

I ran into the same problem. @Ant0082 May I ask how you solved it?

Run the code with the stage 2 config. And https://item.jd.com/100038704859.html
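For anyone landing here later, a hedged sketch of the zero_optimization settings this corresponds to (shown as a Python dict; the JSON config referenced in the launch command uses the same keys, and the exact values are assumptions apart from the bucket sizes, which match the log):

# Sketch only: ZeRO stage 2 with optimizer CPU offload, as in the run above.
# The log also warns that the old "cpu_offload": true flag is deprecated in
# favor of "offload_optimizer".
zero_optimization = {
    "stage": 2,
    "offload_optimizer": {"device": "cpu", "pin_memory": True},
    "reduce_bucket_size": 50_000_000,      # "Reduce bucket size 50000000" in the log
    "allgather_bucket_size": 50_000_000,   # "Allgather bucket size 50000000" in the log
}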

ouyangliqi commented 1 year ago

oh, thanks...