Oneflow-Inc / libai

LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
https://libai.readthedocs.io
Apache License 2.0

Add Support for GPT-2 Training on Different Devices #551

Closed · ShawnXuan closed this 1 month ago

ShawnXuan commented 2 months ago

Getting Started

Prepare the Data and Vocabulary

  1. Vocabulary JSON (gpt2-vocab.json)
  2. Merges File (gpt2-merges.txt)
  3. Binary Data File (loss_compara_content_sentence.bin)
  4. Index Data File (loss_compara_content_sentence.idx)
$ tree path/to/gpt_data
path/to/gpt_data
├── gpt2-vocab.json
├── gpt2-merges.txt
├── loss_compara_content_sentence.bin
└── loss_compara_content_sentence.idx
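
The first two files are the standard GPT-2 BPE assets. A hedged way to fetch them into the placeholder directory from the tree above (the mirror URLs come from the Megatron-LM README, not from this issue, so treat them as an assumption); the .bin/.idx pair is a Megatron-style indexed dataset and has to be produced or downloaded separately:

mkdir -p path/to/gpt_data && cd path/to/gpt_data
# Standard GPT-2 BPE vocabulary and merges files; these mirrors are the
# ones Megatron-LM documents (assumption, not taken from this issue).
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt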

How to Train the GPT-2 Model on NPU/XPU

python3 -m oneflow.distributed.launch \
    --nproc_per_node 1 \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr 127.0.0.1 \
    --master_port 12345 \
        tools/train_net.py --config-file=configs/gpt2_pretrain.py \
            graph.enabled=False \
            train.input_placement_device="npu" \
            train.dist.device_type="npu" \
            train.amp.enabled=False \
            model.cfg.scale_mask_softmax_fusion=False \
            model.cfg.bias_gelu_fusion=False

If you want to train on XPU, change 'npu' to 'xpu' in both train.input_placement_device and train.dist.device_type; the sketch below wraps both in a single DEVICE variable.
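
A minimal convenience wrapper, assuming the same single-node setup as above; the script name and the DEVICE variable are illustrative, everything else mirrors the invocation verbatim. The extra overrides disable graph mode, AMP, and the fused softmax/GeLU kernels, presumably because those fused ops are CUDA-only.

#!/usr/bin/env bash
# Usage: bash train_gpt2.sh [npu|xpu]   (script name is illustrative)
DEVICE=${1:-npu}   # "npu" by default; pass "xpu" to train on XPU

python3 -m oneflow.distributed.launch \
    --nproc_per_node 1 \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr 127.0.0.1 \
    --master_port 12345 \
    tools/train_net.py --config-file=configs/gpt2_pretrain.py \
        graph.enabled=False \
        train.input_placement_device="$DEVICE" \
        train.dist.device_type="$DEVICE" \
        train.amp.enabled=False \
        model.cfg.scale_mask_softmax_fusion=False \
        model.cfg.bias_gelu_fusion=False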

ShawnXuan commented 2 months ago

NPU (910B3)

[09/10 10:37:57 libai]: >>> done with building model. Building time: 0.282 seconds
WARNING [09/10 10:37:57 lb.scheduler.lr_scheduler]: warmup iters equals to zero, return CosineLR
[09/10 10:38:03 lb.engine.trainer]: Starting training from iteration 0
[09/10 10:40:56 lb.utils.events]:  eta: 21:00:38  iteration: 19/10000  consumed_samples: 80  total_loss: 9.895  time: 7.5187 s/iter  data_time: 0.0021 s/iter total_throughput: 0.53 samples/s lr: 1.50e-04
[09/10 10:43:32 lb.utils.events]:  eta: 21:05:47  iteration: 39/10000  consumed_samples: 160  total_loss: 9.027  time: 7.6572 s/iter  data_time: 0.0019 s/iter total_throughput: 0.52 samples/s lr: 1.50e-04
[09/10 10:46:05 lb.utils.events]:  eta: 21:06:05  iteration: 59/10000  consumed_samples: 240  total_loss: 8.362  time: 7.6549 s/iter  data_time: 0.0015 s/iter total_throughput: 0.52 samples/s lr: 1.50e-04
[09/10 10:48:42 lb.utils.events]:  eta: 21:08:55  iteration: 79/10000  consumed_samples: 320  total_loss: 7.847  time: 7.7127 s/iter  data_time: 0.0013 s/iter total_throughput: 0.52 samples/s lr: 1.50e-04
[09/10 10:51:22 lb.utils.events]:  eta: 21:18:52  iteration: 99/10000  consumed_samples: 400  total_loss: 7.628  time: 7.7640 s/iter  data_time: 0.0013 s/iter total_throughput: 0.52 samples/s lr: 1.50e-04
[09/10 10:53:53 lb.utils.events]:  eta: 21:04:10  iteration: 119/10000  consumed_samples: 480  total_loss: 7.441  time: 7.7314 s/iter  data_time: 0.0013 s/iter total_throughput: 0.52 samples/s lr: 1.50e-04

CUDA (A100)

[09/10 10:50:47 libai]: >>> done with building model. Building time: 5.722 seconds
WARNING [09/10 10:50:47 lb.scheduler.lr_scheduler]: warmup iters equals to zero, return CosineLR
[09/10 10:50:50 lb.engine.trainer]: Starting training from iteration 0
[09/10 10:50:54 lb.utils.events]:  eta: 0:10:15  iteration: 19/10000  consumed_samples: 80  total_loss: 9.83  time: 0.0689 s/iter  data_time: 0.0008 s/iter total_throughput: 58.05 samples/s lr: 1.50e-04
[09/10 10:50:58 lb.utils.events]:  eta: 0:10:15  iteration: 39/10000  consumed_samples: 160  total_loss: 9.122  time: 0.1458 s/iter  data_time: 0.0007 s/iter total_throughput: 27.43 samples/s lr: 1.50e-04
[09/10 10:51:00 lb.utils.events]:  eta: 0:10:12  iteration: 59/10000  consumed_samples: 240  total_loss: 8.388  time: 0.1214 s/iter  data_time: 0.0007 s/iter total_throughput: 32.94 samples/s lr: 1.50e-04
[09/10 10:51:03 lb.utils.events]:  eta: 0:10:11  iteration: 79/10000  consumed_samples: 320  total_loss: 8.019  time: 0.1357 s/iter  data_time: 0.0008 s/iter total_throughput: 29.48 samples/s lr: 1.50e-04
[09/10 10:51:05 lb.utils.events]:  eta: 0:10:09  iteration: 99/10000  consumed_samples: 400  total_loss: 7.635  time: 0.1232 s/iter  data_time: 0.0008 s/iter total_throughput: 32.47 samples/s lr: 1.50e-04
[09/10 10:51:06 lb.utils.events]:  eta: 0:10:09  iteration: 119/10000  consumed_samples: 480  total_loss: 7.461  time: 0.1132 s/iter  data_time: 0.0008 s/iter total_throughput: 35.34 samples/s lr: 1.50e-04
[09/10 10:51:08 lb.utils.events]:  eta: 0:10:09  iteration: 139/10000  consumed_samples: 560  total_loss: 7.367  time: 0.1061 s/iter  data_time: 0.0009 s/iter total_throughput: 37.72 samples/s lr: 1.50e-04
[09/10 10:51:09 lb.utils.events]:  eta: 0:10:06  iteration: 159/10000  consumed_samples: 640  total_loss: 7.305  time: 0.1003 s/iter  data_time: 0.0008 s/iter total_throughput: 39.88 samples/s lr: 1.50e-04
[09/10 10:51:10 lb.utils.events]:  eta: 0:10:04  iteration: 179/10000  consumed_samples: 720  total_loss: 7.214  time: 0.0975 s/iter  data_time: 0.0008 s/iter total_throughput: 41.02 samples/s lr: 1.50e-04
[09/10 10:51:12 lb.utils.events]:  eta: 0:10:03  iteration: 199/10000  consumed_samples: 800  total_loss: 7.132  time: 0.0940 s/iter  data_time: 0.0007 s/iter total_throughput: 42.55 samples/s lr: 1.50e-04
[09/10 10:51:13 lb.utils.events]:  eta: 0:10:02  iteration: 219/10000  consumed_samples: 880  total_loss: 6.986  time: 0.0911 s/iter  data_time: 0.0008 s/iter total_throughput: 43.93 samples/s lr: 1.50e-04
[09/10 10:51:14 lb.utils.events]:  eta: 0:10:01  iteration: 239/10000  consumed_samples: 960  total_loss: 6.866  time: 0.0886 s/iter  data_time: 0.0009 s/iter total_throughput: 45.15 samples/s lr: 1.50e-04
[09/10 10:51:18 lb.utils.events]:  eta: 0:10:00  iteration: 259/10000  consumed_samples: 1040  total_loss: 6.764  time: 0.0958 s/iter  data_time: 0.0008 s/iter total_throughput: 41.74 samples/s lr: 1.50e-04
[09/10 10:51:19 lb.utils.events]:  eta: 0:09:58  iteration: 279/10000  consumed_samples: 1120  total_loss: 6.655  time: 0.0933 s/iter  data_time: 0.0008 s/iter total_throughput: 42.85 samples/s lr: 1.50e-04
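
As a sanity check on the logged numbers: consumed_samples grows by 80 every 20 iterations on both devices, i.e. 4 samples per iteration, so total_throughput should be about 4 divided by the per-iteration time. A quick check with bc, using values copied from the logs above:

# consumed_samples rises by 80 per 20 iterations -> 4 samples/iter,
# so throughput ~= 4 / time-per-iter (values copied from the logs).
echo "scale=2; 4 / 7.6572" | bc   # NPU 910B3 -> .52 samples/s (log: 0.52)
echo "scale=2; 4 / 0.0911" | bc   # A100      -> 43.90 samples/s (log: 43.93)

At steady state (~7.7 s/iter on the 910B3 vs ~0.09-0.10 s/iter on the A100), the A100 run is therefore roughly 80x faster per iteration at this configuration.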