bytedance / lightseq

LightSeq: A High Performance Library for Sequence Processing and Generation

The newly added ViT launch script does not use torch.distributed.launch, so running it raises an error #312

Open h2bit opened 2 years ago

h2bit commented 2 years ago

Running the newly added ViT example with the provided script raises an error as soon as training starts:

 ……
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/PycharmProjects/lightseq/examples/training/huggingface/vit/ls_hf_vit_encoder_layer.py", line 13, in forward
    output = super().forward(hidden_states, ls_encoder_padding_mask)
  File "/root/anaconda3/lib/python3.8/site-packages/lightseq/training/ops/pytorch/transformer_encoder_layer.py", line 248, in forward
    (encoder_padding_mask * -1e8).type_as(hidden_states).contiguous()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

  0%|                                                                                     | 0/325 [00:01<?, ?it/s]

The cause is that the newly added ViT launch script does not use torch.distributed.launch: https://github.com/bytedance/lightseq/blob/4024ae14b90d5eb9f50cb45addf258ebcd283b6c/examples/training/huggingface/vit/run_vit.sh#L18
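For context: torch.distributed.launch spawns one worker process per GPU and sets a LOCAL_RANK environment variable for each, which the training script reads to pin itself to a single device. When the script is started directly, that variable is absent, so tensors can end up on mismatched devices and the failure can surface as the illegal-memory-access error above. A minimal sketch of the rank lookup (hypothetical helper name, not LightSeq code):

```python
import os

def resolve_local_rank(env=None):
    """Return the GPU index this worker should bind to.

    torch.distributed.launch sets LOCAL_RANK for every worker it
    spawns; when the script is started directly the variable is
    absent, and we fall back to -1 (no distributed initialization).
    """
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", -1))

# Launched via torch.distributed.launch --nproc_per_node=2:
# worker 0 sees LOCAL_RANK=0, worker 1 sees LOCAL_RANK=1.
print(resolve_local_rank({"LOCAL_RANK": "1"}))  # 1
print(resolve_local_rank({}))                   # -1
```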

It should be changed to:

python3 -m torch.distributed.launch \
    --nproc_per_node=2 \
    $THIS_DIR/run_vit.py \

With this change, training completes normally:

100%|███████████████████████████████████████████████████████████████████████████| 325/325 [00:47<00:00,  6.83it/s]
[INFO|trainer.py:2166] 2022-05-26 20:16:44,167 >> Saving model checkpoint to /tmp/beans_outputs
[INFO|configuration_utils.py:441] 2022-05-26 20:16:44,167 >> Configuration saved in /tmp/beans_outputs/config.json
[INFO|modeling_utils.py:1378] 2022-05-26 20:16:44,382 >> Model weights saved in /tmp/beans_outputs/pytorch_model.bin
[INFO|feature_extraction_utils.py:351] 2022-05-26 20:16:44,382 >> Feature extractor saved in /tmp/beans_outputs/preprocessor_config.json
***** train metrics *****
  epoch                    =        5.0
  train_loss               =     0.2952
  train_runtime            = 0:00:47.57
  train_samples_per_second =    108.662
  train_steps_per_second   =      6.831
[INFO|trainer.py:2416] 2022-05-26 20:16:44,383 >> ***** Running Evaluation *****
[INFO|trainer.py:2418] 2022-05-26 20:16:44,383 >>   Num examples = 133
[INFO|trainer.py:2421] 2022-05-26 20:16:44,383 >>   Batch size = 8
100%|███████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 18.69it/s]
***** eval metrics *****
  epoch                   =        5.0
  eval_accuracy           =     0.9774
  eval_loss               =     0.1272
  eval_runtime            = 0:00:00.52
  eval_samples_per_second =    254.382
  eval_steps_per_second   =     17.214
neopro12 commented 2 years ago

Thanks, we will provide a distributed version of our training example. Currently, we have not considered multi-node training.