……
File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/root/PycharmProjects/lightseq/examples/training/huggingface/vit/ls_hf_vit_encoder_layer.py", line 13, in forward
output = super().forward(hidden_states, ls_encoder_padding_mask)
File "/root/anaconda3/lib/python3.8/site-packages/lightseq/training/ops/pytorch/transformer_encoder_layer.py", line 248, in forward
(encoder_padding_mask * -1e8).type_as(hidden_states).contiguous()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
0%| | 0/325 [00:01<?, ?it/s]
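The newly added ViT example fails with the error above as soon as training starts when it is run with the provided script.
The cause is that the new ViT launch script does not use torch.distributed.launch: https://github.com/bytedance/lightseq/blob/4024ae14b90d5eb9f50cb45addf258ebcd283b6c/examples/training/huggingface/vit/run_vit.sh#L18
The script should be changed to start training through torch.distributed.launch,
after which training completes normally.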
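
For reference, here is a rough sketch of the general shape of a torch.distributed.launch invocation, built as an argv list in Python. It is not the exact replacement line for run_vit.sh. The helper name `build_launch_command`, the script name `train_script.py`, and the flag `--some_flag` are hypothetical placeholders; only `-m torch.distributed.launch` and `--nproc_per_node` are real launcher options.

```python
import sys


def build_launch_command(script, script_args=(), nproc_per_node=1):
    """Return an argv list that runs `script` through torch.distributed.launch.

    Hypothetical helper for illustration only; it is not part of lightseq.
    """
    return [
        sys.executable,
        "-m",
        "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",  # number of processes on this node
        script,
        *script_args,
    ]


# Placeholder script name and arguments, NOT taken from run_vit.sh.
cmd = build_launch_command("train_script.py", ["--some_flag", "value"])
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)`, or the printed line can be adapted into the shell script by hand.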