reproducing error on MAE

Hi,

Thanks for your excellent work.

When I tried to reproduce the fine-tuning result of MAE-ViT-L on an 8-GPU machine, I ran into the following error:

File "/home/ywlee/SlowFast/slowfast/models/video_model_builder.py", line 1215, in forward
    ) + torch.repeat_interleave(
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
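The error message suggests passing CUDA_LAUNCH_BLOCKING=1 so that the failing kernel is reported at the line that launched it. A minimal sketch of one way to do that from Python is below. The helper names `blocking_env` and `rerun_blocking` are hypothetical and not part of SlowFast; the only thing taken from the message is the environment variable itself.

```python
import os
import subprocess


def blocking_env(base_env=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set.

    The input mapping is not modified. If base_env is None, the current
    process environment is copied instead.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_blocking(argv):
    """Run argv as a child process with synchronous CUDA launches.

    Returns the child's exit code. The variable has to be in place before
    CUDA is initialized, which is why it is passed to a fresh process
    rather than set in the middle of an already running one.
    """
    return subprocess.run(argv, env=blocking_env()).returncode
```

For example, `rerun_blocking(["python", "tools/run_net.py", "--cfg", "configs/masked_ssl/k400_VIT_L_16x4_FT.yaml"])` would rerun the fine-tuning entry point with the variable set. The run will be slower, but the stack trace should point at the operation that actually triggered the assert.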

My command is:

python tools/run_net.py --cfg configs/masked_ssl/k400_VIT_L_16x4_FT.yaml DATA.DECODING_BACKEND pyav DATA.PATH_TO_DATA_DIR: "/home/ywlee/data/kinetics400" TRAIN.CHECKPOINT_FILE_PATH ./VIT_L_16x4_MAE_PT.pyth OUTPUT_DIR ./output/k40_mae_vit_large_ft

The only change I made was setting DATA.DECODING_BACKEND to pyav, because the torchvision backend results in an error.

My environment:

python==3.8
torch==1.12.0
torchvision==0.11.1
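One thing that may be worth checking, though it may be unrelated to the assert: torchvision releases are built against a specific torch release, and the versions listed above can be compared against those pairings. The sketch below hard-codes three pairings from the torchvision compatibility table (torch 1.10 with torchvision 0.11, 1.11 with 0.12, 1.12 with 0.13). The names `EXPECTED_TORCHVISION`, `minor`, and `check_pair` are hypothetical helpers for illustration.

```python
# torch minor version -> torchvision minor version it is released with.
# Only three rows are listed here; extend the table for other releases.
EXPECTED_TORCHVISION = {"1.10": "0.11", "1.11": "0.12", "1.12": "0.13"}


def minor(version):
    """Reduce a version string such as '1.12.0+cu113' to '1.12'."""
    return ".".join(version.split("+")[0].split(".")[:2])


def check_pair(torch_version, torchvision_version):
    """Return (matches, expected_torchvision_minor).

    Both values are None when the torch version is not in the table,
    meaning the sketch cannot say anything about that pairing.
    """
    expected = EXPECTED_TORCHVISION.get(minor(torch_version))
    if expected is None:
        return None, None
    return minor(torchvision_version) == expected, expected


matches, expected = check_pair("1.12.0", "0.11.1")
print(f"matches={matches}, expected torchvision {expected}.x")
```

With the versions from this environment, `check_pair` reports that torch 1.12.0 is normally paired with torchvision 0.13.x rather than 0.11.1. Whether that has anything to do with the torchvision decoding backend failing is only a guess.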

In this environment, I was able to train with k400_VIT_B_16x4_MAE_PT.yaml successfully. However, when I tried to fine-tune with k400_VIT_B_16x4_FT.yaml, the same error as above occurred.

facebookresearch / SlowFast

reproducing error on MAE #606