facebookresearch / SlowFast

PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.
Apache License 2.0

reproducing error on MAE #606

youngwanLEE closed this issue 2 years ago

youngwanLEE commented 2 years ago

Hi,

Thanks for your excellent work.

When I tried to reproduce the fine-tuning result of MAE-ViT-L on an 8-GPU machine, I ran into the following error.

  File "/home/ywlee/SlowFast/slowfast/models/video_model_builder.py", line 1215, in forward
    ) + torch.repeat_interleave(
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

My command is:

python tools/run_net.py --cfg configs/masked_ssl/k400_VIT_L_16x4_FT.yaml DATA.DECODING_BACKEND pyav DATA.PATH_TO_DATA_DIR "/home/ywlee/data/kinetics400" TRAIN.CHECKPOINT_FILE_PATH ./VIT_L_16x4_MAE_PT.pyth OUTPUT_DIR ./output/k40_mae_vit_large_ft

The only change I made was setting DATA.DECODING_BACKEND to pyav, because the torchvision backend resulted in an error.

My environment:

python==3.8
torch==1.12.0
torchvision==0.11.1

In this environment, I was able to run pre-training with k400_VIT_B_16x4_MAE_PT.yaml. However, when I tried fine-tuning with k400_VIT_B_16x4_FT.yaml, the same error occurred.

youngwanLEE commented 2 years ago

This error is solved. It was caused by the label indexing: my labels ran from 1 to 400 instead of 0 to 399.
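
For anyone hitting the same assert: a class index outside 0..NUM_CLASSES-1 is a common cause of "device-side assert triggered", and because CUDA errors are reported asynchronously, the traceback can point at an unrelated line. Below is a minimal sketch for checking a label list before training. It assumes each line of the list file has the form "path label" separated by a single space; the helper names and the sample paths are hypothetical and not part of SlowFast.

```python
from typing import Iterable, List, Tuple


def label_range(lines: Iterable[str], sep: str = " ") -> Tuple[int, int]:
    """Return (min_label, max_label) over non-empty 'path<sep>label' lines."""
    labels = [
        int(line.strip().rsplit(sep, 1)[1]) for line in lines if line.strip()
    ]
    return min(labels), max(labels)


def labels_valid(lines: Iterable[str], num_classes: int, sep: str = " ") -> bool:
    """True if every label lies in [0, num_classes - 1]."""
    lo, hi = label_range(lines, sep)
    return 0 <= lo and hi < num_classes


def shift_to_zero_based(lines: Iterable[str], sep: str = " ") -> List[str]:
    """Subtract 1 from every label. Only use on lists known to be 1-based."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        path, label = line.rsplit(sep, 1)
        out.append(f"{path}{sep}{int(label) - 1}")
    return out


# Hypothetical 1-based list for a 400-class dataset (paths are made up).
one_based = ["videos/a.mp4 1", "videos/b.mp4 400"]
print(labels_valid(one_based, num_classes=400))  # label 400 is out of range

fixed = shift_to_zero_based(one_based)
print(label_range(fixed), labels_valid(fixed, num_classes=400))
```

Running a check like this on the train and val lists takes a moment and rules out the indexing problem before launching a multi-GPU job.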