facebookresearch / deit

Official DeiT repository
Apache License 2.0

Slow Training #234

Closed: mueller-mp closed this issue 11 months ago

mueller-mp commented 1 year ago

Hi and thanks a lot for this codebase!

I'm trying to reproduce your results with a DeiT-small, but training is very slow, at more than 2 h/epoch. I have tried the DeiT commands from the README, both with submitit on a Slurm cluster and without submitit on a VM:

I also tried the DeiT-III command:

and I played with the number of workers, but it made no difference. The bottleneck is data loading: if I reuse the same batch for every forward-backward pass instead of iterating over the dataloader, I get speeds of a few minutes per epoch. Conversely, if I only iterate over the dataloader (without performing any forward-backward passes), I am back to more than 2 h/epoch. I also tried running the timm train script with the same model and a similar setup (batch size, number of GPUs, ...), and there I get training times of a few minutes per epoch, as it should be.
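
For reference, here is a minimal sketch of the kind of isolation test I mean, assuming a plain torchvision ImageFolder + DataLoader pipeline; the dataset path, batch size, and worker count are placeholders, not the exact values from my run, and the real run goes through the deit main.py data setup:

```python
import time

import torch
from torchvision import datasets, transforms

# Placeholder pipeline; the actual run uses the data setup in main.py.
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("/path/to/imagenet/train", transform=transform)
loader = torch.utils.data.DataLoader(
    dataset, batch_size=256, shuffle=True, num_workers=10, pin_memory=True
)

# Iterate over the dataloader alone, with no forward/backward pass:
# if this already takes on the order of 2 h, data loading is the bottleneck.
start = time.time()
for images, targets in loader:
    pass
print(f"dataloader-only pass over one epoch: {time.time() - start:.0f} s")
```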

Do you have any idea why this happens? And are the commands and code you provided exactly those that led to the roughly 3-day runtime for a DeiT-small reported in the paper? I work on an A100 machine with 8 GPUs and 64 CPUs.

Thanks a lot!

jsleeg98 commented 11 months ago

I had the same problem as you.

In my case, the problem occurred when my Python version was 3.8. After changing the Python version to 3.7, the problem was solved.
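
If it helps, a small sanity check one can run inside the training environment to confirm which interpreter and library versions the run actually sees; the choice of fields to print is mine, not something from this thread:

```python
import sys

import torch
import torchvision

# Print the versions visible to the training process; per the report above,
# Python 3.8 vs. 3.7 is what made the difference here.
print("python     :", sys.version.split()[0])
print("torch      :", torch.__version__)
print("torchvision:", torchvision.__version__)
print("cuda       :", torch.version.cuda)
```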

I hope this helps you fix the issue.

mueller-mp commented 11 months ago

Thanks a lot, this resolved it!