Closed: mueller-mp closed this issue 11 months ago
I had the same problem.
In my case, it occurred with Python 3.8; after downgrading to Python 3.7, the problem went away.
I hope this helps you fix the issue.
Thanks a lot, this resolved it!
Hi and thanks a lot for this codebase!
I'm trying to reproduce your results with a DeiT-small, but training is very slow, at more than 2h/epoch. I have tried the DeiT commands from the readme, both with submitit on a Slurm cluster and without submitit on a VM:
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model deit_small_patch16_224 --batch-size 256 --data-path /path/to/imagenet --output_dir /path/to/save
python run_with_submitit.py --model deit_small_patch16_224 --data-path /path/to/imagenet --partition a100 --ngpus 4 --nodes 1 --batch-size 256
I also tried the DeiT-III command:
python run_with_submitit.py --model deit_small_patch16_LS --data-path /path/to/imagenet --batch 256 --lr 4e-3 --epochs 800 --weight-decay 0.05 --sched cosine --input-size 224 --eval-crop-ratio 1.0 --reprob 0.0 --nodes 1 --ngpus 8 --smoothing 0.0 --warmup-epochs 5 --drop 0.0 --nb-classes 1000 --seed 0 --opt fusedlamb --warmup-lr 1e-6 --mixup .8 --drop-path 0.05 --cutmix 1.0 --unscale-lr --repeated-aug --bce-loss --color-jitter 0.3 --ThreeAugment
I also played with the number of workers, but nothing changed. The bottleneck is data loading: if I reuse the same batch for every forward-backward pass instead of iterating over the dataloader, epochs take only a few minutes. Conversely, if I only iterate over the dataloader (without performing any forward-backward passes), I am back to more than 2h/epoch. I also tried the timm train script with the same model and a similar setup (batch size, number of GPUs, ...), and there I get training times of a few minutes per epoch, as it should be.
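For reference, this is roughly how I separated the two costs. A minimal sketch of the timing experiment, using a synthetic stand-in dataset and a toy model (both are placeholders, not the actual DeiT setup):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; in practice this would be the ImageNet ImageFolder.
data = TensorDataset(torch.randn(256, 3, 224, 224),
                     torch.randint(0, 1000, (256,)))
loader = DataLoader(data, batch_size=128, num_workers=2)

# (a) Iterate the dataloader only -- isolates the data-loading cost.
t0 = time.time()
for images, targets in loader:
    pass
loading_time = time.time() - t0

# (b) Reuse one cached batch for every step -- isolates the compute cost.
images, targets = next(iter(loader))
model = torch.nn.Conv2d(3, 8, 3)  # toy model as a placeholder
t0 = time.time()
for _ in range(len(loader)):
    out = model(images)
    out.sum().backward()
    model.zero_grad()
compute_time = time.time() - t0

print(f"loading: {loading_time:.2f}s, compute: {compute_time:.2f}s")
```

With the real dataset and model, (a) dominates by a wide margin in my setup, which is why I suspect the dataloader rather than the GPUs.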
Do you have any idea why this happens? And are the commands and code you provided exactly those that led to the roughly 3-day runtime for a DeiT-small reported in the paper? I work on an A100 machine with 8 GPUs and 64 CPU cores.
Thanks a lot!