Hi, I am running experiments with ConvNeXt-Base and ViT-B on ImageNet-1k, which the paper says have roughly the same throughput (images/sec) and a similar number of parameters. However, my training is about 2x slower for ConvNeXt-B. I am using the ConvNeXt-B code from this repo, and my own PyTorch Lightning dataloading pipeline for both ConvNeXt-B and ViT-B.
I use AMP and 224x224 inputs for both, and the same hyperparameters for pretty much everything else (mostly the original papers' hyperparameters for each model). I used batch size 128 for ViT and 256 for ConvNeXt, with learning rates scaled to match the papers. On 8 A100s, ConvNeXt takes about 15.5 min/epoch, while ViT takes about 8 min/epoch. I am trying to think of any reason for the discrepancy.
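(By scaled learning rates I mean the standard linear scaling rule; the base values in this sketch are placeholders, not the papers' exact numbers.)

```python
# Linear LR scaling rule: lr grows proportionally with the total batch size.
# base_lr and reference_batch_size are placeholders, not the papers' values.
def scaled_lr(base_lr: float, total_batch_size: int, reference_batch_size: int) -> float:
    return base_lr * total_batch_size / reference_batch_size

# e.g. ConvNeXt at 256 per GPU across 8 GPUs, against a placeholder reference:
print(scaled_lr(4e-3, 256 * 8, 4096))  # -> 0.002
```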
Some other things I tried:

- Setting cudnn.benchmark = True, which did not change anything.
- Running on a single GPU (a local T4) with the exact same batch size for both models; ConvNeXt is still about 1.7x slower (see the timing sketch below).
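For what it's worth, here is a minimal standalone step-time check on synthetic data (so my Lightning dataloading is out of the equation, and any remaining gap is the models themselves); the timm model names are stand-ins for the actual repo code I am training with:

```python
import time

import timm
import torch
import torch.nn.functional as F

def train_step_ms(name, batch_size=128, steps=20, warmup=5):
    model = timm.create_model(name, num_classes=1000).cuda()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")
    y = torch.randint(0, 1000, (batch_size,), device="cuda")
    for i in range(steps):
        if i == warmup:  # skip the first iterations (cudnn autotune, allocator warmup)
            torch.cuda.synchronize()
            start = time.time()
        opt.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():  # AMP, as in my training runs
            loss = F.cross_entropy(model(x), y)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
    torch.cuda.synchronize()
    return 1000 * (time.time() - start) / (steps - warmup)

for name in ("convnext_base", "vit_base_patch16_224"):
    print(name, f"{train_step_ms(name):.1f} ms/step")
```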
Did anybody else experience something similar?

Thanks!

Did you measure the inference speed, and does it match the paper's relative numbers? The training speed comparison could come out differently, and it also depends on the hardware and library versions used.
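Something like this rough forward-only check (untested sketch; the timm model names are an assumption on my side, substitute the exact model classes you are training) should tell you whether pure inference throughput matches the paper:

```python
import time

import timm
import torch

# Inference throughput in the paper's setting: batch inference, 224x224, AMP.
@torch.no_grad()
def throughput(name, batch_size=256, steps=30, warmup=10):
    model = timm.create_model(name, num_classes=1000).cuda().eval()
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")
    with torch.cuda.amp.autocast():
        for i in range(steps):
            if i == warmup:  # exclude warmup iterations from the measurement
                torch.cuda.synchronize()
                start = time.time()
            model(x)
    torch.cuda.synchronize()
    return batch_size * (steps - warmup) / (time.time() - start)

for name in ("convnext_base", "vit_base_patch16_224"):
    print(name, f"{throughput(name):.0f} img/s")
```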