cgnorthcutt / benchmarking-keras-pytorch

🔥 Reproducibly benchmarking Keras and PyTorch models
https://l7.curtisnorthcutt.com/towards-reproducibility-benchmarking-keras-pytorch

PyTorch ResNet50 Validation Accuracy #3

Open ankmathur96 opened 5 years ago

ankmathur96 commented 5 years ago

Hey there!

I came across your project from Jeremy Howard's Twitter. I think it's great to be benchmarking these numbers and keeping them in a single place!

I've tried running your script and ran into some problems that I was hoping you could help diagnose: I ran python imagenet_pytorch_get_predictions.py -m resnet50 -g 0 -b 64 ~/imagenet/ and got

resnet50 completed: 100.00% resnet50: acc1: 0.10%, acc5: 0.27%

I'm using Python 3.7 and PyTorch 1.0.1.post2 and didn't change any of your code except for setting the argparse parameter for batch_size to type=int.

I work pretty regularly with PyTorch and ResNet-50 and was surprised to see ResNet-50 reach only 75.02% validation accuracy. When I evaluate the pretrained ResNet-50 using the code here, I get 76.138% top-1 and 92.864% top-5 accuracy. Specifically, I run:

python main.py -a resnet50 -e -b 64 -j 8 --pretrained ~/imagenet/

I'm using CUDA 9.2 and CUDNN version 7.4.1 and running inference on a NVIDIA V100 on a Google Cloud instance using Ubuntu 16.04.

I'm curious what might be going wrong here and why our results are different - to start with, what version of CUDNN/CUDA did your results originate from?

rwightman commented 5 years ago

There are a lot of factors at play for a given result: PyTorch version, CUDA, PIL, etc. Even changing the image scaling between bicubic and bilinear can have a notable impact. I default to bicubic, but bilinear works better for some models, likely depending on what they were originally trained with.
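
For reference, a minimal sketch of how the interpolation choice enters the usual eval preprocessing (assuming torchvision's standard 256-resize / 224-center-crop pipeline; the function name here is just for illustration):

```python
from PIL import Image
from torchvision import transforms

# Standard ImageNet eval preprocessing; only the resize interpolation differs.
def make_eval_transform(interpolation=Image.BILINEAR):  # or Image.BICUBIC
    return transforms.Compose([
        transforms.Resize(256, interpolation=interpolation),  # shorter side -> 256
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
```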

I have noticed changes in accuracy for many models that I measured over a year ago to now (same weights).

My ResNet50 number with PyTorch 1.0.1.post2 and CUDA 10: Prec@1 75.868, Prec@5 92.872

My old ResNet50 numbers with PyTorch (0.2.0.post1) and CUDA 9.x?: Prec@1 76.130, Prec@5 92.862

A table with some of my old measurements here: https://github.com/rwightman/pytorch-dpn-pretrained

rwightman commented 5 years ago

ResNet50 on PyTorch 1.0.1.post2 and CUDA 10 w/ bilinear instead of bicubic, Prec@1 76.138, Prec@5 92.864 ... matches your numbers @ankmathur96

ankmathur96 commented 5 years ago

Interesting! I should mention that I am using PIL version 5.3.0.post0.

I believe that bilinear is the default in PyTorch transforms (https://github.com/pytorch/vision/blob/master/torchvision/transforms/transforms.py#L182) and it seems this repository is using the default (https://github.com/cgnorthcutt/benchmarking-keras-pytorch/blob/master/imagenet_pytorch_get_predictions.py#L95). It's interesting to note the difference when using bicubic though.

I've also seen variation with different CUDA versions and other setup differences similar to what you're describing. I've seen, for example, a full percentage point drop when using OpenCV's implementation of bilinear resizing, as compared to PIL's. I was unaware, though, that such setup differences could cause a full percentage point drop in this more constrained setting (PyTorch/CUDA/PIL). I found this especially worth highlighting since this repo's evaluation seems to be off by enough that densenet169 performs worse than ResNet-50 in my setup.

Edit: It's worth noting that many such differences due to subtle changes in preprocessing implementations can be eliminated (if need be for a production use case) by fine-tuning with a low learning rate for several epochs.
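
A rough sketch of what that kind of adaptation fine-tune could look like (the `train_loader`, learning rate, and epoch count are illustrative assumptions, not something prescribed in this thread):

```python
import torch
import torchvision

# Fine-tune a pretrained ResNet-50 for a few epochs at a low learning rate so it
# adapts to the (slightly different) preprocessing used at inference time.
model = torchvision.models.resnet50(pretrained=True).cuda()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

model.train()
for epoch in range(3):                       # "several epochs"
    for images, targets in train_loader:     # hypothetical loader using the new resize path
        images, targets = images.cuda(), targets.cuda()
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
```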

rwightman commented 5 years ago

@ankmathur96 yeah, I noticed when I was doing my benchmarking in the past that most of the resnet/densenet models in torchvision were better with the default bilinear, but a number of the ported models (Inception variants, DPN, etc.) were doing better with bicubic.

Fine-tuning can definitely help with these sorts of issues if/when it matters. It's also worth noting that many of the default pretrained weights can pretty easily be surpassed by around 1% or more using different training schedules and augmentation techniques.

FWIW, my densenet169 numbers are very close to this repo's, and lower than my ResNet50 numbers at top-1 but better at top-5.

I'm using Pillow-SIMD 5.3.0.post0

cgnorthcutt commented 5 years ago

@ankmathur96 @rwightman Thanks for finding this. I agree it's likely a PyTorch version / CUDA version incompatibility. Did either of you find a fix? Feel free to send a Pull Request on https://github.com/cgnorthcutt/benchmarking-keras-pytorch/blob/master/imagenet_pytorch_get_predictions.py

ozabluda commented 5 years ago

@ankmathur96

I get 76.138% top-1 accuracy.

@rwightman

My ResNet50 number with PyTorch 1.0.1.post2 and CUDA 10: Prec@1 75.868, Prec@5 92.872
My old ResNet50 numbers with PyTorch (0.2.0.post1) and CUDA 9.x?: Prec@1 76.130, Prec@5 92.862

The difference between 75.868% and 76.130% (a gap of 0.262%) is not statistically significant with only 50,000 validation samples. The standard deviation of a binomial distribution with p=0.76 and n=50,000 is sqrt(0.76*(1-0.76)/50000)*100 = 0.19%.
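
The same back-of-the-envelope calculation in a few lines (treating each of the 50,000 validation predictions as an independent Bernoulli trial):

```python
import math

n = 50_000                               # ImageNet validation images
p = 0.76                                 # approximate top-1 accuracy
se = math.sqrt(p * (1 - p) / n) * 100    # standard deviation in percentage points
print(f"std dev: {se:.2f}%")             # ~0.19%

diff = 76.130 - 75.868                   # observed gap between the two runs
print(f"gap: {diff:.3f}% = {diff / se:.1f} sigma")   # ~1.4 standard deviations
```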

ozabluda commented 5 years ago

@ankmathur96

a full percentage point drop when using OpenCV's implementation of bilinear resizing, as compared to PIL's.

See these two URLs for the differences in bilinear resizing across libraries, or even within the same library and function with different padding options:

https://stackoverflow.com/questions/18104609/interpolating-1-dimensional-array-using-opencv
https://stackoverflow.com/questions/43598373/opencv-resize-result-is-wrong

also see https://hackernoon.com/how-tensorflows-tf-image-resize-stole-60-days-of-my-life-aba5eb093f35

TFv2 now follows Pillow, not OpenCV, if there is a difference between the two... https://github.com/tensorflow/tensorflow/issues/6720

...which doesn't seem to be the case: https://github.com/chainer/onnx-chainer/issues/147
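
A quick way to see the discrepancy yourself (assuming Pillow and opencv-python are installed; the tiny array is just for illustration):

```python
import numpy as np
import cv2
from PIL import Image

# Upsample the same tiny grayscale image with both libraries' bilinear filters.
src = np.arange(16, dtype=np.uint8).reshape(4, 4)

pil_out = np.array(Image.fromarray(src).resize((8, 8), Image.BILINEAR))
cv2_out = cv2.resize(src, (8, 8), interpolation=cv2.INTER_LINEAR)

# The two results generally differ because of different edge handling and
# sample-point conventions, even though both are nominally "bilinear".
print(np.abs(pil_out.astype(int) - cv2_out.astype(int)).max())
```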

ozabluda commented 5 years ago

@calebrob6 Caleb Robinson | How to reproduce ImageNet validation results http://calebrob.com/ml/imagenet/ilsvrc2012/2018/10/22/imagenet-benchmarking.html

For every image in the validation set we need to apply the following process:

  1. Load the image data in a floating point format.
  2. Resize the smallest side of the image to 256 pixels using bicubic interpolation over a 4x4 pixel neighborhood (using OpenCV's resize method with the “INTER_CUBIC” interpolation flag). The larger side should be resized to maintain the original aspect ratio of the image.
  3. Crop the central 224x224 window from the resized image.
  4. Save the image in RGB format. [...] All the steps above are shown in the notebooks from the accompanying GitHub repository.
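
A minimal sketch of those four steps, assuming OpenCV (cv2) and the sizes quoted above:

```python
import cv2
import numpy as np

def preprocess(path, resize_to=256, crop_to=224):
    # 1. Load the image data in floating point; OpenCV reads BGR uint8.
    img = cv2.imread(path).astype(np.float32)

    # 2. Resize the smallest side to 256 with bicubic interpolation,
    #    keeping the original aspect ratio.
    h, w = img.shape[:2]
    scale = resize_to / min(h, w)
    img = cv2.resize(img, (round(w * scale), round(h * scale)),
                     interpolation=cv2.INTER_CUBIC)

    # 3. Crop the central 224x224 window.
    h, w = img.shape[:2]
    top, left = (h - crop_to) // 2, (w - crop_to) // 2
    img = img[top:top + crop_to, left:left + crop_to]

    # 4. Return in RGB channel order (OpenCV loads BGR).
    return img[:, :, ::-1]
```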