Trouble reproducing baseline results

gholste commented 2 years ago

Hi there! First of all, I think the paper is really interesting. Second, I appreciate that this is one of the only open-sourced repos I can find that provides training code for the NIH dataset and uses the official train-test split.

I am trying to independently reproduce the DenseNet121 baseline you provide in the paper (over 0.82 AUROC with augmentation), and not really coming close. I am using the exact same data splits as you, but have written my own training and model definition scripts; I am still using pre-trained ImageNet weights provided by torchvision and, as far as I can tell, all the same hyperparameters and preprocessing steps as you have.

I've tried unweighted cross-entropy, class-weighted cross-entropy, heavy augmentation (including elastic deformations, cutout, etc.), light augmentation (e.g., just random crop to 224 and horizontal flip), using 14 classes, using 15 classes (considering "No Finding" to be another output class). No matter what I've tried, I see rather quick convergence and overfitting (best validation loss achieved no later than epoch 7), and the highest validation AUROC I've seen is 0.814. This is considerably lower than the 0.839 validation AUROC you report in Table 2 for BL5-DenseNet121. The absolute best test set results I've achieved with a DenseNet121 architecture is 0.807 AUROC with 8x test-time augmentation.

I'm pretty puzzled by this because I don't think random variation in training or minor implementation differences should cause a >0.015-point drop in AUROC... Of course there are a million potential sources for this difference, but maybe you can help me pinpoint it.

For your final models, did you use 14 or 15 classes? Also, would you be able to provide any sort of training/learning curve showing loss or AUROC vs. # epochs? I am suspicious of how quickly my baseline models are converging, and am wondering how my training trajectories compare to yours.

ekagra-ranjan commented 2 years ago

Hi,

a. Can you provide the following details which might help us narrow the sources of differences:

What optimizer are you using?
What is its learning rate?
What is the weight decay?
What is the config of learning rate scheduler you are using?
What is the batch size?
How many epochs are you training the model for?

b. The models had 14 classes c. We do not have the saved models nor did we save any sort of train/learning curves

gholste commented 2 years ago

Thanks for the quick reply!

1) Adam 2) 1e-4 3) None because in an early experiment I tried 1e-5 and saw no difference in performance (I'm open to reintroducing it though) 4) None again because I tried using "ReduceLROnPlateau" where I'd reduce the learning rate by 10x whenever the validation loss plateaued for 5 epochs 5) 128, which I'm aware is higher than you use. Respectfully, though, I have never experienced or heard of batch size producing such a difference in final performance 6) I've been using early stopping with a patience of 10 epochs. Like I was saying in the original post, in practice, training terminates within 20 epochs or less and the best validation loss is achieved epoch 7 (often earlier) with any hyperparameter setup

In short, I've tried a lot of different things, but perhaps not all together in the exact configuration that you had. I'm happy to re-run my model with hyperparameters more closely mirroring your paper, so let me know! However, I've just never experienced weight decay or batch size to be the difference between a decent baseline (like what I've achieved) and a nearly state-of-the-art model (like yours), even though almost everything else is the same.

If you let me know what exactly you think I should try, I'll be sure to re-run and report back. I appreciate the help.

gholste commented 2 years ago

For thoroughness, here are results I have so far. All models trained on the exact data split you use, torchvision DenseNet121 pretrained on ImageNet defined exactly as you have, trained for at most 20 epochs. For validation, I just resize images to 224x224. For testing, I use 10x test-time augmentation, where I generate 10 augmented versions of each test image (via the same augmentation pipeline for training) and average predictions — I’m aware this is slightly different than just taking 10 crops as you do in the paper.

Resize 256 + Random Crop 224 + Horizontal Flip p=0.5, Adam lr 1e-4 weight decay 1e-5, batch size 128, 14 classes:
- 0.814 test AUROC
- Best validation AUROC was 0.818 achieved at epoch 5
Resize 256 + Random Crop 224 + Horizontal Flip p=0.5, Adam lr 1e-4 weight decay 1e-5, batch size 128, class-weighted loss, label smoothing 0.1, 14 classes:
- 0.81 test AUROC
- Best validation AUROC was 0.822 achieved at epoch 13
Resize 256 + Random Crop 224 + Horizontal Flip p=0.5 + RandomBrightness (0.75, 1.25) + RandomContrast (0.75, 1.25) + RandomRotation (-5°, 5°), Adam lr 1e-4 weight decay 1e-5, batch size 128, class-weighted loss, decay LR by 10x every 5 epochs, 14 classes:
- 0.799 test AUROC
- Best validation AUROC was 0.812 achieved at epoch 9
Resize 256 + Random Crop 224 + Horizontal Flip p=0.5 + RandomBrightness (0.75, 1.25) + RandomContrast (0.75, 1.25) + RandomRotation (-5°, 5°), Adam lr 1e-4 weight decay 1e-5, batch size 128, class-weighted loss, Dropout 0.5 on classifier, 14 classes:
- 0.796 test AUROC
- Highest validation AUROC was 0.812 at epoch 13
Resize 256 + Random Crop 224 + Horizontal Flip p=0.5 + RandomBrightness (0.75, 1.25) + RandomContrast (0.75, 1.25) + RandomRotation (-5°, 5°), Adam lr 1e-4 weight decay 1e-5, batch size 128, class-weighted loss, Dropout 0.5 on classifier, label smoothing 0.1, 14 classes:
- 0.802 test AUROC
- Best validation AUROC was 0.821 at epoch 12
Resize 256 + Random Crop 224 + Horizontal Flip p=0.5, Adam lr 1e-4 weight decay 1e-5, batch size 128, class-weighted loss, Dropout 0.5 on classifier, label smoothing 0.1, 14 classes:
- 0.812 test AUROC
- Best validation AUROC was 0.822 at epoch 12
Resize 256 + Random Crop 224 + Horizontal Flip p=0.5 + RandomContrast (0.75, 1.25) + RandomRotation (-5, 5), Adam lr 1e-4 weight decay 1e-5, batch size 128, class-weighted loss, 14 classes:
- 0.791 test AUROC
- Best validation AUROC was 0.807 at epoch 8

Just by reading your paper as carefully as possible, this last one looks to be the closest to BL5-DenseNet121, and as you can see the performance is considerably lower than the 0.822 AUROC you report.

Of course it’s entirely possible I just have a bug somewhere, but I want to post this for visibility in case others are also having trouble achieving the results that you do. Overall, I’m totally puzzled why none of these reach the validation or test set metrics that you were able to reach despite being so similar to the approach you took.

If you have any suggestions for how to bridge this gap, let me know.

ekagra-ranjan commented 2 years ago

.1 Are you normalizing the input image by subtracting the image with imagenet mean and std dev?

Can you try using the batch size used in the paper? 128 might be too high a number. Smaller batch size lead to noisy grads which might help to escape local minima. You check out this discussion for theoretical and practical experiences of people.

gholste commented 2 years ago

Yes. Specifically, I read in the image as grayscale, perform augmentation, repeat across color channels, min-max normalize intensities to the interval [0, 1], then standardize with ImageNet mean and std. I am using the albumentations library for augmentation instead of torchvision.
I just obtained these results with smaller batch sizes.
- Resize 256 + RandomCrop 224 + RandomHorizontalFlip + RandomContrast (0.75, 1.25) + RandomRotation (-5°, 5°), Adam lr 1e-4 weight decay 1e-5, class-weighted loss, reduce LR 10x every 5 epochs, batch size 16:
- 0.772 test AUROC
- Best validation AUROC was 0.791 at epoch 11

Resize 256 + RandomCrop 224 + RandomHorizontalFlip + RandomContrast (0.75, 1.25) + RandomRotation (-5°, 5°), Adam lr 1e-4 weight decay 1e-5, class-weighted loss, reduce LR 10x every 5 epochs, Dropout 0.5 on classifier, label smoothing 0.1, batch size 32:
- 0.813 test AUROC
- Best validation AUROC was 0.822 at epoch 13
^^ same as above but with batch size 64:
- 0.81 test AUROC
- Best validation AUROC was 0.826 at epoch 13

I'm still at a loss for what is causing this. I suppose I can try using torchvision transforms instead of albumentations but I would be shocked if that's the culprit; I just can't even identify another difference.

To add confusion, this repo was able to reach the metrics that you did, yet their training code doesn't exactly reflect what you did: they use a cosine annealing learning rate scheduler, seemingly a initial learning rate of 5e-4, no class weights on the BCE loss, etc.

I guess I'll keep trying things 🤷

gholste commented 2 years ago

Just finished another run where the only thing I changed was preprocessing: I used torchvision transformations just as you did (virtually copy and pasted your code) instead of albumentations. This, as far as I can tell, is nearly identical to how you trained BL5-DenseNet121... and yet still decidedly not reaching 0.82+ test AUROC.

Resize 256 + RandomCrop 224 + RandomHorizontalFlip + RandomContrast (0.75, 1.25) + RandomRotation (-5°, 5°), Adam lr 1e-4 weight decay 1e-5, reduce LR by 10x every 5 epochs, batch size 16:
- 0.814 test AUROC
- Best validation AUROC was 0.830 at epoch 13

This one seemed most promising since it had the highest validation AUROC I've seen and the highest test AUROC without test-time augmentation (TTA) of 0.813 -- disappointing that 10x TTA only bumped it up to 0.814.

This is so confusing to me, I just can't comprehend what could be causing such a difference in final performance. Is there anything you can think of that I'm missing? Did you use class weights in your loss function?

ekagra-ranjan / AE-CNN

Trouble reproducing baseline results #4