facebookarchive / fb.resnet.torch

Torch implementation of ResNet from http://arxiv.org/abs/1512.03385 and training scripts

Issues reproducing results when training from scratch #43

Open · yanii opened this issue 8 years ago

yanii commented 8 years ago

I'm attempting to use your code to evaluate a new method (and it's great to have such accessible code for training state-of-the-art models!), but when training from scratch with your unmodified code I can't get anywhere close to the validation accuracy you list for the pre-trained models. In particular, when training ResNet-101 from scratch, I get:

Finished top1: 22.942 top5: 6.814

and for your more recent ResNet-200 I get:

Finished top1: 21.681 top5: 6.142

The training script calls for each of these:

```
th main.lua -depth 101 -batchSize 256 -nGPU 8 -nThreads 8 -shareGradInput true -data ${IMAGENET_DIR}
th main.lua -depth 200 -batchSize 256 -nGPU 8 -nThreads 8 -netType preresnet -shareGradInput true -data ${IMAGENET_DIR}
```

The ResNet 200 experiment was run on commit a446597550f7e4b15866e07b32c21e63cd45064b

The training log for resnet-200: resnet-200.zip

The machine used for training has 8x Titan X GPUs, NVIDIA driver 352.63, CUDA 7.5, and cuDNN 4.0.7.

colesbury commented 8 years ago

One thing to try is to recompute the batch norm statistics on a large portion of the training set, which we briefly mentioned in the blog post.
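Roughly, the recomputation looks like this (a minimal sketch, not a script from this repo; `model`, `nBatches`, and `getTrainingBatch()` are placeholders):

```lua
-- Minimal sketch: re-estimate BatchNorm running statistics by streaming
-- training batches through the net in training mode (no backward pass).
-- `model`, `nBatches` and `getTrainingBatch()` are placeholders.
require 'cunn'
require 'cudnn'

local bnTypes = {'nn.SpatialBatchNormalization', 'cudnn.SpatialBatchNormalization'}
for _, bnType in ipairs(bnTypes) do
   for _, bn in ipairs(model:findModules(bnType)) do
      bn.running_mean:zero()
      bn.running_var:fill(1)   -- older nn versions name this field running_std
      bn.momentum = 0.01       -- small momentum => longer effective average
   end
end

model:training()               -- BN layers update running stats in this mode
for i = 1, nBatches do         -- cover a large portion of the training set
   model:forward(getTrainingBatch():cuda())
end
model:evaluate()               -- switch back before re-running validation
```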

For ResNet-200, this reduced the top-5 error a little:
(before) top1: 21.644, top5: 5.905
(after) top1: 21.66, top5: 5.79

However, I don't think this will fully account for the difference in accuracy you're seeing. I've only trained ResNet-200 once, so I'm not sure how variable the results are.

I ran with basically the same options: -depth 200 -nGPU 8 -nThreads 12 -batchSize 256 -shareGradInput true -resume . -netType preresnet

I think we're also using cuDNN 4.0.7 and driver 352.79.

yanii commented 8 years ago

Thanks! I'll try recomputing the batch norm stats and see what I get. 5.9 vs 6.1 is much closer than I thought, so maybe it's just the difference in random init. I'm re-running to get some idea of that variation, but yeah, it's quite a lot of compute, so it's difficult to debug if there is an issue! :-)

Do you happen to know which commit the experiment for resnet-200 was run on?

colesbury commented 8 years ago

On c5301063e5550c58b186161b94d3a7387bfbf6b6, but there are no relevant changes between that commit and the one you used.

However, the model definition changed yesterday to fix some bugs, so if you pull you'll be training a slightly different model.

yanii commented 8 years ago

OK, started training again on the latest commit, with the fixed model. Will let you know when it eventually finishes!

Cysu commented 8 years ago

I'm also having some problems trying to reproduce the pre-ResNet results with the latest commit.

When the "both_preact" block is used to increase the number of channels, the shortcut and residual branches share the same pre-(BN+ReLU), as shown in the current code here; it was borrowed from Kaiming's code here.
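To make the distinction concrete, here is a rough nn-style sketch of the two variants (a simplified basic-block form with illustrative layer sizes, not the repo's actual preresnet.lua):

```lua
-- Rough sketch of the two pre-activation variants, using a basic
-- (two-conv) block for brevity; layer parameters are illustrative.
local nn = require 'nn'

-- Shared pre-activation ("both_preact"): a single BN+ReLU feeds both the
-- projection shortcut and the residual branch.
local function sharedPreact(nIn, nOut, stride)
   local shortcut = nn.SpatialConvolution(nIn, nOut, 1, 1, stride, stride)
   local residual = nn.Sequential()
      :add(nn.SpatialConvolution(nIn, nOut, 3, 3, stride, stride, 1, 1))
      :add(nn.SpatialBatchNormalization(nOut))
      :add(nn.ReLU(true))
      :add(nn.SpatialConvolution(nOut, nOut, 3, 3, 1, 1, 1, 1))
   return nn.Sequential()
      :add(nn.SpatialBatchNormalization(nIn))   -- shared pre-BN
      :add(nn.ReLU(true))                       -- shared pre-ReLU
      :add(nn.ConcatTable():add(shortcut):add(residual))
      :add(nn.CAddTable(true))
end

-- Separate pre-activation: each branch applies its own BN+ReLU, so the two
-- branches never read from (or backprop into) the same BN module.
local function separatePreact(nIn, nOut, stride)
   local shortcut = nn.Sequential()
      :add(nn.SpatialBatchNormalization(nIn))
      :add(nn.ReLU(true))
      :add(nn.SpatialConvolution(nIn, nOut, 1, 1, stride, stride))
   local residual = nn.Sequential()
      :add(nn.SpatialBatchNormalization(nIn))
      :add(nn.ReLU(true))
      :add(nn.SpatialConvolution(nIn, nOut, 3, 3, stride, stride, 1, 1))
      :add(nn.SpatialBatchNormalization(nOut))
      :add(nn.ReLU(true))
      :add(nn.SpatialConvolution(nOut, nOut, 3, 3, 1, 1, 1, 1))
   return nn.Sequential()
      :add(nn.ConcatTable():add(shortcut):add(residual))
      :add(nn.CAddTable(true))
end
```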

However, I find that this shared pre-(BN+ReLU) is harmful to performance. On CIFAR-10, the error rate of preresnet-1202 is about 8.5%, but if I use an individual pre-(BN+ReLU) for each of the shortcut and residual branches, the error rate drops to 5%. On ImageNet, things are much worse: with the shared pre-(BN+ReLU), the training error exploded at epoch 23.

I wonder if it is caused by shareGradInput. When both branches share the pre-BN, its gradInput buffer is reused during the backward pass by the residual branch of the previous block, which then corrupts the gradInput of the identity branch.
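For context, my understanding of shareGradInput is that modules of the same type view their gradInput into a shared storage, roughly like this (a condensed sketch, not the exact models/init.lua code):

```lua
-- Condensed sketch of the shareGradInput memory optimization (not the exact
-- models/init.lua code): all modules of a given type point their gradInput
-- at one shared CUDA storage, assuming a gradInput is consumed immediately
-- during backward and never needed again.
local function shareGradInput(model)
   local cache = {}
   model:apply(function(m)
      if torch.isTensor(m.gradInput) then
         local key = torch.type(m)
         cache[key] = cache[key] or torch.CudaStorage(1)
         -- a zero-size view into the shared storage; resized during backward
         m.gradInput = torch.CudaTensor(cache[key], 1, 0)
      end
   end)
end
```

The memory savings rely on each gradInput being needed only once in a plain sequential backward pass; a block whose two branches share one module breaks that assumption.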

colesbury commented 8 years ago

@Cysu, yes I'm seeing the same thing. shareGradInput is the culprit.

colesbury commented 8 years ago

I think #48 should fix this. I'm going to check that the accuracy looks OK before merging it in.

yanii commented 8 years ago

After retraining pre-ResNet-200 on ImageNet with #48, I'm now getting an error of top-1: 21.960, top-5: 6.230 (without recomputing batch stats). So it certainly doesn't decrease the error for this experiment.

colesbury commented 8 years ago

@yanii, that's about what I got with the updated model definition (top1: 22.050 top5: 6.069). I haven't recomputed batch norm statistics yet.

szagoruyko commented 8 years ago

I am having the same problem: pre-ResNet-101 is quite a bit worse than Sam's numbers. Has anyone had success with training it?