yanii opened this issue 8 years ago
One thing to try is to recompute the batch norm statistics on a large portion of the training set, which we briefly mentioned in the blog post.
For ResNet-200, this reduced top-5 error a little: (before) top1: 21.644, top5: 5.905; (after) top1: 21.66, top5: 5.79.
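In case it's useful, a rough sketch of one way to recompute the running statistics is below. It is not the script we used; getBatch() is a hypothetical iterator over preprocessed training batches, and the cumulative-average momentum trick is just one simple option.

```lua
require 'nn'
require 'cunn'
require 'cudnn'

-- Sketch only: stream training batches through the network in training mode
-- so that each batch-norm layer's running mean/variance are re-estimated.
-- getBatch() is a hypothetical iterator returning preprocessed batches.
local function recomputeBatchNormStats(model, getBatch, nBatches)
   -- Collect the batch-normalization layers (both nn and cudnn variants).
   local bns = model:findModules('nn.SpatialBatchNormalization')
   for _, m in ipairs(model:findModules('cudnn.SpatialBatchNormalization')) do
      table.insert(bns, m)
   end

   model:training()
   for k = 1, nBatches do
      -- Momentum 1/k turns the running estimates into a cumulative average
      -- over all batches seen so far.
      for _, bn in ipairs(bns) do
         bn.momentum = 1 / k
      end
      model:forward(getBatch():cuda())
   end
   model:evaluate()
end
```

The 1/k momentum is only there so the estimates end up as an unweighted average rather than an exponential one; running many batches with the default momentum would also work.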
However, I don't think this will fully account for the difference in accuracy you're seeing. I've only trained ResNet-200 once, so I'm not sure how variable the results are.
I ran with basically the same options:
-depth 200 -nGPU 8 -nThreads 12 -batchSize 256 -shareGradInput true -resume . -netType preresnet
I think we're also using cuDNN 4.0.7 and driver 352.79.
Thanks! I'll try recomputing the batch norm stats and see what I get. 5.9 vs 6.1 is much closer than I thought, so maybe it's just the difference in random init. I'm re-running to get some idea of that variation, but yeah, it's quite a lot of compute, so it's difficult to debug if there is an issue! :-)
Do you happen to know which commit the experiment for resnet-200 was run on?
On c5301063e5550c58b186161b94d3a7387bfbf6b6, but there are no relevant changes between that commit and the one you used.
However, the model definition changed yesterday to fix some bugs, so if you pull you'll be training a slightly different model.
OK, started training again on the latest commit, with the fixed model. Will let you know when it eventually finishes!
I also have some problems when trying to reproduce the pre-resnet based on the latest commit.
When using the "both_preact" block to increase the number of channels, both the shortcut and the residual branches share the same pre-(BN+ReLU), as shown in the current code here; this was borrowed from Kaiming's code here.
However, I find that this shared pre-(BN+ReLU) is harmful to performance. On CIFAR-10, the error rate of preresnet-1202 is about 8.5%, but if I use an individual pre-(BN+ReLU) for each of the shortcut and residual branches, the error rate drops to 5%. On ImageNet, things are much worse: with the shared pre-(BN+ReLU), the training error exploded at epoch 23.
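To make the comparison concrete, here is a rough sketch of the two variants using plain nn basic blocks (the layer sizes and helper names are just illustrative, not the actual preresnet.lua definition):

```lua
require 'nn'

-- Shared pre-activation: one BN+ReLU feeds both the residual branch and the
-- 1x1 projection shortcut (the "both_preact" behaviour discussed above).
local function sharedPreact(nIn, nOut, stride)
   local residual = nn.Sequential()
      :add(nn.SpatialConvolution(nIn, nOut, 3, 3, stride, stride, 1, 1))
      :add(nn.SpatialBatchNormalization(nOut))
      :add(nn.ReLU(true))
      :add(nn.SpatialConvolution(nOut, nOut, 3, 3, 1, 1, 1, 1))
   local shortcut = nn.SpatialConvolution(nIn, nOut, 1, 1, stride, stride)
   return nn.Sequential()
      :add(nn.SpatialBatchNormalization(nIn))
      :add(nn.ReLU(true))                 -- output consumed by both branches
      :add(nn.ConcatTable():add(residual):add(shortcut))
      :add(nn.CAddTable(true))
end

-- Individual pre-activation: each branch gets its own BN+ReLU; this is the
-- variant that trains noticeably better in my runs.
local function separatePreact(nIn, nOut, stride)
   local residual = nn.Sequential()
      :add(nn.SpatialBatchNormalization(nIn))
      :add(nn.ReLU(true))
      :add(nn.SpatialConvolution(nIn, nOut, 3, 3, stride, stride, 1, 1))
      :add(nn.SpatialBatchNormalization(nOut))
      :add(nn.ReLU(true))
      :add(nn.SpatialConvolution(nOut, nOut, 3, 3, 1, 1, 1, 1))
   local shortcut = nn.Sequential()
      :add(nn.SpatialBatchNormalization(nIn))
      :add(nn.ReLU(true))
      :add(nn.SpatialConvolution(nIn, nOut, 1, 1, stride, stride))
   return nn.Sequential()
      :add(nn.ConcatTable():add(residual):add(shortcut))
      :add(nn.CAddTable(true))
end
```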
I wonder if it is caused by shareGradInput? When both branches share the pre-BN, during the backward pass its gradInput is first reused by the residual branch of the previous block, which then corrupts the gradInput of the identity branch.
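My rough mental model of what shareGradInput does (heavily simplified, not the actual models/init.lua code) is something like this:

```lua
require 'nn'
require 'cunn'

-- Simplified sketch of the idea only: gradInput buffers of modules of the
-- same type become views onto one shared storage, so a buffer is clobbered
-- as soon as another module of that type runs its backward pass.
local function shareGradInputSketch(model)
   local cache = {}
   model:apply(function(m)
      local key = torch.type(m)
      if torch.isTensor(m.gradInput) then
         cache[key] = cache[key] or torch.CudaStorage(1)
         m.gradInput = torch.CudaTensor(cache[key], 1, 0)
      end
   end)
end
```

If that picture is right, then with a shared pre-(BN+ReLU) the ConcatTable has to sum the gradInputs of its two branches, and if one branch's buffer has already been reused by another module of the same type during the same backward pass, the sum reads corrupted values, which would explain the exploding training error.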
@Cysu, yes I'm seeing the same thing. shareGradInput is the culprit.
I think #48 should fix this. I'm going to check that the accuracy looks OK before merging it in
After retraining pre-ResNet-200 on ImageNet with #48, I'm now getting top-1: 21.960, top-5: 6.230 (without recomputing batch norm stats). So it certainly doesn't decrease error for this experiment.
@yanii, that's about what I got with the updated model definition (top1: 22.050 top5: 6.069). I haven't recomputed batch norm statistics yet.
I am having the same problem: pre-ResNet-101 is quite a bit worse than Sam's numbers. Has anyone had success with training it?
I'm attempting to use your code to evaluate a new method (and it's great to have such accessible code for training state-of-the-art models!), but when training from scratch with your unmodified code, I can't get anywhere close to the validation accuracy you list for the pre-trained models. In particular, for ResNet-101 I get:
Finished top1: 22.942 top5: 6.814
and for your more recent ResNet-200 I get:
Finished top1: 21.681 top5: 6.142
The training script calls for each of these:
th main.lua -depth 101 -batchSize 256 -nGPU 8 -nThreads 8 -shareGradInput true -data ${IMAGENET_DIR}
th main.lua -depth 200 -batchSize 256 -nGPU 8 -nThreads 8 -netType preresnet -shareGradInput true -data ${IMAGENET_DIR}
The ResNet 200 experiment was run on commit a446597550f7e4b15866e07b32c21e63cd45064b
The training log for resnet-200: resnet-200.zip
The machine used for training has 8x Titan X, NVIDIA driver 352.63, CUDA 7.5 and cuDNN 4.0.7.