akamaster / pytorch_resnet_cifar10

Proper implementation of ResNet-s for CIFAR10/100 in pytorch that matches description of the original paper.

Epochs chosen different than the paper #6

Closed PabloRR100 closed 5 years ago

PabloRR100 commented 5 years ago

Hi @akamaster,

The train set has 45,000 images. With a batch size of 128, that yields roughly 352 iterations per epoch. In the paper they train the network for 64,000 iterations, which corresponds to about 181 epochs of training.

Please let me know if you agree.

akamaster commented 5 years ago

Yes, I agree, partially. In this code there is no train/val split, so the train set is the full 50k images => ~390 iterations per epoch with batch size 128. Therefore, to match the paper, training should total about 165 epochs, with milestones at epochs 81, 123, and 164. The pretrained networks in the repo were generated with a total of 200 epochs and milestones at 100, 150, and 200.
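For reference, a minimal sketch of the iterations-to-epochs arithmetic in this comment and the one above, assuming a batch size of 128 and a DataLoader that keeps the last partial batch (dropping it shifts each milestone by a fraction of an epoch):

```python
import math

# Convert the paper's iteration counts into epoch counts for a given train-set size.
# Assumes batch size 128 and drop_last=False (the last, smaller batch is kept).
BATCH_SIZE = 128

def iters_to_epochs(iterations, num_train_images, batch_size=BATCH_SIZE):
    iters_per_epoch = math.ceil(num_train_images / batch_size)
    return iterations / iters_per_epoch

for num_images, label in [(45_000, "paper's 45k/5k split"), (50_000, "full 50k train set")]:
    lr_drop1, lr_drop2, stop = (iters_to_epochs(it, num_images) for it in (32_000, 48_000, 64_000))
    print(f"{label}: LR drops at ~{lr_drop1:.0f} and ~{lr_drop2:.0f} epochs, stop at ~{stop:.0f} epochs")
```

This gives roughly 91/136/182 epochs for the paper's 45k split and roughly 82/123/164 for the full 50k train set, i.e. within a rounding step of the milestones quoted above.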

PabloRR100 commented 5 years ago

Thanks for the reply.

A few questions:

Thanks!

dwromero commented 5 years ago

Hey, I have the same comment as Pablo. So, are you using the test set as the validation set? I was just looking at the paper, and they state the following:

"We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split" --> I think this part is indeed lacking on your implementation. I would be happy to add that if you like :)

Cheers, David

kirk86 commented 5 years ago

Most repos I've seen with pretrained models overfit the test set. That's another reason the numbers look so good. At a minimum there should be a train/val/test split.

PabloRR100 commented 5 years ago

Hi @kirk86

Saying the model overfits the test set does not make sense, right? Since the model is not "seeing" (or trying to fit) the test data, it cannot overfit it.

Cheers, Pablo

kirk86 commented 5 years ago

Hi @PabloRR100, it's true that the model is not trying to fit the test data directly, but think about why we use a validation set in the first place. IMHO the validation set is there to control the bias/variance tradeoff, and based on it you modify your model. If you instead use the test set to tune your model based on that tradeoff, how exactly are you not overfitting the test set? Again, IMHO the test set should remain untouched at all times and be exposed only once, at the end, after the model has been trained, to evaluate its generalization capabilities.

akamaster commented 5 years ago

Dear @PabloRR100 and @kirk86, you are both right. However, in current deep learning, even if you do use a validation set to control the bias/variance tradeoff, the fact that everyone publishes better results implicitly means optimizing over (looking into) the test data. Clearly, if a model did not improve on the test set, no one would publish it; therefore, whenever something 'better' appears, it is necessarily overfitting the test data.