cuihaoleo / kaggle-dfdc

2nd place solution for Kaggle Deepfake Detection Challenge
MIT License

Cannot replicate the results! #8

Open leehomyc opened 4 years ago

leehomyc commented 4 years ago

I tried to replicate the Xception results but failed. I strictly followed the data pre-processing and training, yet the log loss on the DFDC public test set is 0.4 at best after multiple runs. When I used the pre-trained Xception model instead, the test loss is 0.3. I have several questions:

cuihaoleo commented 4 years ago

Is 0.3 the exact log loss you got? That is even better than our Kaggle record without WS-DAN (0.3250). We cannot give a specific suggestion without further evidence. We will try the Xception replication in our environment and see what happens.
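For reference, the leaderboard metric is plain binary log loss. A minimal sketch of how it is computed (the clipping epsilon is an assumption, matching common Kaggle practice):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss as scored on the DFDC leaderboard."""
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    y_true = np.asarray(y_true, dtype=float)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```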

It takes us a day to train each epoch. How do you train 20 epochs within a day?

Are you talking about the Xception code or the WS-DAN code? The Xception code should not be that slow: it samples only around 10% of the frames in each epoch, to save time and validate more often. Could your I/O be too slow?
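A minimal sketch of that per-epoch subsampling idea (the class, names, and 10% ratio are illustrative assumptions, not the repo's exact code):

```python
import random
from torch.utils.data import Dataset

class SubsampledFrameDataset(Dataset):
    """Draws a fresh ~10% subset of frames each epoch (hypothetical sketch)."""

    def __init__(self, frame_paths, labels, ratio=0.1):
        self.frame_paths = frame_paths
        self.labels = labels
        self.ratio = ratio
        self.resample()  # pick the first epoch's subset

    def resample(self):
        n = max(1, int(len(self.frame_paths) * self.ratio))
        self.indices = random.sample(range(len(self.frame_paths)), n)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        idx = self.indices[i]
        # actual image loading / augmentation omitted for brevity
        return self.frame_paths[idx], self.labels[idx]

# call dataset.resample() at the start of each epoch to draw a new subset
```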

It saves the best model after running validation at each epoch. However, training one epoch takes a long time, and running validation only once per epoch may not be enough. I found that the model overfits quickly, and the best validation model is not really the best test model (ckpt-1 may be better).

As mentioned above, the code samples only around 10% of the frames in each epoch, so validation is more frequent. And it is very possible that the best validation model is not the best test model, a problem almost every DFDC team ran into.
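The save-best pattern being discussed looks roughly like this; a sketch under assumed names (`train_one_epoch`, `validate` are hypothetical callables), not the repo's exact training loop:

```python
import torch

def fit(model, train_one_epoch, validate, num_epochs=20):
    """Keep every epoch's weights plus a running 'best.pth' by validation loss."""
    best_loss = float("inf")
    for epoch in range(num_epochs):
        train_one_epoch(model)
        val_loss = validate(model)
        torch.save(model.state_dict(), f"ckpt-{epoch}.pth")  # per-epoch checkpoint
        if val_loss < best_loss:                              # best-so-far on validation
            best_loss = val_loss
            torch.save(model.state_dict(), "best.pth")
    return best_loss
```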

For Xception, is there any reason it does not use ImageNet pre-trained weights?

I checked with our members. The Xception model did use ImageNet pre-trained weights for initialization (from https://github.com/Cadene/pretrained-models.pytorch). Sorry, it is not reflected in the code; I will update it later.
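With the Cadene package, that initialization looks like the sketch below; replacing the head with a 2-class linear layer is an assumption for illustration, not necessarily how the repo does it:

```python
import pretrainedmodels
import torch.nn as nn

# load Xception with ImageNet weights from Cadene's pretrained-models.pytorch
model = pretrainedmodels.__dict__["xception"](num_classes=1000, pretrained="imagenet")

# swap the classifier for binary real/fake prediction (assumed head)
model.last_linear = nn.Linear(model.last_linear.in_features, 2)
```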

For the same setting, does the test loss differ a lot between different runs?

Randomness in augmentation (and other parts) could impact the result, but intuitively we don't think it affects it much.
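If you want to measure how much run-to-run randomness matters, one generic approach (not something the repo necessarily does) is to pin every RNG and rerun:

```python
import random
import numpy as np
import torch

def set_seed(seed=42):
    """Pin RNGs so augmentation and initialization repeat across runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True   # trade speed for determinism
    torch.backends.cudnn.benchmark = False
```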

leehomyc commented 4 years ago

Thanks for your answer; it clears up a lot of the questions in my mind. I have one additional question: in your CSV file, the number of frames sometimes does not match the actual number of frames in the video. How did you determine the number of frames in a video?
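For context, the frame count reported by container metadata often disagrees with the number of frames that actually decode. A quick way to compare both, using OpenCV (an assumed tool here, not necessarily how the CSV was built):

```python
import cv2

def frame_counts(path):
    """Return (metadata frame count, actually decodable frame count)."""
    cap = cv2.VideoCapture(path)
    meta = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))  # header/metadata estimate
    decoded = 0
    while True:
        ok, _ = cap.read()                         # decode frames one by one
        if not ok:
            break
        decoded += 1
    cap.release()
    return meta, decoded                           # these two can differ for DFDC mp4s
```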

leehomyc commented 4 years ago

Also, do you notice which epoch usually becomes best.pth? In my case it is ckpt-5.pth; should I train longer, or does it simply converge at epoch 5?

leehomyc commented 4 years ago

Sorry, I should make it clear that 0.4 is our Xception result; when I use your model from Google Drive, the Xception result is 0.3. Also, I switched to the ImageNet pre-trained Xception, and it does not make much of a difference for the final log loss, which is still 0.4+. I did not change any of your code, so I am not sure what went wrong. I noticed that you substitute a random image when a frame does not exist. Does that affect the final results?
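The fallback being described, substituting another frame when the requested one is missing, might look like this sketch (names and behavior are assumptions, not the repo's exact code):

```python
import os
import random
import cv2

def load_frame(frame_paths, idx):
    """Load frame idx; fall back to a random existing frame when it is missing."""
    path = frame_paths[idx]
    if not os.path.exists(path):                         # requested frame absent
        candidates = [p for p in frame_paths if os.path.exists(p)]
        path = random.choice(candidates)                 # substitute a random frame
    return cv2.imread(path)
```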

leehomyc commented 4 years ago

Hi, another question I want to ask: I assume xception-hg-2.pth is the best.pth you saved when running train-xception.py? However, I found that the test log loss of ckpt-1.pth is usually smaller than that of best.pth, although they are both 0.4+. For WS-DAN, could you let me know how you saved ckpt_x.pth and ckpt_e.pth? The code does not seem to show this at all.