facebookarchive / fb.resnet.torch

Torch implementation of ResNet from http://arxiv.org/abs/1512.03385 and training scripts

TestOnly on different machines gives different results #202

Closed: YotYot closed this issue 6 years ago

YotYot commented 6 years ago

Hi,

I was fine-tuning a ResNet-18 on my own dataset and got good validation results on an AWS machine. I then copied the whole fb.resnet.torch directory, including the saved checkpoints, along with my dataset, to a different machine. When I ran it again on that machine with "-resume ./checkpoint", I saw that the best checkpoint was indeed loaded, but I couldn't get anywhere near the validation results I had before. It was as if the weights had been reset.
I also compared the installed LuaRocks rocks on both machines, and they are all the same. Any idea what could cause this?

Thanks! Yotam

YotYot commented 6 years ago

Found the issue: a mismatch in cuDNN versions between the two machines. Apparently there are some precision differences between v4 and v5.
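
For anyone debugging a similar discrepancy, here is a minimal Python sketch of one way to compare two machines systematically. It is not part of fb.resnet.torch. The helper names (`file_sha256`, `checkpoint_hashes`, `diff_manifests`) and the manifest keys are hypothetical, and the library versions are assumed to be typed in by hand on each machine. The sketch hashes every file in a checkpoint directory to rule out a bad copy, then reports any manifest entry that differs.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checkpoint_hashes(directory):
    """Map each file under `directory` (relative path) to its SHA-256 digest."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifests(first, second):
    """Return {key: (first_value, second_value)} for every key that differs."""
    keys = set(first) | set(second)
    return {
        k: (first.get(k), second.get(k))
        for k in sorted(keys)
        if first.get(k) != second.get(k)
    }


if __name__ == "__main__":
    # Hypothetical, hand-written manifests: one per machine.
    machine_a = {"cudnn": "v4", "dataset_files": 1000}
    machine_b = {"cudnn": "v5", "dataset_files": 1000}
    print(diff_manifests(machine_a, machine_b))  # {'cudnn': ('v4', 'v5')}
```

In practice, the output of `checkpoint_hashes("./checkpoint")` from each machine could be merged into its manifest before diffing. Identical file hashes alongside a differing library version would point at the environment and away from the copied weights.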