facebookarchive / fb.resnet.torch

Torch implementation of ResNet from http://arxiv.org/abs/1512.03385 and training scripts

Two CIFAR-10 training runs (depth 20 and depth 110) both ended with "Killed" in my terminal #140

Open andyfan0618 opened 7 years ago

andyfan0618 commented 7 years ago

Both runs, cifar10-depth20-batchsize64 and cifar10-depth110-batchsize64, ended with 已砍掉 in the terminal (that is, "Killed").

erogol commented 7 years ago

I can think of two possibilities. One is that your GPU memory is not enough. The other is that, if you use the latest Torch with all the new libraries, there may be a memory leak happening in checkpoints.lua. Try running the code without checkpointing in main.lua.

bearpaw commented 7 years ago

Same problem here.

timh20022002 commented 7 years ago

Same problem here.

andyfan0618 commented 7 years ago

@erogol Thanks for your help! After running without checkpointing in main.lua, training completes without "Killed". So is it caused by a memory leak in the 'save' function inside checkpoints.lua?

colesbury commented 7 years ago

@andyfan0618 There was a memory leak in Torch that was triggered by the checkpointing code. If you update or reinstall Torch, it should be fixed.
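
For anyone who wants to check whether the leak is gone after updating, here is a rough sketch of one way to watch the training process's host memory on Linux. It is not part of this repo. The helper names (`parse_vmrss_kb`, `read_rss_kb`, `grows_steadily`) and the 1 MB step threshold are made up for illustration, and it assumes the "Killed" message comes from the process running out of host RAM, which is typical but not confirmed in this thread.

```python
import os
import re
from typing import List, Optional


def parse_vmrss_kb(status_text: str) -> Optional[int]:
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid: int) -> Optional[int]:
    """Read the current resident set size of a process (Linux only)."""
    try:
        with open(f"/proc/{pid}/status") as f:
            return parse_vmrss_kb(f.read())
    except OSError:
        # Process is gone, or /proc is not available on this platform.
        return None


def grows_steadily(samples_kb: List[int], min_step_kb: int = 1024) -> bool:
    """True if every sample exceeds the previous one by at least min_step_kb.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    if len(samples_kb) < 2:
        return False
    return all(b - a >= min_step_kb for a, b in zip(samples_kb, samples_kb[1:]))


if __name__ == "__main__":
    # Demo on the current process; in practice, pass the PID of the
    # training process and take one sample after each epoch.
    print(read_rss_kb(os.getpid()))
```

If the per-epoch samples keep climbing with checkpointing enabled and stay flat with it disabled, that points at the checkpointing path rather than at GPU memory.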