facebookresearch / XLM

PyTorch original implementation of Cross-lingual Language Model Pretraining.

cannot find checkpoint to reload in multi-gpu pretraining #327

Closed · colmantse closed this issue 3 years ago

colmantse commented 3 years ago

Hi, I read #51 but I am still confused about how to restore model checkpoints from an aborted training run. I am using 7 GPUs, and in the dump_path I got a myriad of temp directories, as follows:

```
~/XLM$ ls dumped/backup
0bdfwk713k  1d2gb4n8iw  316z0yi763  4tnm7q2uvd  5cz6kdk8md  bgk9pq0a0d  bsuuhkulbx
dxylz24367  ffu9xmrrci  g3dnomej53  nvwyn81s5r  r5ekukbmxe  w3iaqw82iu  zqavxaw25a
```

Inside, each of them looks like this:

```
~/XLM$ ls dumped/backup/0bdfwk713k
params.pkl  train.log-6
```

I noticed that there should be a dump_path/checkpoint.pth file to allow restoring a checkpoint, but it does not seem to exist anywhere in my dump folder.
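
To double-check, I searched the whole dump folder (assuming the checkpoint file would be named `checkpoint.pth`, as in the training code):

```bash
# Search every experiment directory under the dump path for a saved checkpoint;
# assumes the file name checkpoint.pth used by the repo's trainer.
find dumped/ -name 'checkpoint.pth'
```

This returns nothing.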

I am sure that my model trained for at least some 50 epochs, so it must be stored somewhere. It would be great if someone could help me navigate this.
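
For context, once I find the checkpoint I was planning to resume along these lines (a sketch only: the `<exp_id>` placeholder and the trailing flags stand in for my actual run settings, and I took `--reload_checkpoint` from the argument list in train.py):

```bash
# Sketch of the resume command, assuming --reload_checkpoint accepts a path
# to a previously saved checkpoint.pth; other flags match the original run.
export NGPU=7
python -m torch.distributed.launch --nproc_per_node=$NGPU train.py \
    --exp_name backup \
    --dump_path ./dumped \
    --reload_checkpoint dumped/backup/<exp_id>/checkpoint.pth \
    ...
```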