facebookresearch / EGG

EGG: Emergence of lanGuage in Games
MIT License

avoid loading model twice when running on slurm #199

Closed robertodessi closed 3 years ago

robertodessi commented 3 years ago

Description

Avoid loading a model twice, potentially from two different checkpoints.

Motivation and Context

This might be a corner case, but an unwanted model load happens when all three of these conditions are met:

If training is launched NOT under SLURM but with the --preemptable flag set, the overwriting will not happen: EGG assigns a unique name to the checkpoint folder, which therefore contains no existing checkpoints (https://github.com/facebookresearch/EGG/blob/5a68e295c31342385d024ecd5cbff0ff69ee69b0/egg/core/distributed.py#L107-L112), so loading from the latest checkpoint does nothing (https://github.com/facebookresearch/EGG/blob/5a68e295c31342385d024ecd5cbff0ff69ee69b0/egg/core/trainers.py#L307-L316).

However, launching EGG outside SLURM with --preemptable set is pointless.
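To illustrate why the non-SLURM case is harmless, here is a minimal sketch of the behavior described above. It is not EGG's actual code: `unique_checkpoint_dir` and `latest_checkpoint` are hypothetical helpers, and the `*.tar` checkpoint extension is an assumption made for the example. The point is only that a freshly created, uniquely named folder has no checkpoints, so a "load from latest" lookup finds nothing to load.

```python
import uuid
from pathlib import Path
from typing import Optional


def unique_checkpoint_dir(root: Path) -> Path:
    """Hypothetical helper: create a fresh, uniquely named checkpoint folder."""
    path = root / uuid.uuid4().hex
    # exist_ok=False makes a name collision fail loudly instead of reusing a folder.
    path.mkdir(parents=True, exist_ok=False)
    return path


def latest_checkpoint(checkpoint_dir: Path) -> Optional[Path]:
    """Return the most recently modified *.tar file, or None if the folder has none.

    The *.tar extension is an assumption for this sketch.
    """
    candidates = list(checkpoint_dir.glob("*.tar"))
    if not candidates:
        # Nothing to resume from: the caller keeps whatever model it already has.
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

With these helpers, `latest_checkpoint(unique_checkpoint_dir(root))` returns `None` for a new run, which corresponds to the "loading from latest will not do anything" case.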

How Has This Been Tested?

Unit tests (UTs).

robertodessi commented 3 years ago

This should actually be caught by nest here: https://github.com/facebookresearch/EGG/blob/master/egg/nest/nest.py#L140. Let's assume users always use nest when launching jobs on SLURM, and close this.