facebookresearch / dora

Dora is an experiment management framework. It expresses grid searches as pure python files as part of your repo. It identifies experiments with a unique hash signature. Scale up to hundreds of experiments without losing your sanity.
MIT License
269 stars 24 forks

[Feature request] Allow running {first-xp or entirety} of a grid, locally. #8

Closed robert-verkuil closed 3 years ago

robert-verkuil commented 3 years ago

I've been using Dora recently and it's been great. One thing that would help my usage is an easy way to run e.g. the first xp of a grid locally, for debugging purposes. This would be helpful for large, complex sweeps, to quickly squash issues without waiting for xps to schedule.

(So far, as a workaround, I've been printing launcher._argv and then doing dora run ${launcher._argv}.)
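A rough sketch of that workaround, for illustration only. It assumes `launcher._argv` prints as a list of override strings (it is a private attribute, so this may change); `argv_to_run_command` is a hypothetical helper name and the overrides below are made up.

```python
import shlex


def argv_to_run_command(argv):
    """Turn a list of overrides (e.g. a printed launcher._argv) into a shell command."""
    # shlex.join quotes each argument, so values containing spaces survive copy-pasting.
    return shlex.join(["dora", "run", *argv])


# Made-up overrides standing in for whatever launcher._argv printed.
print(argv_to_run_command(["lr=0.01", "optim.name=adam"]))
```

Quoting matters mostly when an override value contains spaces or shell metacharacters; plain `key=value` pairs pass through unchanged.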

adefossez commented 3 years ago

Hey @robert-verkuil, actually one way to do it is

dora run -d -f{SIGNATURE}

with the signature taken from the first item on the grid. This will be equivalent to what you have been doing manually.

I usually have an unused parameter in my config, which I call dummy, specifically to avoid collisions with the main XPs, so I would do something like

dora run -d -f{SIGNATURE} dummy=debug

and then you can just monitor this XP to see if things go as planned. You don't need the dummy parameter if you are just going to quickly run the XP and kill it as soon as the main one actually gets scheduled (this will prevent the two from overwriting each other's checkpoints, etc.).
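As a minimal sketch, the command above could be assembled by a small helper. `debug_command` is a hypothetical name, `abcd1234` is a placeholder signature, and the `dummy` override only works if your own config defines such a parameter; only the `-d` and `-f` flags come from this thread.

```python
import shlex


def debug_command(signature, overrides=()):
    """Build `dora run -d -f<signature> [overrides...]` as a shell string."""
    args = ["dora", "run", "-d", f"-f{signature}", *overrides]
    return shlex.join(args)


# "abcd1234" is a placeholder signature, not a real experiment.
print(debug_command("abcd1234", ["dummy=debug"]))
```

Leaving `overrides` empty gives the plain command without the dummy parameter, for the quick run-and-kill case.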

adefossez commented 3 years ago

And you can use the same trick if an XP has failed in a grid. Just take its signature and run dora run -f{SIG} -d; it will resume the experiment from the last checkpoint and let you debug it locally. Once you are happy with the fix, you can restart the failed XP in the grid with dora grid grid_name -r.
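The two steps above, written out as a sketch. `fix_failed_xp_commands` is a hypothetical helper, and the signature and grid name in the example are placeholders; the commands themselves are the ones quoted in this comment.

```python
import shlex


def fix_failed_xp_commands(signature, grid_name):
    """Return the two commands from the comment above, in the order to run them."""
    return [
        # 1. Resume the failed XP locally from its last checkpoint and debug it.
        shlex.join(["dora", "run", f"-f{signature}", "-d"]),
        # 2. Once the fix is in, restart the failed XP in the grid.
        shlex.join(["dora", "grid", grid_name, "-r"]),
    ]


# Placeholder signature and grid name, for illustration only.
for command in fix_failed_xp_commands("abcd1234", "grid_name"):
    print(command)
```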

adefossez commented 3 years ago

Closing the task, feel free to reopen if you think the solution I offered is not sufficient for your use case :)

robert-verkuil commented 3 years ago

yes, thank you! sorry for not responding. You were really helpful above. I've been:

  1. launching a small grid
  2. following up with -f{SIG} and that's been good!

dummy parameter is a good call as well