facebookresearch / dora

Dora is an experiment management framework. It expresses grid searches as pure python files as part of your repo. It identifies experiments with a unique hash signature. Scale up to hundreds of experiments without losing your sanity.
MIT License

Slurm Configuration #46

Closed by temismink 1 year ago

temismink commented 1 year ago

❓ Questions

I'm trying to train Demucs on a 4090 from a Jupyter notebook. I'm able to initialize the model, retrieve its parameters from a checkpoint, train the solver, and save it again. I'm having trouble running a grid XP search, though. Any help would be appreciated.

Below is what I am running, with my own custom main class, followed by the error I get. When I look in the grids directory, the `3909beea` folder is there, but it cannot be accessed. There might be a problem with the Slurm config on the GPU, but I am not sure.

`run_grid(main = train, explorer = explorer, grid_name = 'home/robertthomas/Documents/Melody-stems/demucs/demucs/grids/mdx.py', slurm = xp.cfg.slurm)`

Error:

`Grid: Error when trying to load old sheep 3909beea: Could not find experiment with signature 3909beea`

`An error happened when trying to load from /home/robertthomas/Documents/Melody-stems/demucs/outputs/grids/home/robertthomas/Documents/Melody-stems/demucs/demucs/grids/mdx.py/3909beea/job.pkl, this file will be ignored: FileNotFoundError(2, 'No such file or directory')`
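
A minimal sketch for narrowing this down, using only the Python standard library. It takes the `job.pkl` path reported in the error and shows whether the file exists, whether its signature folder exists, and which parent folder is the deepest one actually on disk. The helper name `inspect_sheep_path` is hypothetical and not part of Dora, and the directory layout is taken from the error message alone, not from Dora's documentation.

```python
from pathlib import Path


def inspect_sheep_path(job_file: str) -> dict:
    """Report which parts of a job.pkl path exist on disk.

    Hypothetical helper for debugging; it does not call into Dora.
    """
    path = Path(job_file)
    # Walk up from the file until we reach something that exists.
    deepest = path
    while not deepest.exists() and deepest != deepest.parent:
        deepest = deepest.parent
    return {
        "job_file_exists": path.is_file(),
        "signature_dir_exists": path.parent.is_dir(),
        "deepest_existing_parent": str(deepest),
    }


if __name__ == "__main__":
    # Path copied verbatim from the error message.
    job_file = (
        "/home/robertthomas/Documents/Melody-stems/demucs/outputs/grids/"
        "home/robertthomas/Documents/Melody-stems/demucs/demucs/grids/mdx.py/"
        "3909beea/job.pkl"
    )
    for key, value in inspect_sheep_path(job_file).items():
        print(f"{key}: {value}")
```

If the signature folder exists but `job.pkl` does not, that would match the description above: the folder is present, but the job file the grid is trying to load is missing.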

tejess commented 4 months ago

How were you able to resolve this issue? I'm training an Encodec model using audiocraft, and I want to resume training, but I am getting this error. (PS I should also mention that it's searching for the wrong signature)