🤖 This is the current blocker for a full hyperparameter search.
I currently have to write the manuscript, so I can't debug in detail. We need a permanent solution shared across projects, hence the issue here.
Is this related to the latest internet shutdown? Maybe, but we have already opened the URLs I could find for wandb, e.g. https://api.wandb.ai.
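A quick way to rule the network in or out: script a reachability check against the wandb hosts (a sketch; the host list is only what I could find, there may be more endpoints):

```python
# Reachability check for the wandb endpoints (a sketch; the host list is only
# what I could find). A timeout or connection error here would point at the
# network/firewall rather than at wandb itself.
import urllib.request

for url in ["https://api.wandb.ai", "https://wandb.ai"]:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(url, "->", resp.status)
    except Exception as e:
        print(url, "->", type(e).__name__, e)
```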
Train single model - seemed to work fine
Run full model training using the `train_models_in_parallel` script
It only runs exactly 10 models, one for each lookahead/model-architecture combination, i.e. it doesn't start more than one model per combination, even though we get […].
This implies there's something wrong with our hyperparameter search.
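One possible explanation (a guess): Hydra's default basic sweeper only expands the cartesian product of the values listed on the command line, so sweeping only model and lookahead gives exactly one job per combination. Sampling several hyperparameter configurations per combination would need explicit value lists per hyperparameter or a sweeper plugin (e.g. hydra-optuna-sweeper). A minimal sketch of the expansion, with hypothetical config names:

```python
# Minimal Hydra app sketch to illustrate the multirun expansion (config names
# are hypothetical). Running e.g.
#   python train.py --multirun model=xgboost,logreg lookahead=30,90
# launches exactly one job per (model, lookahead) combination.
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="train")
def main(cfg: DictConfig) -> None:
    print(f"Training model={cfg.model} lookahead={cfg.lookahead}")

if __name__ == "__main__":
    main()
```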
Running the command directly from the command line
It seems this does not sync correctly; it gets stuck on the last line.
This suggests it's a wandb/Hydra interaction that's causing the problem, but only when using `--multirun`.
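A commonly suggested mitigation when wandb hangs under Hydra multirun is to explicitly re-initialise and finish the run inside each job, using the thread start method, since multirun jobs may share a process. A sketch (untested here; the project name is a placeholder):

```python
# Per-job wandb setup for Hydra multirun (a sketch; the project name is a
# placeholder). reinit allows repeated wandb.init calls in one process, and
# start_method="thread" avoids fork-related hangs.
import wandb

def train_one(config: dict) -> None:
    run = wandb.init(
        project="my-project",  # placeholder
        config=config,
        reinit=True,
        settings=wandb.Settings(start_method="thread"),
    )
    try:
        ...  # actual training goes here
    finally:
        run.finish()  # ensure the run is closed so syncing can complete
```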
Testing without `--multirun`
Huh, same problem!
Testing again, but with just a simple "train model", i.e. circumventing Hydra's CLI interface.
Still not working, that's weird!
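For reference, the minimal test had roughly this shape (a sketch; project and metric names are placeholders, the real code calls our training function):

```python
# Roughly the shape of the minimal test (a sketch; project and metric names
# are placeholders): no Hydra, no multirun, a single run and a single metric.
import wandb

run = wandb.init(project="my-project", name="smoke-test")
run.log({"roc_auc": 0.5})  # dummy metric, just to exercise the sync path
run.finish()  # this is where the hang is observed
```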
It appears we might be hitting rate limiting, since training a single model worked fine the first time and then stopped working?
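One way to probe the rate-limit hypothesis (a sketch; the project name is a placeholder): start several runs back to back with exponential backoff and see whether later attempts behave differently. Note this only catches raised errors, not silent hangs:

```python
# Probe for rate limiting (a sketch; the project name is a placeholder): start
# a few runs back to back with exponential backoff between failures. If early
# attempts succeed and later ones fail, rate limiting is plausible.
import time
import wandb

for attempt in range(5):
    try:
        run = wandb.init(project="my-project", name=f"rate-limit-probe-{attempt}")
        run.finish()
        print(f"attempt {attempt}: ok")
    except Exception as e:
        wait = 2**attempt
        print(f"attempt {attempt}: {type(e).__name__}: {e}; retrying in {wait}s")
        time.sleep(wait)
```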
Proposed next steps for debugging:
- Check if we can train and upload even a single model (note that syncing continues after […])
- If we can, but cannot train in parallel, it appears to be a wandb problem. Potential next steps:
  - Write performance to disk and drop wandb support (see the sketch after this list)
  - Switch to another provider (local/remote)
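If we drop wandb, the fallback could be as simple as appending metrics to a JSONL file per run; a minimal sketch (paths and metric names are hypothetical):

```python
# Minimal disk-based metric logging (a sketch; paths and metric names are
# hypothetical). Each call appends one JSON row to <run_dir>/metrics.jsonl.
import json
import time
from pathlib import Path

def log_metrics(run_dir: Path, metrics: dict) -> None:
    run_dir.mkdir(parents=True, exist_ok=True)
    row = {"timestamp": time.time(), **metrics}
    with (run_dir / "metrics.jsonl").open("a") as f:
        f.write(json.dumps(row) + "\n")

log_metrics(Path("runs/model-0"), {"roc_auc": 0.78, "lookahead": 30})
```

A middle ground would be wandb's offline mode (`WANDB_MODE=offline`) with a later `wandb sync`, which keeps the dashboards but decouples training from the API.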
If anyone is up for debugging, they're more than welcome to go ahead and collect thoughts here! @HLasse, @sarakolding, @signekb, @bokajgd, @erikperfalk.