🤖 This is the current blocker for a full hyperparameter search.
I currently have to write the manuscript, so I can't debug in detail. We need a permanent solution shared across projects, hence the issue here.
Is this related to the latest internet shutdown? Maybe, but we have already opened the URLs I could find for wandb, e.g. https://api.wandb.ai.
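A quick way to rule the network in or out: script a reachability check against the wandb hosts (a sketch; the host list is only what I could find, there may be more endpoints):

```python
# Reachability check for the wandb endpoints (a sketch; the host list is only
# what I could find). A timeout or connection error here would point at the
# network/firewall rather than at wandb itself.
import urllib.request

for url in ["https://api.wandb.ai", "https://wandb.ai"]:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(url, "->", resp.status)
    except Exception as e:
        print(url, "->", type(e).__name__, e)
```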
Train single model - seemed to work fine
Run full model training using the `train_models_in_parallel` script
It only runs exactly 10 models, one for each lookahead/model-architecture combination, i.e. it doesn't start more than one model per combination, even though we get […].
This implies there's something wrong with our hyperparameter search.
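One possible explanation (a guess): Hydra's default basic sweeper only expands the cartesian product of the values listed on the command line, so sweeping only model and lookahead gives exactly one job per combination. Sampling several hyperparameter configurations per combination would need explicit value lists per hyperparameter or a sweeper plugin (e.g. hydra-optuna-sweeper). A minimal sketch of the expansion, with hypothetical config names:

```python
# Minimal Hydra app sketch to illustrate the multirun expansion (config names
# are hypothetical). Running e.g.
#   python train.py --multirun model=xgboost,logreg lookahead=30,90
# launches exactly one job per (model, lookahead) combination.
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="train")
def main(cfg: DictConfig) -> None:
    print(f"Training model={cfg.model} lookahead={cfg.lookahead}")

if __name__ == "__main__":
    main()
```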
Running the command directly from the command line
It seems this does not sync correctly; it gets stuck on the last line.
This suggests it's a wandb/Hydra interaction that's causing the problem, but only when using `--multirun`.
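A commonly suggested mitigation when wandb hangs under Hydra multirun is to explicitly re-initialise and finish the run inside each job, using the thread start method, since multirun jobs may share a process. A sketch (untested here; the project name is a placeholder):

```python
# Per-job wandb setup for Hydra multirun (a sketch; the project name is a
# placeholder). reinit allows repeated wandb.init calls in one process, and
# start_method="thread" avoids fork-related hangs.
import wandb

def train_one(config: dict) -> None:
    run = wandb.init(
        project="my-project",  # placeholder
        config=config,
        reinit=True,
        settings=wandb.Settings(start_method="thread"),
    )
    try:
        ...  # actual training goes here
    finally:
        run.finish()  # ensure the run is closed so syncing can complete
```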
Testing without `--multirun`
Huh, same problem!
Testing again, but with just a simple "train model", i.e. circumventing Hydra's CLI interface.
Still not working, that's weird!
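For reference, the minimal test had roughly this shape (a sketch; project and metric names are placeholders, the real code calls our training function):

```python
# Roughly the shape of the minimal test (a sketch; project and metric names
# are placeholders): no Hydra, no multirun, a single run and a single metric.
import wandb

run = wandb.init(project="my-project", name="smoke-test")
run.log({"roc_auc": 0.5})  # dummy metric, just to exercise the sync path
run.finish()  # this is where the hang is observed
```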
It appears we might be hitting rate limiting, since training a single model worked fine the first time and then stopped working?
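One way to probe the rate-limit hypothesis (a sketch; the project name is a placeholder): start several runs back to back with exponential backoff and see whether later attempts behave differently. Note this only catches raised errors, not silent hangs:

```python
# Probe for rate limiting (a sketch; the project name is a placeholder): start
# a few runs back to back with exponential backoff between failures. If early
# attempts succeed and later ones fail, rate limiting is plausible.
import time
import wandb

for attempt in range(5):
    try:
        run = wandb.init(project="my-project", name=f"rate-limit-probe-{attempt}")
        run.finish()
        print(f"attempt {attempt}: ok")
    except Exception as e:
        wait = 2**attempt
        print(f"attempt {attempt}: {type(e).__name__}: {e}; retrying in {wait}s")
        time.sleep(wait)
```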
Proposed next steps for debugging:
- Check if we can train and upload even a single model (note that syncing continues after […])
- If we can, but cannot train in parallel, it appears to be a wandb problem. Potential next steps:
  - Write performance to disk and drop wandb support (see the sketch after this list)
  - Switch to another provider (local/remote)
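If we drop wandb, the fallback could be as simple as appending metrics to a JSONL file per run; a minimal sketch (paths and metric names are hypothetical):

```python
# Minimal disk-based metric logging (a sketch; paths and metric names are
# hypothetical). Each call appends one JSON row to <run_dir>/metrics.jsonl.
import json
import time
from pathlib import Path

def log_metrics(run_dir: Path, metrics: dict) -> None:
    run_dir.mkdir(parents=True, exist_ok=True)
    row = {"timestamp": time.time(), **metrics}
    with (run_dir / "metrics.jsonl").open("a") as f:
        f.write(json.dumps(row) + "\n")

log_metrics(Path("runs/model-0"), {"roc_auc": 0.78, "lookahead": 30})
```

A middle ground would be wandb's offline mode (`WANDB_MODE=offline`) with a later `wandb sync`, which keeps the dashboards but decouples training from the API.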
If anyone is up for debugging, they're more than welcome to go ahead and collect thoughts here! @HLasse, @sarakolding, @signekb, @bokajgd, @erikperfalk.