clessig / atmorep

AtmoRep model code
MIT License

Enable wandb in online mode during training on JSC #43

Open nish03 opened 2 months ago

nish03 commented 2 months ago

Is there any particular reason why wandb is currently in offline mode during training? Is it because AtmoRep runs into syncing issues with the wandb server during training?

clessig commented 2 months ago

juwels-booster compute nodes don't have internet access; runs will fail in online mode. There's probably a way to tunnel out of the compute nodes (according to Stefan Kesselheim) but I never got around to testing and automating it. @grassesi, maybe Michael or you or Stefan's group can look into this at some point.
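For reference, here is a minimal sketch of how the mode could be chosen automatically rather than hard-coded. It is not AtmoRep's actual code. The helper names `has_internet` and `select_wandb_mode` are hypothetical, and it assumes that a reachable `api.wandb.ai` on port 443 is a good enough proxy for "online mode will work". `WANDB_MODE` is the environment variable wandb reads to decide between `online` and `offline`.

```python
import os
import socket


def has_internet(host="api.wandb.ai", port=443, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: a successful connection only suggests that
    wandb could reach its server from this node.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def select_wandb_mode(online_available):
    """Export WANDB_MODE based on connectivity and return the chosen mode."""
    mode = "online" if online_available else "offline"
    os.environ["WANDB_MODE"] = mode
    return mode


if __name__ == "__main__":
    # Call this before wandb.init(...), which picks up WANDB_MODE
    # from the environment.
    mode = select_wandb_mode(has_internet())
    print(f"WANDB_MODE={mode}")
```

Runs logged in offline mode are written to a local directory and can be uploaded later with `wandb sync` from a machine that has internet access.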

grassesi commented 2 weeks ago

I don't know. As far as I know, there is no way to open outgoing SSH connections directly from the HPC system (neither from compute nodes nor from login nodes), due to security concerns.

iluise commented 2 weeks ago

Can we close the issue and add a reference to this discussion in the Wiki, so that people who wonder about this can find the thread?

clessig commented 2 weeks ago

According to Stefan Kesselheim, there is a way to tunnel out of the compute nodes to have wandb run in online mode. It's not portable but if most of development is happening at JSC then it might still be worth deploying.

@grassesi: maybe you can talk to Stefan Kesselheim?

Renamed the issue and leaving it open.