Open kriangkraitan opened 1 year ago
Rationale: wandb on Lanta is forced to run in offline mode. We need to manually run the `wandb sync` command on the wandb log folder to upload the latest training results to the wandb cloud. This makes monitoring the LLM training pipeline difficult.
We want to automate this by syncing the wandb-offline folder from the Lanta frontend node every 30 minutes, using a Python script running in tmux.
Step by step:
1. Install tmux on Lanta using an EasyBuild local module: [https://thaisc.atlassian.net/wiki/spaces/UG/pages/159350813/local+module+TARA+Cluster](https://thaisc.atlassian.net/wiki/spaces/UG/pages/159350813/local+module+TARA+Cluster)
2. Write a Python/bash script that executes the command `wandb sync
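
The periodic sync described in the steps above could be sketched roughly as follows. This is a minimal illustration, not the script from the PR: it assumes offline runs live in directories named `offline-run-*` (wandb's default naming) under a log folder passed on the command line, and the function names and the 30-minute constant are made up for this example.

```python
import subprocess
import sys
import time
from pathlib import Path

# Interval taken from the issue text (every 30 minutes).
SYNC_INTERVAL_SECONDS = 30 * 60


def find_offline_runs(log_dir):
    """Return offline run directories, assumed to be named offline-run-*."""
    return sorted(p for p in Path(log_dir).glob("offline-run-*") if p.is_dir())


def sync_once(log_dir, runner=subprocess.run):
    """Run `wandb sync <run_dir>` for each offline run; return the commands issued."""
    commands = []
    for run_dir in find_offline_runs(log_dir):
        cmd = ["wandb", "sync", str(run_dir)]
        # check=False so that one failed upload does not stop the whole loop.
        runner(cmd, check=False)
        commands.append(cmd)
    return commands


def run_forever(log_dir, interval=SYNC_INTERVAL_SECONDS):
    """Sync all offline runs, sleep for the interval, and repeat."""
    while True:
        sync_once(log_dir)
        time.sleep(interval)


# Only start the loop when a log folder is given as an argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    run_forever(sys.argv[1])
```

Saved under a hypothetical name such as `sync_wandb.py`, this could be left running on the frontend node inside a detached tmux session, for example `tmux new-session -d -s wandb-sync 'python sync_wandb.py /path/to/wandb'`. The `runner` parameter is there only so the command construction can be checked without calling wandb.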
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (5ff5762) 64.47% compared to head (f92f53b) 19.39%. Report is 1 commits behind head on main.
Why this PR
Add a script and documentation for wandb sync.
Changes
Related Issues
Close #
Checklist