OpenThaiGPT / openthaigpt-pretraining

Apache License 2.0

add script and doc for wandb sync #325

Open kriangkraitan opened 10 months ago

kriangkraitan commented 10 months ago

Why this PR

Adds a script and documentation for automatically syncing offline wandb logs from Lanta to the wandb cloud.

Changes

Related Issues

Close #

Checklist

linear[bot] commented 10 months ago
LM-207 Set up wandb auto sync in Lanta

Rationale: Running wandb on Lanta is forced into offline mode, so we have to manually run the `wandb sync` command on the wandb log folder to upload the latest training results to the wandb cloud. This makes monitoring the LLM training pipeline difficult. We want to automate this by having a Python script running in tmux on the Lanta frontend node sync the wandb offline folder every 30 minutes.

Step by step:

1. Install tmux on Lanta using an EasyBuild local module: [https://thaisc.atlassian.net/wiki/spaces/UG/pages/159350813/local+module+TARA+Cluster](https://thaisc.atlassian.net/wiki/spaces/UG/pages/159350813/local+module+TARA+Cluster)
2. Write a Python/bash script that executes the `wandb sync` command every 30 minutes (see the sketch after this card).
3. Run a simple training script ([https://github.com/wandb/examples/blob/master/examples/pytorch-lightning/mnist.py](https://github.com/wandb/examples/blob/master/examples/pytorch-lightning/mnist.py)) on one Lanta GPU device with wandb in offline mode (`export WANDB_MODE=offline`).
4. Run the wandb sync script inside tmux.
5. Check the results after 30 minutes and after 60 minutes.

Definition of done:

* wandb is updated at 30 minutes and 60 minutes as intended.
* The auto-sync script uses the same environment as the OpenThaiGPT model: [https://github.com/OpenThaiGPT/openthaigpt-pretraining/tree/main/src/model](https://github.com/OpenThaiGPT/openthaigpt-pretraining/tree/main/src/model)
* A PR is opened that adds a new folder called wandb sync under [https://github.com/OpenThaiGPT/openthaigpt-pretraining/tree/main/src/model/scripts](https://github.com/OpenThaiGPT/openthaigpt-pretraining/tree/main/src/model/scripts)
* A README explains how to install tmux and run the auto-sync script in it.

Out of scope:

* Actually finishing training the model.

Tips:

* Installing tmux can take time if you select the gcc variant; `ml load ncurses` can be used instead of gcc to speed up installation of a non-gcc tmux variant.
* peerawat.roj may have an easier way to install tmux than this card's instructions.

Requester: new17353
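
For reference, here is a minimal sketch of what step 2 could look like; it is not the script from this PR. It assumes the default `wandb/` offline log layout (`offline-run-*` directories) and an illustrative file name `wandb_auto_sync.py`, and it shells out to the real `wandb sync` CLI every 30 minutes:

```python
"""Minimal auto-sync sketch (assumed layout, not the PR's actual script).

Assumes offline runs land in the default wandb/ folder as
wandb/offline-run-<timestamp>-<id> directories.
"""
import glob
import subprocess
import time

WANDB_DIR = "wandb"          # assumed: default offline log folder
INTERVAL_SECONDS = 30 * 60   # sync every 30 minutes, per the card

while True:
    # Find all offline run directories produced under WANDB_MODE=offline.
    for run_dir in glob.glob(f"{WANDB_DIR}/offline-run-*"):
        # `wandb sync <dir>` uploads one offline run to the wandb cloud.
        # check=False so one failed upload doesn't kill the loop.
        subprocess.run(["wandb", "sync", run_dir], check=False)
    time.sleep(INTERVAL_SECONDS)
```

A loop like this would be left running on the frontend node inside tmux, e.g. `tmux new -s wandb-sync`, then `python wandb_auto_sync.py`, then detach with `Ctrl-b d`, matching step 4 of the card.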

codecov[bot] commented 10 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison: base (5ff5762) at 64.47% vs. head (f92f53b) at 19.39%. The report is 1 commit behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #325       +/-   ##
===========================================
- Coverage   64.47%   19.39%   -45.08%
===========================================
  Files          11       25       +14
  Lines         425     1392      +967
===========================================
- Hits          274      270        -4
- Misses        151     1122      +971
```

| [Flag](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/325/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/325/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT) | `19.39% <ø> (-45.08%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT#carryforward-flags-in-the-pull-request-comment) to find out more.

[See 36 files with indirect coverage changes](https://app.codecov.io/gh/OpenThaiGPT/openthaigpt-pretraining/pull/325/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=OpenThaiGPT)

:umbrella: View full report in Codecov by Sentry.