facebookresearch / dora

Dora is an experiment management framework. It expresses grid searches as pure python files as part of your repo. It identifies experiments with a unique hash signature. Scale up to hundreds of experiments without losing your sanity.
MIT License

Cannot run on multiple machines with multiple GPUs #53

Open Maggione opened 1 year ago

Maggione commented 1 year ago

❓ Questions

I am trying to run the program on multiple machines with multiple GPUs, but at runtime the code finds all the machines and yet uses only one GPU on each of them. Do I need to add extra configuration to run it properly?

/home/fay.cyf/shiyi.zxh/miniconda3/envs/musicgen/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Dora directory: /home/fay.cyf/weixipin.wxp/audiocraft/audiocraft_root
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split valid: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split evaluate: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split generate: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][root][INFO] - Getting pretrained compression model from HF facebook/encodec_32khz
[08-18 02:46:53][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[08-18 02:46:53][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[08-18 02:46:53][dora.distrib][INFO] - Distributed init: 0/2 (local 0) from env
[08-18 02:46:53][flashy.solver][INFO] - Instantiating solver MusicGenSolver for XP 9521b0af
[08-18 02:46:53][flashy.solver][INFO] - All XP logs are stored in /home/fay.cyf/weixipin.wxp/audiocraft/audiocraft_root/xps/9521b0af
/home/fay.cyf/shiyi.zxh/miniconda3/envs/musicgen/lib/python3.9/site-packages/flashy/loggers/tensorboard.py:47: UserWarning: tensorboard package was not found: use pip install tensorboard
  warnings.warn("tensorboard package was not found: use pip install tensorboard")
[08-18 02:46:53][audiocraft.solvers.builders][INFO] - Loading audio data split train: /home/fay.cyf/weixipin.wxp/audiocraft/egs/data
[08-18 02:47:34][flashy.solver][INFO] - Compression model has 4 codebooks with 2048 cardinality, and a framerate of 50
[08-18 02:47:34][audiocraft.modules.conditioners][INFO] - T5 will be evaluated with autocast as float32
[08-18 02:47:51][audiocraft.optim.dadam][INFO] - Using decoupled weight decay
[08-18 02:47:53][flashy.solver][INFO] - Model hash: e7554e7f9d6cc2dea51bd31aa3e89765bc73d1dd
[08-18 02:47:53][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 10 updates
[08-18 02:47:53][flashy.solver][INFO] - Model size: 420.37 M params
[08-18 02:47:53][flashy.solver][INFO] - Base memory usage, with model, grad and optim: 6.73 GB
[08-18 02:47:53][flashy.solver][INFO] - Restoring weights and history.
[08-18 02:47:53][flashy.solver][INFO] - Loading a pretrained model. Ignoring 'load_best' and 'ignore_state_keys' params.
[08-18 02:48:00][flashy.solver][INFO] - Checkpoint source is not the current xp: Load state_dict from best state.
[08-18 02:48:00][flashy.solver][INFO] - Ignoring keys when loading best []
damo-pod7-0129:1:1 [0] NCCL INFO Bootstrap : Using bond0:33.57.143.227<0>
damo-pod7-0129:1:1 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
damo-pod7-0129:1:1 [0] NCCL INFO cudaDriverVersion 11070
NCCL version 2.14.3+cuda11.7
[08-18 02:48:06][flashy.solver][INFO] - Model hash: 776d041cbbcb8973c4968782a79f9bb63b53a727
[08-18 02:48:04][flashy.solver][INFO] - Re-initializing EMA from best state
[08-18 02:48:04][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 10 updates
[08-18 02:48:03][flashy.solver][INFO] - Loading state_dict from best state.
adefossez commented 1 year ago

Can you check that you do indeed see all the GPUs when running Python, e.g. by checking `torch.cuda.device_count()`?

adefossez commented 1 year ago

By the way, you do need to run one process per GPU, even on a single machine.
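To make the one-process-per-GPU point concrete: with the usual `torch.distributed` environment-variable convention (`RANK`, `LOCAL_RANK`, `WORLD_SIZE` — the convention dora's "Distributed init ... from env" line also reflects), the log above shows a world size of 2 for two machines, i.e. only one process per machine. A hypothetical helper enumerating what each process should see:

```python
def expected_ranks(num_machines: int, gpus_per_machine: int) -> list[dict]:
    """Enumerate the (RANK, LOCAL_RANK, WORLD_SIZE) each process should
    see when running one process per GPU across several machines."""
    world_size = num_machines * gpus_per_machine
    return [
        {"RANK": machine * gpus_per_machine + local,
         "LOCAL_RANK": local,
         "WORLD_SIZE": world_size}
        for machine in range(num_machines)
        for local in range(gpus_per_machine)
    ]

# Two machines with 8 GPUs each should give WORLD_SIZE 16 and log
# lines like "Distributed init: 0/16", "1/16", ...; the "0/2" in the
# log above means only one process was launched per machine.
```

So the fix is on the launcher side: spawn `gpus_per_machine` processes per node (e.g. via your cluster launcher) rather than one, so each GPU gets its own process.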