facebookresearch / dora

Dora is an experiment management framework. It expresses grid searches as pure python files as part of your repo. It identifies experiments with a unique hash signature. Scale up to hundreds of experiments without losing your sanity.
MIT License

Cannot run on multiple machines with multiple GPUs #53

Open Maggione opened 1 year ago

Maggione commented 1 year ago

❓ Questions

I am trying to run the program on multiple machines with multiple GPUs, but at runtime the code finds all the machines and yet uses only one GPU on each of them. Do I need to add extra configuration to run it properly?

/home/fay.cyf/shiyi.zxh/miniconda3/envs/musicgen/lib/python3.9/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Dora directory: /home/fay.cyf/weixipin.wxp/audiocraft/audiocraft_root
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split valid: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split evaluate: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][audiocraft.solvers.builders][INFO] - Loading audio data split generate: /home/fay.cyf/weixipin.wxp/audiocraft/egs/example
[08-18 02:46:54][root][INFO] - Getting pretrained compression model from HF facebook/encodec_32khz
[08-18 02:46:53][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[08-18 02:46:53][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[08-18 02:46:53][dora.distrib][INFO] - Distributed init: 0/2 (local 0) from env
[08-18 02:46:53][flashy.solver][INFO] - Instantiating solver MusicGenSolver for XP 9521b0af
[08-18 02:46:53][flashy.solver][INFO] - All XP logs are stored in /home/fay.cyf/weixipin.wxp/audiocraft/audiocraft_root/xps/9521b0af
/home/fay.cyf/shiyi.zxh/miniconda3/envs/musicgen/lib/python3.9/site-packages/flashy/loggers/tensorboard.py:47: UserWarning: tensorboard package was not found: use pip install tensorboard
  warnings.warn("tensorboard package was not found: use pip install tensorboard")
[08-18 02:46:53][audiocraft.solvers.builders][INFO] - Loading audio data split train: /home/fay.cyf/weixipin.wxp/audiocraft/egs/data
[08-18 02:47:34][flashy.solver][INFO] - Compression model has 4 codebooks with 2048 cardinality, and a framerate of 50
[08-18 02:47:34][audiocraft.modules.conditioners][INFO] - T5 will be evaluated with autocast as float32
[08-18 02:47:51][audiocraft.optim.dadam][INFO] - Using decoupled weight decay
[08-18 02:47:53][flashy.solver][INFO] - Model hash: e7554e7f9d6cc2dea51bd31aa3e89765bc73d1dd
[08-18 02:47:53][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 10 updates
[08-18 02:47:53][flashy.solver][INFO] - Model size: 420.37 M params
[08-18 02:47:53][flashy.solver][INFO] - Base memory usage, with model, grad and optim: 6.73 GB
[08-18 02:47:53][flashy.solver][INFO] - Restoring weights and history.
[08-18 02:47:53][flashy.solver][INFO] - Loading a pretrained model. Ignoring 'load_best' and 'ignore_state_keys' params.
[08-18 02:48:00][flashy.solver][INFO] - Checkpoint source is not the current xp: Load state_dict from best state.
[08-18 02:48:00][flashy.solver][INFO] - Ignoring keys when loading best []
damo-pod7-0129:1:1 [0] NCCL INFO Bootstrap : Using bond0:33.57.143.227<0>
damo-pod7-0129:1:1 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
damo-pod7-0129:1:1 [0] NCCL INFO cudaDriverVersion 11070
NCCL version 2.14.3+cuda11.7
[08-18 02:48:06][flashy.solver][INFO] - Model hash: 776d041cbbcb8973c4968782a79f9bb63b53a727
[08-18 02:48:04][flashy.solver][INFO] - Re-initializing EMA from best state
[08-18 02:48:04][flashy.solver][INFO] - Initializing EMA on the model with decay = 0.99 every 10 updates
[08-18 02:48:03][flashy.solver][INFO] - Loading state_dict from best state.
adefossez commented 1 year ago

Can you check that you do indeed see all the GPUs when running Python, e.g. by checking `torch.cuda.device_count()`?

adefossez commented 1 year ago

By the way, you do need to run one process per GPU, even on a single machine.
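To make the one-process-per-GPU point concrete: with the usual `torch.distributed` environment-variable convention (`RANK`, `LOCAL_RANK`, `WORLD_SIZE` — the convention dora's "Distributed init ... from env" line also reflects), the log above shows a world size of 2 for two machines, i.e. only one process per machine. A hypothetical helper enumerating what each process should see:

```python
def expected_ranks(num_machines: int, gpus_per_machine: int) -> list[dict]:
    """Enumerate the (RANK, LOCAL_RANK, WORLD_SIZE) each process should
    see when running one process per GPU across several machines."""
    world_size = num_machines * gpus_per_machine
    return [
        {"RANK": machine * gpus_per_machine + local,
         "LOCAL_RANK": local,
         "WORLD_SIZE": world_size}
        for machine in range(num_machines)
        for local in range(gpus_per_machine)
    ]

# Two machines with 8 GPUs each should give WORLD_SIZE 16 and log
# lines like "Distributed init: 0/16", "1/16", ...; the "0/2" in the
# log above means only one process was launched per machine.
```

So the fix is on the launcher side: spawn `gpus_per_machine` processes per node (e.g. via your cluster launcher) rather than one, so each GPU gets its own process.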