PatrickESA opened 2 days ago
In case additional platform information is relevant:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:00:05.0 Off | 0 |
| 30% 37C P8 27W / 300W | 0MiB / 45634MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 On | 00000000:00:06.0 Off | 0 |
| 30% 36C P8 29W / 300W | 0MiB / 45634MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
This seems to be similar to what was fixed here: https://github.com/ecmwf/anemoi-training/pull/82#issue-2581613296
Can you try running without --config-dir and report back? Just cd into the directory where happy_little_config.yaml exists, and then:
HYDRA_FULL_ERROR=1 ANEMOI_BASE_SEED=1 anemoi-training train --config-name=happy_little_config
We did fix multi-GPU training in #82, and that fix is merged into 0.3, but we may have a regression somewhere.
Thanks for the reference; this is insightful. However, doing a cd to .../lib/python3.11/site-packages/anemoi/training/config
and running HYDRA_FULL_ERROR=1 ANEMOI_BASE_SEED=1 anemoi-training train --config-name=happy_little_config
still gives the same error as in the original post. Let me know what additional info might be useful; some of the relevant packages in my environment are:
anemoi-training      0.3.0    pypi_0    pypi
anemoi-utils         0.4.8    pypi_0    pypi
hydra-core           1.3.2    pypi_0    pypi
pytorch-lightning    2.4.0    pypi_0    pypi
Thanks, will investigate and get back to you asap.
What happened?
Launching
HYDRA_FULL_ERROR=1 ANEMOI_BASE_SEED=1 anemoi-training train --config-name happy_little_config --config-dir=/pathToConfigs/config
for training models in a multi-GPU setup (on a single machine, outside of a SLURM environment) fails to launch sub-processes successfully. Specifically, the first sub-process is initiated properly, but subsequent processes error out (see: Relevant log output). Training runs successfully with a single-GPU setup but fails when using multiple devices. The desired behavior is for multi-GPU training to be feasible outside of a SLURM environment. The issue might require further investigation, but it may be related to this or that.
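For context, PyTorch Lightning's subprocess-based DDP strategy starts the extra ranks by re-executing rank 0's command line in child processes, so every flag passed to rank 0 (e.g. --config-dir) is parsed again inside each child. Below is a minimal sketch of that mechanism; launch_children is a hypothetical helper for illustration, not Lightning's actual implementation:

import os
import subprocess
import sys

def launch_children(num_devices: int) -> None:
    # Sketch of a subprocess-based DDP launcher: rank 0 re-executes its own
    # command line for ranks 1..N-1 (illustrative only, not Lightning's code).
    for local_rank in range(1, num_devices):
        env = os.environ.copy()
        env["LOCAL_RANK"] = str(local_rank)  # the child reads its rank from the env
        # If the re-executed command contains arguments the child's entry point
        # does not recognize, that child exits with a parsing error while rank 0
        # keeps running, which matches the behaviour described above.
        subprocess.Popen(sys.argv, env=env)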
What are the steps to reproduce the bug?
On bare metal (i.e., without using any resource scheduler):
In the configurations, if using a single-device hardware setting, the process launches successfully and trains as expected. However, when switching to either of the multi-device settings tried, the process crashes during creation of the child processes, which fail while parsing arguments they do not recognize (a hedged guess at the relevant config keys is sketched below).
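For concreteness, here is a minimal sketch of the kind of hardware setting that typically toggles this behaviour; the hardware group and the num_nodes/num_gpus_per_node key names are assumptions for illustration, not a verified excerpt of the reporter's config:

# Hypothetical hardware settings (key names assumed, not taken from
# happy_little_config.yaml):
hardware:
  num_nodes: 1
  num_gpus_per_node: 1    # single-GPU: launches and trains as expected
  # num_gpus_per_node: 2  # multi-GPU: child processes crash on unrecognized arguments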
Version
anemoi-training 0.3.0 (from pip)
Platform (OS and architecture)
Linux eohpc-phigpu27 5.4.0-125-generic #141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Relevant log output
Accompanying data
No response
Organisation
No response