ecmwf / anemoi-training

Apache License 2.0

Bare-metal multi-GPU training fails launching subprocesses due to unexpected args #151

Open PatrickESA opened 2 days ago

PatrickESA commented 2 days ago

What happened?

Launching

HYDRA_FULL_ERROR=1 ANEMOI_BASE_SEED=1 anemoi-training train --config-name happy_little_config --config-dir=/pathToConfigs/config

for training models in a multi-GPU setup (on a single machine, outside of a SLURM environment) fails to launch the sub-processes successfully. Specifically, the first sub-process is initiated properly, but the subsequent processes error out (see: Relevant log output). Training runs successfully with a single-GPU setup but fails when using multiple devices. The desired behavior is for multi-GPU training to be feasible outside of a SLURM environment. The issue might require further investigation, but it may be related to this or that.
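
For illustration, a minimal, hypothetical sketch of the failure mode the log points at (this is not the actual anemoi-training CLI): when the DDP launcher re-executes the command for the second rank and appends Hydra overrides, an argument parser that does not forward unknown tokens rejects them with exactly this kind of "unrecognized arguments" error.

import argparse

# Toy wrapper (hypothetical, not the real anemoi-training CLI) that only knows
# its own options and does not pass unknown tokens through to Hydra.
parser = argparse.ArgumentParser(prog="toy-train")
parser.add_argument("--config-name")
parser.add_argument("--config-dir")

# Simulated argv of the relaunched rank-1 child: the original options plus the
# overrides visible in the error message further down.
child_argv = [
    "--config-name=happy_little_config",
    'hydra.run.dir="outputs/2024-11-19/16-43-03"',
    "hydra.job.name=train_ddp_process_1",
]

# parser.parse_args(child_argv) would exit with
# "error: unrecognized arguments: hydra.run.dir=... hydra.job.name=...",
# i.e. the same failure mode as in the log. parse_known_args() instead returns
# the unknown tokens so they could be forwarded.
args, unknown = parser.parse_known_args(child_argv)
print("tokens that would need forwarding:", unknown)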

What are the steps to reproduce the bug?

On bare metal (i.e., without using any resource scheduler):

In the configurations, if using

hardware:
    num_gpus_per_node: 1
    num_nodes: 1
    num_gpus_per_model: 1

the process launches successfully and trains as expected. However, when changing to

hardware:
    num_gpus_per_node: 2
    num_nodes: 1
    num_gpus_per_model: 1

or

hardware:
    num_gpus_per_node: 2
    num_nodes: 1
    num_gpus_per_model: 2

the run crashes while the child processes are being created: the children fail while parsing arguments they do not recognize.
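
For orientation, a small sketch of what the three hardware blocks above imply for the process layout. The layout rule below is an assumption based on the log output, not taken from the anemoi-training source; the parameter names simply mirror the config keys.

# Hedged sketch: assumed process layout per hardware block (one process per GPU
# per node, model-parallel groups of size num_gpus_per_model).
def process_layout(num_gpus_per_node: int, num_nodes: int, num_gpus_per_model: int) -> dict:
    world_size = num_gpus_per_node * num_nodes
    return {
        "world_size": world_size,
        # extra child processes the launcher must spawn on this node
        "children_spawned_on_node": num_gpus_per_node - 1,
        "model_parallel_groups": world_size // num_gpus_per_model,
    }

for cfg in [(1, 1, 1), (2, 1, 1), (2, 1, 2)]:
    print(cfg, "->", process_layout(*cfg))

Only the two multi-GPU variants require an extra child process on the node, and it is that relaunched child which dies while parsing its arguments.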

Version

anemoi-training 0.3.0 (from pip)

Platform (OS and architecture)

Linux eohpc-phigpu27 5.4.0-125-generic #141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Relevant log output

...
[2024-11-19 16:44:28,847][anemoi.models.layers.attention][WARNING] - Flash attention not available, falling back to pytorch scaled_dot_product_attention
[2024-11-19 16:44:33,689][anemoi.training.train.forecaster][INFO] - Pressure level scaling: use scaler ReluPressureLevelScaler with slope 0.0010 and minimum 0.20 

[2024-11-19 16:44:35,481][lightning_fabric.utilities.distributed][INFO] - Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
usage: .__main__.py-train [--help] [--hydra-help] [--version] [--cfg {job,hydra,all}] [--resolve] [--package PACKAGE] [--run] [--multirun] [--shell-completion] [--config-path CONFIG_PATH] [--config-name CONFIG_NAME] [--config-dir CONFIG_DIR] [--experimental-rerun EXPERIMENTAL_RERUN] [--info [{all,config,defaults,defaults-tree,plugins,searchpath}]] [overrides ...]
.anemoi-training-train: error: unrecognized arguments: hydra.run.dir="outputs/2024-11-19/16-43-03" hydra.job.name=train_ddp_process_1 hyra.output_subdir=null

[2024-11-19 16:44:43,698][lightning_fabric.strategies.launchers.subprocess_script][INFO] - [rank: 1] Child process with PID 763811 terminated with code 2. Forcefully terminating all other processes to avoid zombies 🧟 
Killed

Accompanying data

No response

Organisation

No response

PatrickESA commented 2 days ago

In case additional platform information is relevant:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:00:05.0 Off |                    0 |
| 30%   37C    P8    27W / 300W |      0MiB / 45634MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    On   | 00000000:00:06.0 Off |                    0 |
| 30%   36C    P8    29W / 300W |      0MiB / 45634MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

ssmmnn11 commented 2 days ago

Seems to be similar to what was fixed here: https://github.com/ecmwf/anemoi-training/pull/82#issue-2581613296

gmertes commented 1 day ago

Can you try running without --config-dir and report back:

Just cd into the directory where happy_little_config.yaml exists, and then:

HYDRA_FULL_ERROR=1 ANEMOI_BASE_SEED=1 anemoi-training train --config-name=happy_little_config

We did fix multi-GPU training in #82 and that is merged into 0.3, but we may have a regression somewhere.
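
If it helps to narrow this down, a throwaway debugging aid along these lines (hypothetical, not part of anemoi-training; that the launcher sets LOCAL_RANK for child processes is an assumption) could be dropped at the very top of the entry point or into a sitecustomize.py to record exactly which tokens each rank is started with:

# Hypothetical debugging aid, not part of anemoi-training: write out the argv
# each process receives, one file per (assumed) LOCAL_RANK, so the extra tokens
# passed to the rank-1 child can be inspected directly.
import os
import sys
from pathlib import Path

rank = os.environ.get("LOCAL_RANK", "0")  # assumption: set by the DDP launcher for children
Path(f"argv_rank_{rank}.txt").write_text(" ".join(sys.argv) + "\n")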

PatrickESA commented 1 day ago

Thanks for the reference, this is insightful. However, after cd-ing to .../lib/python3.11/site-packages/anemoi/training/config and running HYDRA_FULL_ERROR=1 ANEMOI_BASE_SEED=1 anemoi-training train --config-name=happy_little_config, I still get the same error as in the original post. Just let me know what additional info might be useful; some of the relevant packages in my environment are:

anemoi-training    0.3.0   pypi_0   pypi
anemoi-utils       0.4.8   pypi_0   pypi
hydra-core         1.3.2   pypi_0   pypi
pytorch-lightning  2.4.0   pypi_0   pypi
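
If more version detail is useful, a generic standard-library snippet like this (not an anemoi utility) prints the installed versions of the packages most likely to be involved:

# Generic helper (standard library only): print installed versions of the
# packages most relevant to this report; missing packages are reported as such.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("anemoi-training", "anemoi-models", "anemoi-utils",
            "hydra-core", "pytorch-lightning", "lightning", "torch"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")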

gmertes commented 1 hour ago

Thanks, will investigate and get back to you asap.