ACEsuit / mace

MACE - Fast and accurate machine learning interatomic potentials with higher order equivariant message passing.

How to use multi-GPU training without a Slurm system? #458

Open stargolike opened 3 weeks ago

stargolike commented 3 weeks ago

Hello dear developers, I run this script:

python /root/mace/scripts/run_train.py --name="MACE_model" \
    --train_file="train.xyz" \
    --valid_fraction=0.05 \
    --test_file="test.xyz" \
    --config_type_weights='{"Default":1.0}' \
    --model="MACE" \
    --hidden_irreps='128x0e + 128x1o' \
    --r_max=5.0 \
    --batch_size=10 \
    --energy_key="energy" \
    --forces_key="forces" \
    --max_num_epochs=100 \
    --swa \
    --start_swa=80 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --restart_latest \
    --device=cuda

But my computer has two 4090 GPUs and I have not installed Slurm, so this error occurred:

ERROR:root:Failed to initialize distributed environment: 'SLURM_JOB_NODELIST'

How can I solve this problem?

ilyes319 commented 2 weeks ago

If it is just on a single node, you can use the interface in the documentation: https://mace-docs.readthedocs.io/en/latest/guide/multigpu.html (see the single-node section).

stargolike commented 2 weeks ago

If it is just on a single node, you can use the interface in the documentation: https://mace-docs.readthedocs.io/en/latest/guide/multigpu.html (see the single-node section).

Thanks for your reply. I used the tutorial to change my setup, but I ran into a new problem:

Wed Jun 12 19:38:53 CST 2024
W0612 19:39:23.097889 140291182806208 torch/distributed/run.py:757] 
W0612 19:39:23.097889 140291182806208 torch/distributed/run.py:757] *****************************************
W0612 19:39:23.097889 140291182806208 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0612 19:39:23.097889 140291182806208 torch/distributed/run.py:757] *****************************************
ERROR:root:Failed to initialize distributed environment: 'SLURM_JOB_NODELIST'
ERROR:root:Failed to initialize distributed environment: 'SLURM_JOB_NODELIST'

I'm running:

source activate mace
torchrun --standalone --nnodes=1 --nproc_per_node=2 mace/mace/cli/run_train.py  --config="config.yaml"

and this is the config.yaml:

name: MACE_model
config_type_weights: {"Default":1.0}
model: "MACE"
hidden_irreps: '128x0e + 128x1o'
r_max: 5.0
train_file: train.xyz
test_file: test.xyz
valid_fraction: 0.05
batch_size: 10
energy_key: "energy"
forces_key: "forces"
swa: yes
start_swa: 80
ema: yes
ema_decay: 0.99 
amsgrad: yes
restart_latest: yes
max_num_epochs: 100
device: cuda 
loss: "huber"
distributed: yes

ilyes319 commented 2 weeks ago

You should comment out the _setup_distr_env(self): function in mace/tools/slurm_distributed.py
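
For reference, a minimal sketch of what a torchrun-based replacement could look like; this is illustrative rather than the actual MACE code, and the class name here is hypothetical. It assumes the job is launched with torchrun, which already exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, so torch.distributed.init_process_group(backend="nccl") can read them directly:

import os


class TorchrunDistributedEnvironment:
    """Illustrative stand-in for MACE's DistributedEnvironment that reads
    the variables exported by torchrun instead of SLURM_* ones."""

    def __init__(self) -> None:
        # torchrun sets all of these for every worker it launches
        self.rank = int(os.environ["RANK"])
        self.local_rank = int(os.environ["LOCAL_RANK"])
        self.world_size = int(os.environ["WORLD_SIZE"])
        self.master_addr = os.environ.get("MASTER_ADDR", "localhost")
        self.master_port = os.environ.get("MASTER_PORT", "29500")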

stargolike commented 2 weeks ago

You should comment out the _setup_distr_env(self): function in mace/tools/slurm_distributed.py

I modified slurm_distributed.py in the installed mace package (located at /opt/miniconda/envs/mace/lib/python3.9/site-packages/mace/tools), and training now starts. But when I run multi-GPU training on the two 4090 GPUs (I want to train a bigger system, so I need more memory), I run into trouble: CUDA out of memory.

[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.43 GiB. GPU  has a total capacity of 23.65 GiB of which 586.25 MiB is free. Including non-PyTorch memory, this process has 23.05 GiB memory in use. Of the allocated memory 18.60 GiB is allocated by PyTorch, and 3.00 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
E0613 12:40:41.468833 140559193007296 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 9550) of binary: /opt/miniconda/envs/mace/bin/python

I wonder whether multi-GPU training is actually being enabled?

2024-06-13 12:40:09.875 INFO: CUDA version: 12.1, CUDA device: 0
2024-06-13 12:40:11.661 INFO: Using isolated atom energies from training file
2024-06-13 12:40:11.712 INFO: Loaded 768 training configurations from 'train.xyz'
2024-06-13 12:40:11.712 INFO: Using random 5.0% of training set for validation
2024-06-13 12:40:11.941 INFO: Since ASE version 3.23.0b1, using energy_key 'energy' is no longer safe when communicating 
2024-06-13 12:40:12.035 INFO: Loaded 192 test configurations from 'test.xyz'
2024-06-13 12:40:12.035 INFO: Total number of configurations: train=730, valid=38, tests=[Default: 192]
2024-06-13 12:40:12.057 INFO: AtomicNumberTable: (1, 8, 17, 30)
2024-06-13 12:40:12.057 INFO: Atomic energies: [-0.15222862, -0.08918347, -0.07295653, -0.01178265]
2024-06-13 12:40:15.365 INFO: WeightedHuberEnergyForcesStressLoss(energy_weight=1.000, forces_weight=100.000, stress_weight=1.000)
2024-06-13 12:40:17.220 INFO: Average number of neighbors: 52.435900568044566
2024-06-13 12:40:17.221 INFO: Selected the following outputs: {'energy': True, 'forces': True, 'virials': True, 'stress': True, 'dipoles': False}
2024-06-13 12:40:17.466 INFO: Building model
2024-06-13 12:40:17.466 INFO: Hidden irreps: 128x0e + 128x1o
stargolike commented 2 weeks ago

You should comment out the _setup_distr_env(self): function in mace/tools/slurm_distributed.py

I found https://github.com/ACEsuit/mace/pull/143 and wanted to use it to solve the problem, so I reinstalled the version from that PR, which uses Hugging Face accelerate, but it still fails to run.

The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_processes` was set to a value of `2`
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.

and it also runs out of memory:

[rank1]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.08 GiB. GPU  has a total capacity of 23.65 GiB of which 772.25 MiB is free. Including non-PyTorch memory, this process has 22.87 GiB memory in use. Of the allocated memory 19.08 GiB is allocated by PyTorch, and 2.34 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/input_lbg-19657-13216338/./mace/scripts/run_train.py", line 596, in <module>
[rank0]:     main()
[rank0]:   File "/input_lbg-19657-13216338/./mace/scripts/run_train.py", line 500, in main
[rank0]:     tools.train(
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/mace/tools/train.py", line 93, in train
[rank0]:     train_one_epoch(
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/mace/tools/train.py", line 234, in train_one_epoch
[rank0]:     _, opt_metrics = take_step(
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/mace/tools/train.py", line 262, in take_step
[rank0]:     output = model(
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
[rank0]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
[rank0]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/mace/modules/models.py", line 344, in forward
[rank0]:     forces, virials, stress = get_outputs(
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/mace/modules/utils.py", line 135, in get_outputs
[rank0]:     compute_forces(energy=energy, positions=positions, training=training),
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/mace/modules/utils.py", line 26, in compute_forces
[rank0]:     gradient = torch.autograd.grad(
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/autograd/__init__.py", line 412, in grad
[rank0]:     result = _engine_run_backward(
[rank0]:   File "/opt/miniconda/envs/multi_train/lib/python3.9/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.51 GiB. GPU 
W0613 23:39:31.075506 140154674787520 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 45322 closing signal SIGTERM
E0613 23:39:31.139620 140154674787520 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 45321) of binary: /opt/miniconda/envs/multi_train/bin/python

I also tested training with 4 GPUs, but a similar problem happened. I have tried both approaches and I still cannot get the two 4090 GPUs to run. Is this because MACE copies the complete model to each GPU during multi-GPU training?
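
Aside, based only on the allocator hint in the traceback above: with DistributedDataParallel each rank holds a full copy of the model and works on its own batches, so adding GPUs does not by itself lower per-GPU memory use. The usual knobs are lowering batch_size in config.yaml and trying the allocator setting the error message itself suggests, for example:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
torchrun --standalone --nnodes=1 --nproc_per_node=2 mace/mace/cli/run_train.py --config="config.yaml"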

svandenhaute commented 2 weeks ago

I created a modified train script for this which doesn't use the whole DistributedEnvironment thing.

See here. Essentially it comes down to setting the required environment variables manually:

import argparse
import os

import torch

from mace import tools


def main() -> None:
    """
    This script runs the training/fine tuning for mace
    """
    args = tools.build_default_arg_parser().parse_args()
    if args.distributed:
        # one worker process per visible GPU
        world_size = torch.cuda.device_count()
        import torch.multiprocessing as mp
        # mp.spawn passes the process index (rank) as the first argument to run()
        mp.spawn(run, args=(args, world_size), nprocs=world_size)
    else:
        run(0, args, 1)

def run(rank: int, args: argparse.Namespace, world_size: int) -> None:
    """
    This script runs the training/fine tuning for mace
    """
    tag = tools.get_tag(name=args.name, seed=args.seed)
    if args.distributed:
        # try:
        #     distr_env = DistributedEnvironment()
        # except Exception as e:  # pylint: disable=W0703
        #     logging.error(f"Failed to initialize distributed environment: {e}")
        #     return
        # world_size = distr_env.world_size
        # local_rank = distr_env.local_rank
        # rank = distr_env.rank
        # if rank == 0:
        #     print(distr_env)
        # torch.distributed.init_process_group(backend="nccl")
        local_rank = rank
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "12355"
        torch.cuda.set_device(rank)
        torch.distributed.init_process_group(
            backend='nccl',
            rank=rank,
            world_size=world_size,
        )
    else:
        pass
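
With this change the script is started as one plain Python process and mp.spawn does the per-GPU forking itself, so (assuming distributed stays enabled in the config) the launch would look something like:

python mace/mace/cli/run_train.py --config="config.yaml"
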
stargolike commented 2 weeks ago

I created a modified train script for this which doesn't use the whole DistributedEnvironment thing.

See here. Essentially it comes down to setting the required environment variables manually:

def main() -> None:
    """
    This script runs the training/fine tuning for mace
    """
    args = tools.build_default_arg_parser().parse_args()
    if args.distributed:
        world_size = torch.cuda.device_count()
        import torch.multiprocessing as mp
        mp.spawn(run, args=(args, world_size), nprocs=world_size)
    else:
        run(0, args, 1)

def run(rank: int, args: argparse.Namespace, world_size: int) -> None:
    """
    This script runs the training/fine tuning for mace
    """
    tag = tools.get_tag(name=args.name, seed=args.seed)
    if args.distributed:
        # try:
        #     distr_env = DistributedEnvironment()
        # except Exception as e:  # pylint: disable=W0703
        #     logging.error(f"Failed to initialize distributed environment: {e}")
        #     return
        # world_size = distr_env.world_size
        # local_rank = distr_env.local_rank
        # rank = distr_env.rank
        # if rank == 0:
        #     print(distr_env)
        # torch.distributed.init_process_group(backend="nccl")
        local_rank = rank
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "12355"
        torch.cuda.set_device(rank)
        torch.distributed.init_process_group(
            backend='nccl',
            rank=rank,
            world_size=world_size,
        )
    else:
        pass

Hello, I used your method to change the code, but some errors happened and I don't understand them.

[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:12355 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to 0.0.0.0:12355 (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.

and torch also raises an error:

torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:12355 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:12355 (errno: 98 - Address already in use).

I don't think this is a port problem, because even if I change to a port that nothing else has used before, I still get the same error.
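
For what it is worth, a quick way to check whether something is already listening on the chosen port (12355 in the snippet above) is:

ss -ltnp | grep 12355

If an earlier worker has already bound it, any second rendezvous on the same port fails with exactly this errno 98.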

svandenhaute commented 2 weeks ago

Sounds like you're starting it twice. Make sure to use python run_train.py <your_training_args> instead of torchrun?

stargolike commented 2 weeks ago

Sounds like you're starting it twice. Make sure to use python run_train.py <your_training_args> instead of torchrun?

Thanks for your reply. I changed my command to:

source activate mace
nvidia-smi
python mace/mace/cli/run_train.py  --config="config.yaml"

and it runs out of CUDA memory too:

  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
    fn(i, *args)
  File "/input_lbg-19657-13263068/mace/mace/cli/run_train.py", line 705, in run
    tools.train(
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/mace/tools/train.py", line 179, in train
    train_one_epoch(
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/mace/tools/train.py", line 289, in train_one_epoch
    _, opt_metrics = take_step(
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/mace/tools/train.py", line 319, in take_step
    output = model(
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/mace/modules/models.py", line 391, in forward
    forces, virials, stress = get_outputs(
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/mace/modules/utils.py", line 126, in get_outputs
    forces, virials, stress = compute_forces_virials(
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/mace/modules/utils.py", line 51, in compute_forces_virials
    forces, virials = torch.autograd.grad(
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/autograd/__init__.py", line 412, in grad
    result = _engine_run_backward(
  File "/opt/miniconda/envs/mace/lib/python3.9/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.45 GiB. GPU 

I am running on two 4090 GPUs.

ilyes319 commented 2 weeks ago

Can you share your new log file? It does not seem to be using the two GPUs.

stargolike commented 2 weeks ago

Can you share your new log file? It does not seem to be using the two GPUs.

Thanks, ilyes. Here is my log file: 1.log