MDIL-SNU / SevenNet

SevenNet - a graph neural network interatomic potential package supporting efficient multi-GPU parallel molecular dynamics simulations.
https://pubs.acs.org/doi/10.1021/acs.jctc.4c00190
GNU General Public License v3.0

ChildFailedError when continuing the fine-tune training based on MP pretrained model using 2 GPUs. #83

Closed: turbosonics closed this issue 2 months ago

turbosonics commented 2 months ago

I performed fine-tune training based on the MP pretrained model using 2 GPUs for a week (our local cluster has a 1-week wall-time limit). The following are the input file parameters for the fine-tune training.

model:
    chemical_species: 'Auto'
    cutoff: 5.0
    channel: 128
    is_parity: False
    lmax: 2
    num_convolution_layer: 5
    irreps_manual:
        - "128x0e"
        - "128x0e+64x1e+32x2e"
        - "128x0e+64x1e+32x2e"
        - "128x0e+64x1e+32x2e"
        - "128x0e+64x1e+32x2e"
        - "128x0e"

    weight_nn_hidden_neurons: [64, 64]
    radial_basis:
        radial_basis_name: 'bessel'
        bessel_basis_num: 8
    cutoff_function:
        cutoff_function_name: 'XPLOR'
        cutoff_on: 4.5

    act_gate: {'e': 'silu', 'o': 'tanh'}
    act_scalar: {'e': 'silu', 'o': 'tanh'}

    conv_denominator: "avg_num_neigh"
    train_shift_scale: True
    train_denominator: True
    self_connection_type: 'linear'

train:
    train_shuffle: True
    random_seed: 123
    is_train_stress: True
    epoch: 5000

    loss: 'Huber'
    loss_param:
        delta: 0.01

    optimizer: 'adam'
    optim_param:
        lr: 0.002
    scheduler: 'ReduceLROnPlateau'
    scheduler_param:
        factor: 0.5
        patience: 100
#        best_metric: TotalLoss
    force_loss_weight: 2.0
    stress_loss_weight: 0.1

    per_epoch: 10
    error_record:
        - ['Energy', 'RMSE']
        - ['Force', 'RMSE']
        - ['Stress', 'RMSE']
        - ['Energy', 'MAE']
        - ['Force', 'MAE']
        - ['Stress', 'MAE']
        - ['Energy', 'Loss']
        - ['Force', 'Loss']
        - ['Stress', 'Loss']
        - ['TotalLoss', 'None']

    continue:
        reset_optimizer: True
        reset_scheduler: True
        reset_epoch: True
        checkpoint: '${PTMODEL_DIR}/${PTMODEL_NAME}'
        # Set True to use shift, scale, and avg_num_neigh from checkpoint (highly recommended)
        use_statistic_values_of_checkpoint: True

data:
    data_shuffle: True
    batch_size: 6
    data_divide_ratio: 0.2
    data_format: 'ase'
    load_dataset_path: ['${GEOMETRY_DIR}/${GEOMETRY_TRAIN_NAME}']

In this case, the MP pretrained model used for the "checkpoint" option is the {SevenNet_directory}/sevenn/pretrained_potentials/SevenNet_0__11July2024/checkpoint_sevennet_0.pth file.

Now the training job has been killed by the server due to the 1-week wall-time limit, so I tried to restart (or continue) this training with the following input script:

model:
    chemical_species: 'Auto'
    cutoff: 5.0
    channel: 128
    is_parity: False
    lmax: 2
    num_convolution_layer: 5
    irreps_manual:
        - "128x0e"
        - "128x0e+64x1e+32x2e"
        - "128x0e+64x1e+32x2e"
        - "128x0e+64x1e+32x2e"
        - "128x0e+64x1e+32x2e"
        - "128x0e"

    weight_nn_hidden_neurons: [64, 64]
    radial_basis:
        radial_basis_name: 'bessel'
        bessel_basis_num: 8
    cutoff_function:
        cutoff_function_name: 'XPLOR'
        cutoff_on: 4.5

    act_gate: {'e': 'silu', 'o': 'tanh'}
    act_scalar: {'e': 'silu', 'o': 'tanh'}

    conv_denominator: "avg_num_neigh"
    train_shift_scale: True
    train_denominator: True
    self_connection_type: 'linear'

train:
    train_shuffle: True
    random_seed: 123
    is_train_stress: True
    epoch: 5000

    loss: 'Huber'
    loss_param:
        delta: 0.01

    optimizer: 'adam'
    optim_param:
        lr: 0.002
    scheduler: 'ReduceLROnPlateau'
    scheduler_param:
        factor: 0.5
        patience: 100
#        best_metric: TotalLoss
    force_loss_weight: 2.0
    stress_loss_weight: 0.1

    per_epoch: 10
    error_record:
        - ['Energy', 'RMSE']
        - ['Force', 'RMSE']
        - ['Stress', 'RMSE']
        - ['Energy', 'MAE']
        - ['Force', 'MAE']
        - ['Stress', 'MAE']
        - ['Energy', 'Loss']
        - ['Force', 'Loss']
        - ['Stress', 'Loss']
        - ['TotalLoss', 'None']

    continue:
        reset_optimizer: False
        reset_scheduler: False
        reset_epoch: False
        checkpoint: '${PTMODEL_DIR}/${PTMODEL_NAME}'
        # Set True to use shift, scale, and avg_num_neigh from checkpoint (highly recommended)
        use_statistic_values_of_checkpoint: True

data:
    data_shuffle: True
    batch_size: 6
    data_divide_ratio: 0.2
    data_format: 'ase'
    load_dataset_path: ['${GEOMETRY_DIR}/${GEOMETRY_TRAIN_NAME}']

In this case, the geometry file is the same as in the original fine-tuning run, but this time I used {First_fine_tune_directory}/checkpoint_best.pth for the "checkpoint" option, because this is a continuation of the first fine-tuning training.

I also changed three options, reset_optimizer, reset_scheduler, and reset_epoch, from True to False, and then submitted the job under the exact same Slurm conditions.

The command used to execute SevenNet is the same for the original fine-tuning job and the continuation job:

srun torchrun --nnodes 2 --nproc_per_node 1 --rdzv_endpoint=$head_node_ip --rdzv_id=$RANDOM --rdzv_backend=c10d --no_python sevenn input.yaml -d

I also loaded the same modules and ran from the same virtual environment for both the first fine-tuning job and the continuation job:

source /home/venv_sevennet_gpu_cuda118_pytorch230/bin/activate
module load python39
module load nccl
module load cuda11.8
module load git

I used the same commands to avoid a rendezvous crash for both jobs:

export TORCHELASTIC_ENABLE_FILE_TIMER=true

nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

echo $nodes $head_node $SLURM_JOB_NUM_NODES
echo Node IP: $head_node_ip
export MASTER_ADDR=$head_node_ip
export MASTER_PORT=20001
export LOGLEVEL=INFO
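
(For context, the Slurm resource request that matches this setup might look roughly like the header sketched below; the exact directives are an assumption on my part, since the actual batch script is not shown here.)

#!/bin/bash
# Hypothetical header matching the resources described in this issue:
# 2 GPU nodes, 1 process/GPU per node, whole-node memory, 1-week wall time
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --mem=0              # request all memory of each node (496 GB here)
#SBATCH --time=7-00:00:00    # 1-week wall-time limit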

Then I faced "torch.distributed.elastic.multiprocessing.errors.ChildFailedError". The actual error file is too long to copy here...

I can't understand this. The first training job to fine-tune the MP pretrained model worked with 2 GPUs. So how and why does the restart/continue job fail with almost the same input parameters and exactly the same Slurm settings? Am I doing something wrong in the input file for this restart/continuation of the fine-tune training?

CPU memory cannot be the issue because I requested 496 GB (the whole memory of the node) on both GPU nodes. Since the original fine-tune training ran normally, this shouldn't be a GPU memory issue either.

Do I need to use the same GPU nodes for the continuation run that I used for the first fine-tune training? I already tried the same GPU nodes as in the first fine-tuning run, and it still crashed with the same ChildFailedError.

I installed SevenNet into a virtual environment with CUDA 11.8, the prebuilt version of PyTorch 2.3.0 (which can be downloaded from https://pytorch.org/get-started/locally/), and Python 3.9.
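
(For reproducibility, a rough sketch of how such an environment could be set up is below. The PyTorch cu118 wheel index is the standard one from pytorch.org; treat the exact SevenNet install line as an assumption, since the original build steps are not shown.)

python3.9 -m venv venv_sevennet_gpu_cuda118_pytorch230
source venv_sevennet_gpu_cuda118_pytorch230/bin/activate
pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu118
pip install sevenn    # or "pip install ." from a cloned SevenNet repository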

YutackPark commented 2 months ago

That is strange. Could you copy-paste the whole error backtrace?

turbosonics commented 2 months ago

Here are the full contents of the error file printed out by our server cluster:

I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188] Starting elastic_operator with launch configs:
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188]   entrypoint       : sevenn
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188]   min_nodes        : 2
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188]   max_nodes        : 2
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188]   nproc_per_node   : 1
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188]   run_id           : 28673
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188]   rdzv_backend     : c10d
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188]   rdzv_endpoint    : 10.181.132.133
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188]   rdzv_configs     : {'timeout': 900}
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188] Starting elastic_operator with launch configs:
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188]   entrypoint       : sevenn
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188]   min_nodes        : 2
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188]   max_nodes        : 2
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188]   nproc_per_node   : 1
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188]   run_id           : 28673
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188]   rdzv_backend     : c10d
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188]   rdzv_endpoint    : 10.181.132.133
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188]   rdzv_configs     : {'timeout': 900}
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188]   max_restarts     : 0
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188]   monitor_interval : 5
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188]   log_dir          : /tmp/user1/torchelastic_ru2y787f
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188]   metrics_cfg      : {}
I0903 16:36:04.241193 23456247946112 torch/distributed/launcher/api.py:188]
I0903 16:36:04.244791 23456247946112 torch/distributed/elastic/agent/server/api.py:866] [default] starting workers for entrypoint: sevenn
I0903 16:36:04.244893 23456247946112 torch/distributed/elastic/agent/server/api.py:699] [default] Rendezvous'ing worker group
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188]   max_restarts     : 0
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188]   monitor_interval : 5
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188]   log_dir          : /tmp/user1/torchelastic_y3po10lw
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188]   metrics_cfg      : {}
I0903 16:36:04.240921 23456247946112 torch/distributed/launcher/api.py:188]
I0903 16:36:05.244767 23456247946112 torch/distributed/elastic/agent/server/api.py:866] [default] starting workers for entrypoint: sevenn
I0903 16:36:05.245002 23456247946112 torch/distributed/elastic/agent/server/api.py:699] [default] Rendezvous'ing worker group
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568] [default] Rendezvous complete for workers. Result:
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   restart_count=0
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   master_addr=gpunode123.gpuclstr.hpc.net
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   master_port=46677
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   group_rank=0
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   group_world_size=2
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   local_ranks=[0]
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   role_ranks=[0]
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   global_ranks=[0]
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   role_world_sizes=[2]
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   global_world_sizes=[2]
I0903 16:36:06.451372 23456247946112 torch/distributed/elastic/agent/server/api.py:568]
I0903 16:36:06.451587 23456247946112 torch/distributed/elastic/agent/server/api.py:707] [default] Starting worker group
I0903 16:36:06.451759 23456247946112 torch/distributed/elastic/agent/server/local_elastic_agent.py:168] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
I0903 16:36:06.451946 23456247946112 torch/distributed/elastic/multiprocessing/api.py:263] log directory set to: /tmp/user1/torchelastic_ru2y787f/28673_4b88ipix
I0903 16:36:06.452133 23456247946112 torch/distributed/elastic/multiprocessing/api.py:358] Setting worker0 reply file to: /tmp/user1/torchelastic_ru2y787f/28673_4b88ipix/attempt_0/0/error.json
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568] [default] Rendezvous complete for workers. Result:
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   restart_count=0
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   master_addr=gpunode123.gpuclstr.hpc.net
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   master_port=46677
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   group_rank=1
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   group_world_size=2
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   local_ranks=[0]
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   role_ranks=[1]
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   global_ranks=[1]
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   role_world_sizes=[2]
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568]   global_world_sizes=[2]
I0903 16:36:06.455019 23456247946112 torch/distributed/elastic/agent/server/api.py:568]
I0903 16:36:06.455331 23456247946112 torch/distributed/elastic/agent/server/api.py:707] [default] Starting worker group
I0903 16:36:06.455811 23456247946112 torch/distributed/elastic/agent/server/local_elastic_agent.py:168] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
I0903 16:36:06.455995 23456247946112 torch/distributed/elastic/multiprocessing/api.py:263] log directory set to: /tmp/user1/torchelastic_y3po10lw/28673_mdw_u3mo
I0903 16:36:06.456144 23456247946112 torch/distributed/elastic/multiprocessing/api.py:358] Setting worker0 reply file to: /tmp/user1/torchelastic_y3po10lw/28673_mdw_u3mo/attempt_0/0/error.json
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/bin/sevenn", line 8, in <module>
[rank0]:     sys.exit(main())
[rank0]:   File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/main/sevenn.py", line 85, in main
[rank0]:     train(global_config, working_dir)
[rank0]:   File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/scripts/train.py", line 57, in train
[rank0]:     state_dicts, start_epoch, init_csv = processing_continue(config)
[rank0]:   File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/scripts/processing_continue.py", line 75, in processing_continue
[rank0]:     check_config_compatible(config, config_cp)
[rank0]:   File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/scripts/processing_continue.py", line 49, in check_config_compatible
[rank0]:     raise ValueError(
[rank0]: ValueError: reset optimizer and scheduler if you want to change trainable configs
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/bin/sevenn", line 8, in <module>
[rank1]:     sys.exit(main())
[rank1]:   File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/main/sevenn.py", line 85, in main
[rank1]:     train(global_config, working_dir)
[rank1]:   File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/scripts/train.py", line 57, in train
[rank1]:     state_dicts, start_epoch, init_csv = processing_continue(config)
[rank1]:   File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/scripts/processing_continue.py", line 75, in processing_continue
[rank1]:     check_config_compatible(config, config_cp)
[rank1]:   File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/scripts/processing_continue.py", line 49, in check_config_compatible
[rank1]:     raise ValueError(
[rank1]: ValueError: reset optimizer and scheduler if you want to change trainable configs
E0903 16:36:16.466846 23456247946112 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 891499) of binary: sevenn
E0903 16:36:16.470759 23456247946112 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 910583) of binary: sevenn
I0903 16:36:16.473012 23456247946112 torch/distributed/elastic/multiprocessing/errors/__init__.py:360] ('local_rank %s FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html', 0)
Traceback (most recent call last):
  File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sevenn FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-03_16:36:16
  host      : gpunode123.gpuclstr.hpc.net
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 891499)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
I0903 16:36:16.476574 23456247946112 torch/distributed/elastic/multiprocessing/errors/__init__.py:360] ('local_rank %s FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html', 1)
Traceback (most recent call last):
  File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
sevenn FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-03_16:36:16
  host      : gpunode127.gpuclstr.hpc.net
  rank      : 1 (local_rank: 0)
  exitcode  : 1 (pid: 910583)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: gpunode127: task 1: Exited with exit code 1
srun: error: gpunode123: task 0: Exited with exit code 1

turbosonics commented 2 months ago

The geometry I used is a SevenNet graph data file. The geometry was originally calculated from VASP AIMD; I converted it to extxyz and then to a SevenNet graph to use in this training.
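
(As a sketch of that conversion step: SevenNet provides a sevenn_graph_build command-line entry point for turning structure files into its graph data format; the file name below is a placeholder, and the cutoff should match the 5.0 value used in the model section above.)

sevenn_graph_build aimd_frames.extxyz 5.0    # extxyz frames from VASP AIMD -> SevenNet graph data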

I tried both 10000-image and 20000-image cases, and both showed the same symptom: the first fine-tune training (based on the MP pretrained model) runs well with 2 GPUs for a week, but the restarted/continued second fine-tune training crashes with the same ChildFailedError.

The SevenNet version I installed in the virtual environment is from July 23, 2024. Is there any updated version since then?

Also, do you think a custom-built PyTorch, instead of the prebuilt one, would help resolve this error?

YutackPark commented 2 months ago

This is a bug in SevenNet. The part below

[rank0]:   File "/home/user1/venv_sevennet_gpu_cuda118_pytorch230/lib/python3.9/site-packages/sevenn/scripts/processing_continue.py", line 49, in check_config_compatible
[rank0]:     raise ValueError(
[rank0]: ValueError: reset optimizer and scheduler if you want to change trainable configs

should not raise an error, since you didn't change your trainable configs (train_denominator, train_shift_scale).

I will prepare a fix for this case. Sorry for wasting your time. I'll let you know if there is a quick bypass for this problem that preserves your intent.

If you don't mind, you may set reset_optimizer=True and reset_scheduler=True to force the training restart.
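
(Concretely, assuming the rest of the input stays as posted above, the continue block for that forced-restart workaround would look something like this:)

    continue:
        reset_optimizer: True            # force a fresh optimizer state
        reset_scheduler: True            # force a fresh scheduler state
        reset_epoch: False               # keep the epoch counter from the checkpoint
        checkpoint: '{First_fine_tune_directory}/checkpoint_best.pth'
        use_statistic_values_of_checkpoint: True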

YutackPark commented 2 months ago

I just updated the main branch to fix this issue. It seems good on my machine.

If you're not familiar with git, you can update SevenNet like this:

cd {SevenNet folder}
git fetch
git pull
pip install .

Then you should be able to continue your training.
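
(To confirm which version is actually installed in the virtual environment after the update, a generic pip check works; this is not a SevenNet-specific command:)

pip show sevenn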

The bad news is that the config train_denominator=True was silently ignored. It will work as intended from now on. However, in my experience, a fully converged potential gives almost the same accuracy regardless of train_denominator or train_shift_scale. Sorry for the bug anyway.

turbosonics commented 2 months ago

I updated SevenNet, changed reset_optimizer and reset_scheduler to True for the restart/continuation job, and submitted it. The restart/continuation job runs fine so far. Thanks! I will report back if anything comes up again.