carbonscott / maxie

Masked Autoencoder for X-ray Image Encoding (MAXIE)

Support automatic resumption in HPC environments #12

Open · carbonscott opened 1 month ago

carbonscott commented 1 month ago

Proposal:

carbonscott commented 1 month ago

Test run with a 24:00 (24-hour) walltime on the killable queue.

```yaml
checkpoint:
  chkpt_saving_period: 1
  preempt_chkpt_saving_period: 1
  directory: experiments/chkpts
  prefix: preempt-run
  path_chkpt_prev: null
  pretrain: null
dataset:
  batch_size: 1
  num_workers: 1
  path_train: experiments/datasets/dataset.train.json
  path_eval: experiments/datasets/dataset.eval.json
  seg_size: 4
  entry_per_cycle: 1
  debug: true
  server_address:
  - localhost
  - 5000
  transforms:
    norm:
      Rayonix:
        mean: 116.92
        std: 22.89
      epix10k2M:
        mean: 46.6
        std: 98.3
      jungfrau4M:
        mean: 593.17
        std: 204.13
    H_pad: 2048
    W_pad: 2048
    num_patch: 100
    size_patch: 20
    angle_max: 360
    frac_shift_max: 0.1
    downscale_factors:
    - 2
    - 2
    var_size_patch: 0.2
    patch_size: 224
    stride: 224
dist:
  backend: nccl
  uses_unique_world_seed: true
  dtype: float16
logging:
  directory: experiments/logs
  prefix: preempt-run
  level: debug
loss:
  grad_accum_steps: 4
lr_scheduler:
  min_lr: 1.0e-07
  total_iterations: 1000000
  uses_prev: true
  warmup_iterations: 5
  scheduler_step_period: 50
misc:
  max_epochs: 5
  max_eval_iter: 2
  max_eval_retry: 2
  num_gpus: 6
  uses_mixed_precision: true
  compiles_model: false
  data_dump_on: false
model:
  name: facebook/vit-mae-base
optim:
  grad_clip: 1.0
  lr: 0.0002
  weight_decay: 0.0001

# THIS SCRIPT IS GENERATED BY EXECUTING:
# python launch_job.py train_config.dataset.path_train=experiments/datasets/dataset.train.json train_config.dataset.path_eval=experiments/datasets/dataset.eval.json train_config.misc.num_gpus=6 train_config.dataset.batch_size=1 train_config.dataset.num_workers=1 train_config.loss.grad_accum_steps=4 train_config.model.name=facebook/vit-mae-base train_config.dataset.seg_size=4 train_config.misc.max_eval_iter=2 train_config.lr_scheduler.scheduler_step_period=50 train_config.misc.data_dump_on=false train_config.checkpoint.prefix=preempt-run train_config.logging.prefix=preempt-run job=preempt-run bsub_config.ipc_workers=2 bsub_config.qos=killable bsub_config.walltime=24:00 bsub_config.num_nodes=10 bsub_config.trainer=train.fsdp.py train_config.dataset.entry_per_cycle=1 train_config.dataset.debug=true auto_submit=true bsub_config.num_cpus_for_client=4 train_config.checkpoint.path_chkpt_prev=null
```
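
For orientation, the key field is `checkpoint.path_chkpt_prev`: it stays `null` on a fresh submission, and the bsub script below patches it to point at the latest preemptive checkpoint before relaunching. A minimal sketch of how a trainer can consume it; the checkpoint keys here (`last_epoch`, `last_seg`) are assumptions inferred from the resumption log later in this thread, not MAXIE's actual schema:

```python
import torch
import yaml

# Sketch of the consumer side of checkpoint.path_chkpt_prev.
# The real trainer is train.fsdp.py; keys below are assumed.
with open("experiments/yaml/preempt-run.yaml") as f:
    config = yaml.safe_load(f)

path_chkpt_prev = config["checkpoint"]["path_chkpt_prev"]
if path_chkpt_prev is not None:
    # Resume: restore training state saved before the job was killed.
    state = torch.load(path_chkpt_prev, map_location="cpu")
    last_epoch = state.get("last_epoch", 0)
    last_seg = state.get("last_seg")  # e.g. "1440-1680"
else:
    state = None  # fresh start; train from scratch
```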

```bash
#!/bin/bash
#BSUB -o lsf/%J.log
#BSUB -e lsf/%J.err
#BSUB -q killable
#BSUB -W 24:00
#BSUB -P <PROJECT_ID>
#BSUB -J preempt-run
#BSUB -nnodes 10

# Set up the Hugging Face cache directory
export TRANSFORMERS_CACHE=$HOME/.cache/huggingface

export http_proxy=http://proxy.ccs.ornl.gov:3128/
export https_proxy=http://proxy.ccs.ornl.gov:3128/

export OMP_NUM_THREADS=1
export NCCL_DEBUG=INFO
export TORCH_NCCL_BLOCKING_WAIT=1

# Set up a metadata file recording the latest preemptive checkpoint
export PREEMPT_ROOT="preempt"
mkdir -p $PREEMPT_ROOT
export PREEMPT_METADATA_PATH="$PREEMPT_ROOT/preempt-run.dat"

# Check if a checkpoint exists and resume from it
if [ -f "$PREEMPT_METADATA_PATH" ]; then
    echo "Resuming from preemptive checkpoint..."
    python -c "import yaml, os; data = yaml.safe_load(open('experiments/yaml/preempt-run.yaml')); data['checkpoint']['path_chkpt_prev'] = open(os.getenv('PREEMPT_METADATA_PATH')).read().strip(); yaml.safe_dump(data, open('experiments/yaml/preempt-run.yaml', 'w'))"
fi

# Fetch all nodes and output a whole string of concatenated host nodes
# $LSB_MCPU_HOSTS gives something like "batch02 1 a09n03 42 a10n04 42".
# I need just "a09n03 a10n04" to set up a head node.
nodelist=$(echo $LSB_MCPU_HOSTS | awk '{for (i=3; i<=NF; i+=2) print $i}' | sort | uniq)    # "a09n03 a10n04"
read -r -a nodes <<< "$nodelist"
head_node=${nodes[0]}
head_node_ip=$(ssh "$head_node" hostname --ip-address)
head_node_ip=$(echo "$head_node_ip" | awk '{print $1}')

echo Node IP: $head_node_ip
export LOGLEVEL=INFO

echo "Starting server..."
jsrun \
--tasks_per_rs 1 \
--cpu_per_rs 6 \
--gpu_per_rs 0 \
--rs_per_host 1 \
--latency_priority cpu-cpu \
--launch_distribution packed \
python server.ipc.py --num_workers 2 &

sleep 10

echo "Running client..."
jsrun \
--rs_per_host 6 \
--tasks_per_rs 1 \
--cpu_per_rs 4 \
--gpu_per_rs 1 \
--latency_priority gpu-gpu \
--launch_distribution packed \
python train.fsdp.py experiments/yaml/preempt-run.yaml

# Kill all running applications (e.g. the ipc server)
jskill all
# THIS SCRIPT IS GENERATED BY EXECUTING:
# python launch_job.py train_config.dataset.path_train=experiments/datasets/dataset.train.json train_config.dataset.path_eval=experiments/datasets/dataset.eval.json train_config.misc.num_gpus=6 train_config.dataset.batch_size=1 train_config.dataset.num_workers=1 train_config.loss.grad_accum_steps=4 train_config.model.name=facebook/vit-mae-base train_config.dataset.seg_size=4 train_config.misc.max_eval_iter=2 train_config.lr_scheduler.scheduler_step_period=50 train_config.misc.data_dump_on=false train_config.checkpoint.prefix=preempt-run train_config.logging.prefix=preempt-run job=preempt-run bsub_config.ipc_workers=2 bsub_config.qos=killable bsub_config.walltime=24:00 bsub_config.num_nodes=10 bsub_config.trainer=train.fsdp.py train_config.dataset.entry_per_cycle=1 train_config.dataset.debug=true auto_submit=true bsub_config.num_cpus_for_client=4 train_config.checkpoint.path_chkpt_prev=null
```
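
Note the script only ever reads `$PREEMPT_METADATA_PATH`; the file has to be written by the trainer (presumably in `train.fsdp.py`, per `preempt_chkpt_saving_period` in the config) after each preemptive checkpoint save. A hedged sketch of that producer side; the helper name is hypothetical, not MAXIE's actual code:

```python
import os

import torch

def save_preempt_checkpoint(state: dict, path_chkpt: str) -> None:
    """Hypothetical trainer-side counterpart to the bsub logic above:
    save a preemptive checkpoint and record its path so the next
    submission can resume from it."""
    torch.save(state, path_chkpt)

    # The bsub script reads this file on startup and patches
    # checkpoint.path_chkpt_prev in the YAML config accordingly.
    metadata_path = os.environ.get("PREEMPT_METADATA_PATH")
    if metadata_path is not None:
        with open(metadata_path, "w") as f:
            f.write(path_chkpt)
```
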
carbonscott commented 1 month ago

Sample log during resumption:

```
05/28/2024 00:54:37 INFO __main__
Loading from checkpoint -- experiments/chkpts/preempt-run.2024_0527_1004_49.preempt.
05/28/2024 00:54:37 INFO __main__
PREV - last_epoch 0, last_seg 1440-1680, loss_min = 571.4844360351562
05/28/2024 00:54:37 DEBUG __main__
[RANK 0] Ready for training loop...
05/28/2024 00:54:37 DEBUG maxie.datasets.ipc_segmented_dataset_dist
[RANK 0] Setting start idx to 1680.
05/28/2024 00:54:37 DEBUG maxie.datasets.ipc_segmented_dataset_dist
[RANK 0] Initializing entry generator.
05/28/2024 00:54:37 DEBUG maxie.datasets.ipc_segmented_dataset_dist
[RANK 0] Updating segment to 1680-1920.
```
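
The segment bookkeeping the dataset performs here can be mirrored in a few lines (toy sketch, not the actual `ipc_segmented_dataset_dist` code):

```python
# Resume at the end of the last completed segment and keep advancing in
# fixed-width windows. The 240-entry width matches the log above
# (1440-1680 -> 1680-1920); it is plausibly seg_size x world size
# (4 x 60 ranks), but that is an inference, not something the log states.
def next_segment(last_seg: str, width: int = 240) -> tuple[int, int]:
    _, end = (int(x) for x in last_seg.split("-"))
    return end, end + width

start, end = next_segment("1440-1680")
print(f"Updating segment to {start}-{end}")  # -> 1680-1920
```
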
carbonscott commented 1 month ago

Encountered the following error once (`less lsf/3476339.err`):

```
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
return TCPStore(
TimeoutError: The client socket has timed out after 1800s while trying to connect to (h36n17, 8888)
```

I suspect it's a connection issue with the head node (h36n17).
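
The 1800 s in the traceback is the default c10d rendezvous timeout. If the head node is merely slow to come up rather than unreachable, one mitigation worth trying is raising that timeout at process-group initialization (a generic PyTorch knob, not something this thread confirms helps here):

```python
from datetime import timedelta

import torch.distributed as dist

# Raise the rendezvous timeout from the 30-minute default; this only
# helps if the head node eventually becomes reachable.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```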

carbonscott commented 1 month ago

The killable queue seems to be problematic (again, I suspect a connection issue). As a workaround for now, use the batch queue together with the monitoring script at https://github.com/carbonscott/maxie/blob/main/train/monitor_and_resubmit.sh; a rough sketch of the idea follows.
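
The linked workaround is a bash script; the same idea rendered as a rough Python sketch (the job name and script filename are assumptions, and the real `monitor_and_resubmit.sh` may differ in detail):

```python
import subprocess
import time

JOB_NAME = "preempt-run"           # assumed: matches #BSUB -J above
BSUB_SCRIPT = "preempt-run.bsub"   # assumed filename for the script above

while True:
    # With -noheader, bjobs prints nothing to stdout when no matching
    # job is pending or running, which is the cue to resubmit.
    out = subprocess.run(
        ["bjobs", "-noheader", "-J", JOB_NAME],
        capture_output=True, text=True,
    ).stdout.strip()
    if not out:
        print("Job not found; resubmitting...")
        # bsub parses #BSUB directives when the script is fed via stdin.
        with open(BSUB_SCRIPT) as f:
            subprocess.run(["bsub"], stdin=f, check=False)
    time.sleep(600)  # poll every 10 minutes
```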