MCG-NJU / VideoMAE

[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
https://arxiv.org/abs/2203.12602

Resume training from checkpoint #22

Open soltkreig opened 2 years ago

soltkreig commented 2 years ago

Hi! I fine-tuned the model on my dataset and I'd like to resume from a saved .pt checkpoint, but whenever I restart fine-tuning it always begins from epoch 0. My finetune.sh:

# Set the path to save checkpoints
OUTPUT_DIR='/home/jovyan/people/Murtazin/VideoMAE/output_ckpts/eval_lr_1e-3_epoch_55'
# path to Kinetics set (train.csv/val.csv/test.csv)
DATA_PATH='/home/jovyan/datasets/sign_language/WLASL/WLASL_kinetic_hardcode'
# path to pretrain model
MODEL_PATH='/home/jovyan/people/Murtazin/VideoMAE/ckpts/checkpoint.pth' 
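# path to the checkpoint to resume from (a DeepSpeed model-states file)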
PT_PATH='/home/jovyan/people/Murtazin/VideoMAE/output_ckpts/eval_lr_1e-3_epoch_100/checkpoint-45/mp_rank_00_model_states.pt' 

# batch_size can be adjusted according to number of GPUs
# this script is for 64 GPUs (8 nodes x 8 GPUs)
OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 \
    run_class_finetuning.py \
    --model vit_large_patch16_224 \
    --data_set WLASL \
    --nb_classes 2000 \
    --data_path ${DATA_PATH} \
    --resume ${PT_PATH} \
    --log_dir ${OUTPUT_DIR} \
    --output_dir ${OUTPUT_DIR} \
    --batch_size 2 \
    --num_sample 2 \
    --input_size 224 \
    --short_side_size 224 \
    --save_ckpt_freq 10 \
    --num_frames 32 \
    --sampling_rate 2 \
    --opt adamw \
    --lr 2e-3 \
    --opt_betas 0.9 0.999 \
    --weight_decay 0.05 \
    --epochs 55 \
    --dist_eval \
    --test_num_segment 5 \
    --test_num_crop 3 \
    --enable_deepspeed
congee524 commented 1 year ago

With DeepSpeed enabled, only '--auto_resume' is supported ('--resume' is not honored on that code path). You can copy checkpoint-45 into the work dir (your OUTPUT_DIR) so auto-resume can find it.
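
A minimal sketch of that workaround, reusing the paths from the thread above. The only changes to the original finetune.sh are the copy step and swapping '--resume ${PT_PATH}' for '--auto_resume' (the flag congee524 names; it is assumed here to be a boolean switch that scans OUTPUT_DIR for the newest checkpoint-* directory):

OUTPUT_DIR='/home/jovyan/people/Murtazin/VideoMAE/output_ckpts/eval_lr_1e-3_epoch_55'
DATA_PATH='/home/jovyan/datasets/sign_language/WLASL/WLASL_kinetic_hardcode'

# Copy the whole DeepSpeed checkpoint directory (not just the
# mp_rank_00_model_states.pt file inside it) into the output dir.
cp -r /home/jovyan/people/Murtazin/VideoMAE/output_ckpts/eval_lr_1e-3_epoch_100/checkpoint-45 "${OUTPUT_DIR}/"

# Relaunch with --auto_resume instead of --resume ${PT_PATH}.
OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 \
    run_class_finetuning.py \
    --model vit_large_patch16_224 \
    --data_set WLASL \
    --nb_classes 2000 \
    --data_path ${DATA_PATH} \
    --auto_resume \
    --log_dir ${OUTPUT_DIR} \
    --output_dir ${OUTPUT_DIR} \
    --batch_size 2 \
    --num_sample 2 \
    --input_size 224 \
    --short_side_size 224 \
    --save_ckpt_freq 10 \
    --num_frames 32 \
    --sampling_rate 2 \
    --opt adamw \
    --lr 2e-3 \
    --opt_betas 0.9 0.999 \
    --weight_decay 0.05 \
    --epochs 55 \
    --dist_eval \
    --test_num_segment 5 \
    --test_num_crop 3 \
    --enable_deepspeed

If auto-resume finds checkpoint-45, training should continue from epoch 46 rather than restarting at epoch 0.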