OpenGVLab / VideoMamba

VideoMamba: State Space Model for Efficient Video Understanding
https://arxiv.org/abs/2403.06977
Apache License 2.0

Difficulty Encountered While Fine-Tuning Model on Breakfast Dataset and Questions About Folder Structure #35

Closed · franklio closed this issue 2 months ago

franklio commented 2 months ago

Thank you for developing such an inspiring model. However, I encountered an issue while attempting the video understanding task on the Breakfast dataset.

I adjusted PREFIX and DATA_PATH in exp/breakfast/videomamba_middle_mask/run_f32x224.sh to match my local Breakfast dataset paths (downloaded from here) and then executed bash ./exp/breakfast/videomamba_middle_mask/run_f32x224.sh. Unfortunately, it resulted in the following error:

srun: error: invalid partition specified: video5
srun: error: Unable to allocate resources: Invalid partition name specified

I suspect there might be an issue with my file or directory layout. Could you please provide more details about the expected layout of the video data, particularly for the Breakfast dataset, and clarify where the weight files should be placed? Your assistance would be greatly appreciated!

Andy1621 commented 2 months ago

Hi! Our scripts use srun to start training. If you do not have srun, please remove it and simply run the script directly, like this:

export MASTER_PORT=$((12000 + $RANDOM % 20000))
export OMP_NUM_THREADS=1

JOB_NAME='videomamba_tiny_f32_res224'
OUTPUT_DIR="$(dirname $0)/$JOB_NAME"
LOG_DIR="./logs/${JOB_NAME}"
PREFIX='your_breakfast_path'
DATA_PATH='your_breakfast_metadata_path'

PARTITION='video5'
GPUS=8
GPUS_PER_NODE=8
CPUS_PER_TASK=16
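# Note: the Slurm settings above (PARTITION, GPUS, GPUS_PER_NODE, CPUS_PER_TASK)
# are only consumed by srun; once srun is removed they are unused and can be dropped.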

python run_class_finetuning.py \
        --model videomamba_tiny \
        --finetune your_model_path/videomamba_t16_k400_f32_res224.pth \
        --data_path ${DATA_PATH} \
        --prefix ${PREFIX} \
        --data_set 'Kinetics_sparse' \
        --split ',' \
        --nb_classes 10 \
        --log_dir ${OUTPUT_DIR} \
        --output_dir ${OUTPUT_DIR} \
        --batch_size 32 \
        --num_sample 2 \
        --input_size 224 \
        --short_side_size 224 \
        --save_ckpt_freq 100 \
        --num_frames 32 \
        --orig_t_size 32 \
        --num_workers 12 \
        --warmup_epochs 5 \
        --tubelet_size 1 \
        --epochs 70 \
        --lr 2e-4 \
        --drop_path 0.1 \
        --aa rand-m5-n2-mstd0.25-inc1 \
        --opt adamw \
        --opt_betas 0.9 0.999 \
        --weight_decay 0.1 \
        --test_num_segment 4 \
        --test_num_crop 3 \
        --dist_eval \
        --test_best \
        --bf16
franklio commented 2 months ago

Thank you for your prompt response. Since I wasn't sure how to proceed, I simply followed the instructions in the GitHub documentation and changed PREFIX and DATA_PATH to my own data paths. Here is what I have:

PREFIX='/home/simslab-n/Downloads/BreakfastII_15fps_qvga_sync'
DATA_PATH='/home/simslab-n/Downloads/BreakfastII_15fps_qvga_sync'

Thanks again for your assistance.

Andy1621 commented 2 months ago

The PREFIX is used to set your video data path, as you did. The DATA_PATH is used to set your annotation path, and I have uploaded it here.
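For reference, and only as an assumption based on the flags in the script above (--data_set 'Kinetics_sparse', --split ',', --nb_classes 10) rather than an authoritative spec, the annotation files are plain-text lists where each line holds a video path (relative to PREFIX) and an integer class label, separated by the --split character, roughly like (paths and labels below are placeholders):

video_folder/video_0001.avi,0
video_folder/video_0002.avi,7

Please check the uploaded files for the exact columns.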

franklio commented 2 months ago

Hi there,

Thank you so much for the assistance you provided earlier. I've been trying for the past few days and did manage to resolve the previous issue. However, I've run into difficulties again during execution; here's the error output:

optimizer settings: {'lr': 3.125e-06, 'weight_decay': 0.0, 'eps': 1e-08, 'betas': [0.9, 0.999]}
Use bf16: True
Use step level LR scheduler!
Set warmup steps = 3945
Set warmup steps = 0
Max WD = 0.1000000, Min WD = 0.1000000
criterion = SoftTargetCrossEntropy()
Auto resume checkpoint: 
Start training for 10 epochs
Traceback (most recent call last):
  File "/home/simslab-n/Desktop/VideoMamba/videomamba/video_sm/run_class_finetuning.py", line 713, in <module>
    main(opts, ds_init)
  File "/home/simslab-n/Desktop/VideoMamba/videomamba/video_sm/run_class_finetuning.py", line 630, in main
    train_stats = train_one_epoch(
  File "/home/simslab-n/Desktop/VideoMamba/videomamba/video_sm/engines/engine_for_finetuning.py", line 82, in train_one_epoch
    loss_list = [torch.zeros_like(loss) for _ in range(dist.get_world_size())]
  File "/home/simslab-n/miniconda3/envs/Mamba/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1555, in get_world_size
    return _get_group_size(group)
  File "/home/simslab-n/miniconda3/envs/Mamba/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 836, in _get_group_size
    default_pg = _get_default_group()
  File "/home/simslab-n/miniconda3/envs/Mamba/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 977, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

I have tried to fine-tune VideoMamba on a single GPU. I attempted to add the line torch.distributed.init_process_group('nccl', init_method='file:///my_file', world_size=1, rank=0) before the line model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True) (run_class_finetuning.py, line 562), but it seems to have had no effect. Below is the .sh file I've modified.

export MASTER_PORT=$((12000 + $RANDOM % 20000))
export OMP_NUM_THREADS=1

JOB_NAME='videomamba_tiny_f32_res224'
OUTPUT_DIR="$(dirname $0)/$JOB_NAME"
LOG_DIR="./logs/${JOB_NAME}"
# PREFIX='your_breakfast_path'
# DATA_PATH='your_breakfast_metadata_path'
PREFIX='/home/simslab-n/Desktop/VideoMamba/videomamba/video_sm/breakfast_video'
DATA_PATH='/home/simslab-n/Desktop/VideoMamba/videomamba/video_sm/breakfast_label'

PARTITION='video5'
GPUS=1
GPUS_PER_NODE=1
CPUS_PER_TASK=16

python run_class_finetuning.py \
        --model videomamba_tiny \
        --finetune /home/simslab-n/Desktop/VideoMamba/videomamba/video_sm/videomamba_t16_breakfast_f32_res224.pth \
        --data_path ${DATA_PATH} \
        --prefix ${PREFIX} \
        --data_set 'Kinetics_sparse' \
        --split ',' \
        --nb_classes 10 \
        --log_dir ${OUTPUT_DIR} \
        --output_dir ${OUTPUT_DIR} \
        --batch_size 2 \
        --num_sample 2 \
        --input_size 224 \
        --short_side_size 224 \
        --save_ckpt_freq 100 \
        --num_frames 32 \
        --orig_t_size 32 \
        --num_workers 12 \
        --warmup_epochs 5 \
        --tubelet_size 1 \
        --epochs 10 \
        --lr 2e-4 \
        --drop_path 0.1 \
        --aa rand-m5-n2-mstd0.25-inc1 \
        --opt adamw \
        --opt_betas 0.9 0.999 \
        --weight_decay 0.1 \
        --test_num_segment 4 \
        --test_num_crop 3 \
        --dist_eval \
        --test_best \
        --bf16

Could you please take a look and see if you can spot anything that might be causing the issue? Your continued assistance would be greatly appreciated!

franklio commented 2 months ago

I've managed to solve the issue by adding the following code!

import torch.distributed as dist
dist.init_process_group(backend='nccl', init_method='tcp://localhost:23456', rank=0, world_size=1)
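
For anyone who hits the same ValueError: initializing a single-process group with world_size=1 and rank=0 is enough for dist.get_world_size() and the DistributedDataParallel wrapper to work on one GPU. As an alternative sketch, assuming the script's distributed setup reads the standard environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), which I have not verified, the unmodified script could instead be launched through torchrun, which exports those variables itself:

# Hypothetical single-GPU launch via torchrun; shortened for brevity,
# the remaining flags from the .sh file above can be appended unchanged.
torchrun --nproc_per_node=1 run_class_finetuning.py \
        --model videomamba_tiny \
        --data_path ${DATA_PATH} \
        --prefix ${PREFIX} \
        --data_set 'Kinetics_sparse' \
        --split ',' \
        --nb_classes 10 \
        --num_frames 32 \
        --batch_size 2 \
        --epochs 10 \
        --bf16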