Closed. franklio closed this 2 months ago.
Hi! Our scripts use srun to start the training. If you do not have srun, please remove it and run the script directly, like this:
export MASTER_PORT=$((12000 + $RANDOM % 20000))
export OMP_NUM_THREADS=1
JOB_NAME='videomamba_tiny_f32_res224'
OUTPUT_DIR="$(dirname $0)/$JOB_NAME"
LOG_DIR="./logs/${JOB_NAME}"
PREFIX='your_breakfast_path'
DATA_PATH='your_breakfast_metadata_path'
PARTITION='video5'
GPUS=8
GPUS_PER_NODE=8
CPUS_PER_TASK=16
python run_class_finetuning.py \
--model videomamba_tiny \
--finetune your_model_path/videomamba_t16_k400_f32_res224.pth \
--data_path ${DATA_PATH} \
--prefix ${PREFIX} \
--data_set 'Kinetics_sparse' \
--split ',' \
--nb_classes 10 \
--log_dir ${OUTPUT_DIR} \
--output_dir ${OUTPUT_DIR} \
--batch_size 32 \
--num_sample 2 \
--input_size 224 \
--short_side_size 224 \
--save_ckpt_freq 100 \
--num_frames 32 \
--orig_t_size 32 \
--num_workers 12 \
--warmup_epochs 5 \
--tubelet_size 1 \
--epochs 70 \
--lr 2e-4 \
--drop_path 0.1 \
--aa rand-m5-n2-mstd0.25-inc1 \
--opt adamw \
--opt_betas 0.9 0.999 \
--weight_decay 0.1 \
--test_num_segment 4 \
--test_num_crop 3 \
--dist_eval \
--test_best \
--bf16
Thank you for your prompt response. I have this piece of code, and I simply followed the instructions in the GitHub documentation to change PREFIX and DATA_PATH to my own data path because I wasn't sure how to proceed. Here it is:
PREFIX='/home/simslab-n/Downloads/BreakfastII_15fps_qvga_sync'
DATA_PATH='/home/simslab-n/Downloads/BreakfastII_15fps_qvga_sync'
Thanks again for your assistance.
The PREFIX is used to set your data path, like what you did. The DATA_PATH is used to set your annotation path, and I have uploaded it here.
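To make the difference between the two settings concrete, here is a minimal sketch of how a loader might combine them. It assumes that DATA_PATH is a directory holding an annotation file such as train.csv, that each row has the form "relative video path, label" separated by the --split character, and that PREFIX is prepended to every relative path. The file name train.csv, the function name load_annotations, and the example rows are illustrative assumptions, not taken from the VideoMamba code.

```python
import csv
import os
import tempfile


def load_annotations(data_path, prefix, split=",", mode="train"):
    """Return (video_path, label) pairs for one annotation file.

    Assumption: annotations live in <data_path>/<mode>.csv with rows of the
    form "<relative video path><split><integer label>".
    """
    anno_file = os.path.join(data_path, f"{mode}.csv")
    samples = []
    with open(anno_file, newline="") as f:
        for row in csv.reader(f, delimiter=split):
            if not row:  # skip blank lines
                continue
            rel_path, label = row[0], int(row[1])
            # PREFIX is the video root; the annotation stores relative paths.
            samples.append((os.path.join(prefix, rel_path), label))
    return samples


if __name__ == "__main__":
    # Build a tiny, made-up annotation directory to show the result.
    with tempfile.TemporaryDirectory() as anno_dir:
        with open(os.path.join(anno_dir, "train.csv"), "w") as f:
            f.write("P03/cam01/P03_cereals.avi,1\n")
            f.write("P03/cam01/P03_coffee.avi,0\n")
        for path, label in load_annotations(anno_dir, "/data/breakfast"):
            print(label, path)
```

Under these assumptions, pointing both PREFIX and DATA_PATH at the video folder would leave the loader without an annotation file to read, so the two paths normally differ.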
Hi there,
Thank you so much for the assistance you provided earlier. I've been trying for the past few days and managed to resolve the previous issue. However, I've run into another problem during execution. Here is the error output:
optimizer settings: {'lr': 3.125e-06, 'weight_decay': 0.0, 'eps': 1e-08, 'betas': [0.9, 0.999]}
Use bf16: True
Use step level LR scheduler!
Set warmup steps = 3945
Set warmup steps = 0
Max WD = 0.1000000, Min WD = 0.1000000
criterion = SoftTargetCrossEntropy()
Auto resume checkpoint:
Start training for 10 epochs
Traceback (most recent call last):
File "/home/simslab-n/Desktop/VideoMamba/videomamba/video_sm/run_class_finetuning.py", line 713, in <module>
main(opts, ds_init)
File "/home/simslab-n/Desktop/VideoMamba/videomamba/video_sm/run_class_finetuning.py", line 630, in main
train_stats = train_one_epoch(
File "/home/simslab-n/Desktop/VideoMamba/videomamba/video_sm/engines/engine_for_finetuning.py", line 82, in train_one_epoch
loss_list = [torch.zeros_like(loss) for _ in range(dist.get_world_size())]
File "/home/simslab-n/miniconda3/envs/Mamba/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1555, in get_world_size
return _get_group_size(group)
File "/home/simslab-n/miniconda3/envs/Mamba/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 836, in _get_group_size
default_pg = _get_default_group()
File "/home/simslab-n/miniconda3/envs/Mamba/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 977, in _get_default_group
raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
I am trying to fine-tune VideoMamba on a single GPU. I attempted to add the line
torch.distributed.init_process_group('nccl', init_method='file:///my_file', world_size=1, rank=0)
before the line
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)
(in run_class_finetuning.py, line 562), but it did not seem to have any effect. Below is the .sh file I've modified.
export MASTER_PORT=$((12000 + $RANDOM % 20000))
export OMP_NUM_THREADS=1
JOB_NAME='videomamba_tiny_f32_res224'
OUTPUT_DIR="$(dirname $0)/$JOB_NAME"
LOG_DIR="./logs/${JOB_NAME}"
# PREFIX='your_breakfast_path'
# DATA_PATH='your_breakfast_metadata_path'
PREFIX='/home/simslab-n/Desktop/VideoMamba/videomamba/video_sm/breakfast_video'
DATA_PATH='/home/simslab-n/Desktop/VideoMamba/videomamba/video_sm/breakfast_label'
PARTITION='video5'
GPUS=1
GPUS_PER_NODE=1
CPUS_PER_TASK=16
python run_class_finetuning.py \
--model videomamba_tiny \
--finetune /home/simslab-n/Desktop/VideoMamba/videomamba/video_sm/videomamba_t16_breakfast_f32_res224.pth \
--data_path ${DATA_PATH} \
--prefix ${PREFIX} \
--data_set 'Kinetics_sparse' \
--split ',' \
--nb_classes 10 \
--log_dir ${OUTPUT_DIR} \
--output_dir ${OUTPUT_DIR} \
--batch_size 2 \
--num_sample 2 \
--input_size 224 \
--short_side_size 224 \
--save_ckpt_freq 100 \
--num_frames 32 \
--orig_t_size 32 \
--num_workers 12 \
--warmup_epochs 5 \
--tubelet_size 1 \
--epochs 10 \
--lr 2e-4 \
--drop_path 0.1 \
--aa rand-m5-n2-mstd0.25-inc1 \
--opt adamw \
--opt_betas 0.9 0.999 \
--weight_decay 0.1 \
--test_num_segment 4 \
--test_num_crop 3 \
--dist_eval \
--test_best \
--bf16
Could you please take a look and see if you can spot anything that might be causing the issue? Your continued assistance would be greatly appreciated!
I've managed to solve the issue by adding the following code!
import torch.distributed as dist
dist.init_process_group(backend='nccl', init_method='tcp://localhost:23456', rank=0, world_size=1)
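The fix above hard-codes port 23456, which can fail if another process on the machine is already using that port. A hedged variant, assuming one process on one machine, asks the operating system for a free port and selects the gloo backend when no GPU is available. The helper names find_free_port and single_process_init_kwargs are hypothetical; only the keyword names (backend, init_method, rank, world_size) come from torch.distributed.init_process_group.

```python
import socket


def find_free_port():
    """Ask the OS for a TCP port that is currently unused on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS choose
        return s.getsockname()[1]


def single_process_init_kwargs(use_cuda=True, port=None):
    """Build keyword arguments for a one-process "distributed" run."""
    if port is None:
        port = find_free_port()
    return {
        "backend": "nccl" if use_cuda else "gloo",  # nccl requires CUDA GPUs
        "init_method": f"tcp://127.0.0.1:{port}",
        "rank": 0,
        "world_size": 1,
    }


if __name__ == "__main__":
    kwargs = single_process_init_kwargs(use_cuda=False, port=23456)
    print(kwargs)
    # With PyTorch installed, the group would then be created with:
    #   import torch.distributed as dist
    #   dist.init_process_group(**kwargs)
```

The port is released before PyTorch binds it, so a small race window remains. For a single local run that is usually acceptable. The call has to execute before the first use of dist.get_world_size(), for instance near the top of main(). Placing it inside a branch that is skipped in non-distributed mode would have no effect, which may be why the earlier attempt at line 562 did not help.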
Thank you for developing such an inspiring model. However, while attempting the Video Understanding task on the breakfast dataset, I encountered a specific issue.
I adjusted the PREFIX and DATA_PATH in exp/breakfast/videomamba_middle_mask/run_f32x224.sh to match my local breakfast dataset file paths (downloaded from here) and executed bash ./exp/breakfast/videomamba_middle_mask/run_f32x224.sh. Unfortunately, it resulted in the following error:
I suspect there might be an issue with the file or directory format. Could you please provide more details regarding the video data, particularly for the breakfast dataset, and clarify the location for placing the weight files? Your assistance would be greatly appreciated!