Open olccihyeon opened 1 week ago
Also, did you use `--multi_task_fusion True` in your actual experiments?
Did the total batch size affect the results?
I'm using 4 GPUs on one A100 node with a batch size of 80 per GPU, and I'm getting lower performance than reported in the paper.
Thank you for your answer.
Did you use the first-stage weights I provided for the second-stage training? Since we gather negatives across devices for all samples, the total batch size in contrastive learning indeed has a significant impact on model performance.
In addition, we do not recommend training the second stage for too many steps. Excessive steps can easily lead to overfitting on the VISTA-S2 dataset, which hurts zero-shot evaluation performance.
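For intuition (my own back-of-the-envelope sketch, not from the authors): with `--negatives_cross_device`, each query is scored against the passages gathered from every GPU, so the in-batch candidate pool scales with the total batch size, not the per-GPU one.

```shell
#!/bin/bash
# Assumed formula: pool = per-GPU batch * number of GPUs * group size,
# where group size = 1 positive + hard negatives per query.
BSZ_PERGPU=80     # per-GPU batch size from the question above
GPUS=4            # 4 GPUs on one A100 node
GROUP_SIZE=4      # 1 positive + 3 hard negatives
POOL=$((BSZ_PERGPU * GPUS * GROUP_SIZE))
echo "candidate pool per query: $POOL"
# prints: candidate pool per query: 1280
```

At the paper's total batch size of 1920 queries the same formula gives 7680 candidates, a 6x larger pool, which is consistent with the answer above that total batch size matters.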
Thank you for your answer! Now I understand it properly.
I have a couple of questions.
In the paper you say you used a batch size of 1920 when experimenting with stage 2; can you tell me how many GPU nodes you actually used, and how many GPUs per node?
Looking at the bash file:
```bash
#!/bin/bash
env

GPUS_PER_NODE=8

# Change for multinode config
MASTER_ADDR=     # your machine address
MASTER_PORT=661
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

DATA_PATH=       # your training data
SAVE_PATH=       # your saving path
IMAGE_PATH=      # your image path

EPOCH=5
RESUME_PATH=     # the checkpoint path for initializing the model
SAVE_STEPS=100
GROUP_SIZE=4     # = one (positive sample) + number (of hard negative samples)
BSZ_PERGPU=80    # batch size per GPU
LR=2e-5

TRAINING_DIR=    # your training dir
cd $TRAINING_DIR

# Data and model
mkdir $SAVE_PATH

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

export LAUNCHER="torchrun \
    $DISTRIBUTED_ARGS \
    "

full_options="  --output_dir $SAVE_PATH \
    --bge_model_name_or_path BAAI/bge-base-en-v1.5 \
    --visual_model_name_or_path EVA02-CLIP-B-16 \
    --dataloader_num_workers 1 \
    --train_data $DATA_PATH \
    --train_data_image $IMAGE_PATH \
    --train_group_size $GROUP_SIZE \
    --learning_rate $LR \
    --fp16 \
    --per_device_train_batch_size $BSZ_PERGPU \
    --dataloader_drop_last True \
    --normlized True \
    --temperature 0.02 \
    --logging_steps 10 \
    --num_train_epochs $EPOCH \
    --negatives_cross_device \
    --train_text_tower False \
    --train_vision_tower True \
    --resume_path $RESUME_PATH \
    --save_steps $SAVE_STEPS \
    --deepspeed ./EVA-CLIP/rei/training/deepspeed_config.json \
    --gradient_checkpointing \
    "

run_cmd="$LAUNCHER -m finetune.run_stage2_fusion ${full_options}"
echo ${run_cmd}
eval ${run_cmd} 2>&1 | tee $SAVE_PATH/output$NODE_RANK.log

set +x
```
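One thing worth checking before launching (my own hypothetical sanity check, not part of the original script): DeepSpeed requires its `train_batch_size` to equal `train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size`, so if the published JSON config and the batch-size values in the bash file disagree, startup fails with a consistency error.

```shell
#!/bin/bash
# Hypothetical check: compute the train_batch_size DeepSpeed will expect.
# Values mirror the 4-GPU setup from the question; adjust to your own config.
BSZ_PERGPU=80    # --per_device_train_batch_size
GRAD_ACCUM=1     # gradient_accumulation_steps (assumed 1 here)
WORLD_SIZE=4     # total number of GPUs
EXPECTED=$((BSZ_PERGPU * GRAD_ACCUM * WORLD_SIZE))
echo "deepspeed train_batch_size should be: $EXPECTED"
# prints: deepspeed train_batch_size should be: 320
```

If the JSON sets a different `train_batch_size` (or hard-codes values instead of `"auto"`), that mismatch alone can explain a discrepancy with the published bash file.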
What did you actually use deepspeed and gradient_checkpointing for here?
The deepspeed config published in the FlagEmbedding repo and the batch-size settings in the corresponding bash file differ slightly, and with gradient_checkpointing enabled I get an error saying a parameter is updated twice.
Could you please share the exact bash file you actually used with this code?
Thank you for your time.