Couple of questions:
- Can you point to the exact config you’re using to train (in the conf/ registry)? For the numbers you cite, we’re actually only running one epoch by default.
- Can you let me know what your training setup is (# of GPUs, batch size, gradient accumulation if any)? We've noticed that training with smaller batches really hurts performance.
- Finally, can you dump the versions of PyTorch, Transformers, and Tokenizers that you’re using?
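For example, a quick way to dump those versions from a Python shell (assuming all three packages are importable in your training environment):

```python
# Print the library versions relevant to this issue.
import tokenizers
import torch
import transformers

print(f"torch=={torch.__version__}")
print(f"transformers=={transformers.__version__}")
print(f"tokenizers=={tokenizers.__version__}")
```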
Thanks again for using our code — hopefully we can resolve this fairly quickly!
Thank you for your help! Actually, I directly used the training script configuration; below is my training script.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --standalone --nnodes 1 --nproc-per-node 8 scripts/pretrain.py \
--model.type "one-stage+7b" \
--model.model_id "dinov2siglip-patch14-rep" \
--model.arch_specifier "fused-gelu-mlp" \
--model.vision_backbone_id "dinosiglip-vit-patch14-384px" \
--model.image_resize_strategy "resize-naive" \
--model.llm_backbone_id "vicuna-v15-7b" \
--model.enable_mixed_precision_training True \
--model.align_epochs 1 \
--model.finetune_epochs 1
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --standalone --nnodes 1 --nproc-per-node 8 scripts/pretrain.py \
--stage "finetune" \
--pretrained_checkpoint "/xxx/prismatic-vlms/runs/dinov2siglip-patch14-rep+stage-align+x7/checkpoints/latest-checkpoint.pt" \
--model.finetune_per_device_batch_size 8 \
--model.type "one-stage+7b" \
--model.model_id "dinov2siglip-patch14-rep" \
--model.arch_specifier "fused-gelu-mlp" \
--model.vision_backbone_id "dinosiglip-vit-patch14-384px" \
--model.image_resize_strategy "resize-naive" \
--model.llm_backbone_id "vicuna-v15-7b" \
--model.enable_mixed_precision_training True \
--model.align_epochs 1 \
--model.finetune_epochs 2
And this is the config file I generated.
dataset:
  align_stage_components:
  - download/llava-laion-cc-sbu-558k/chat.json
  - download/llava-laion-cc-sbu-558k
  dataset_id: llava-v15
  dataset_root_dir: data
  finetune_stage_components:
  - download/llava-v1.5-instruct/llava_v1_5_mix665k.json
  - download/llava-v1.5-instruct
  type: llava-v15
hf_token: /opt/cv/tianyutong/prismatic-vlms/hf_token
model:
  align_epochs: 1
  align_global_batch_size: 256
  align_learning_rate: 0.001
  align_lr_scheduler_type: linear-warmup+cosine-decay
  align_max_grad_norm: 1.0
  align_max_steps: null
  align_per_device_batch_size: 16
  align_train_strategy: fsdp-shard-grad-op
  align_warmup_ratio: 0.03
  align_weight_decay: 0.0
  arch_specifier: fused-gelu-mlp
  enable_gradient_checkpointing: true
  enable_mixed_precision_training: true
  finetune_epochs: 2
  finetune_global_batch_size: 128
  finetune_learning_rate: 2.0e-05
  finetune_lr_scheduler_type: linear-warmup+cosine-decay
  finetune_max_grad_norm: 1.0
  finetune_max_steps: null
  finetune_per_device_batch_size: 8
  finetune_train_strategy: fsdp-full-shard
  finetune_warmup_ratio: 0.03
  finetune_weight_decay: 0.1
  image_resize_strategy: resize-naive
  llm_backbone_id: vicuna-v15-7b
  llm_max_length: 2048
  model_id: dinov2siglip-patch14-rep
  reduce_in_full_precision: false
  type: one-stage+7b
  vision_backbone_id: dinosiglip-vit-patch14-384px
pretrained_checkpoint: /opt/cv/tianyutong/prismatic-vlms/runstemp/dinov2siglip-patch14-rep+stage-align+x7/checkpoints/latest-checkpoint.pt
run_id: dinov2siglip-patch14-rep+stage-finetune+x7
run_root_dir: runstemp
seed: 7
stage: finetune
trackers:
- jsonl
- wandb
wandb_entity: tayton
wandb_project: onyx-vlms
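(As a side note, a quick way to sanity-check the batch-size fields in a generated YAML like the one above; the file path here is illustrative, not the actual run directory:)

```python
# Load the generated run config and print the fields discussed below.
import yaml

with open("config.yaml") as f:  # illustrative path
    cfg = yaml.safe_load(f)

print(cfg["model"]["finetune_global_batch_size"])      # 128
print(cfg["model"]["finetune_per_device_batch_size"])  # 8
```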
The versions I used are torch==2.1.2, transformers==4.34.1, and tokenizers==0.14.1, and I trained on 8 A800 cards.
The final metrics show that only the Localization Benchmarks and VizWiz numbers are much lower than those reported in your paper, while the other metrics are normal. Could this be caused by the batch size? I set finetune_per_device_batch_size to 8.
Oh - training is super sensitive to the choice of batch size; we found in early experiments that a per-device batch size of 16 is the minimum necessary to get stable (and good) performance. We never really dug into why beyond the hunch that a batch size of 64 might just be too small... but that could definitely explain the discrepancy here.
> per-device batch size
Even though I set the per-device batch size to 8, the global batch size is still 128. Would this still cause such a difference?
I also noticed that your experiments are based on training with Llama-2, while mine is trained with vicuna-v15-7b. Which of these two LLMs would be better?
@siddk You should provide all the training configs properly.
@siddk-tri This much of a difference shouldn't come from the batch size. I have done multiple experiments on LLaVA with different batch sizes; that's not the case.
@tayton42 - so even though the global batch size is set to 128, you'll be doing 2 steps of gradient accumulation; under FSDP mixed precision, we've noticed that this leads to degraded performance, as I mentioned (I do think it's a difference due to FSDP vs. the DeepSpeed engine that LLaVA uses for training).
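For concreteness, the accumulation factor falls out of the numbers in this thread (illustrative arithmetic only; the variable names are not the repo's):

```python
# Gradient accumulation implied by the finetune config above.
global_batch_size = 128    # finetune_global_batch_size
per_device_batch_size = 8  # finetune_per_device_batch_size
num_gpus = 8               # 8x A800

grad_accumulation_steps = global_batch_size // (per_device_batch_size * num_gpus)
print(grad_accumulation_steps)  # 128 // 64 = 2 accumulation steps per optimizer update
```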
Our final "Prism" configs use Llama-2 as the backbone; for the experiment you're trying to reproduce with just DINO-SigLIP, we used the original Vicuña v1.5 model.
You can see the original config (and training configs for all models in general) here: https://github.com/TRI-ML/prismatic-vlms/blob/main/prismatic/conf/models.py#L239
@siddk Hi Sidd, thanks for putting up this great repo; it's super helpful. I have a follow-up question on this and would appreciate your thoughts.
Based on the discussion here, my impression is that DeepSpeed might be a better fit if we need gradient accumulation (e.g., I have 46G GPUs which can only fit a per-device batch size of 8), right? If I'd like to switch from FSDP to DeepSpeed, would you mind listing a few high-level things I'd need to change within the structure of your code?
@tayton42 Hello there, just curious have you managed to solve the issue here?
Hey @zjysteven; thanks so much for your interest in using our codebase. To add a DeepSpeed integration, the right places to poke around/change would be the implementation of the training strategies in [prismatic/training/strategies](https://github.com/TRI-ML/prismatic-vlms/tree/main/prismatic/training/strategies) -- I'd look at the `fsdp` strategy first, and adapt to DeepSpeed as needed.
You might also consider using HF accelerate to rewrite the train logic so you don't have to deal with DeepSpeed yourself. In that case, you may also need to rewrite the core training loop in `base_strategy.py`.
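If it helps, here is a rough sketch of what an accelerate-driven loop could look like (illustrative only, not the repo's `base_strategy.py`; with accelerate, DeepSpeed itself is configured via `accelerate config` / a DeepSpeedPlugin rather than inside the loop):

```python
# Illustrative accelerate-based training loop skeleton (not the repo's code).
from accelerate import Accelerator


def train(model, optimizer, dataloader, num_epochs: int = 1, grad_accum_steps: int = 2):
    accelerator = Accelerator(gradient_accumulation_steps=grad_accum_steps)
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for _ in range(num_epochs):
        for batch in dataloader:
            with accelerator.accumulate(model):
                # Assumes an HF-style forward that returns an output with a `.loss` field.
                loss = model(**batch).loss
                accelerator.backward(loss)
                optimizer.step()
                optimizer.zero_grad()
```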
Separately, I'm a bit swamped right now with the CoRL deadline next week. However, afterwards I'm planning on pushing some HF-compatible versions of the Prismatic models (e.g., PrismaticForConditionalGeneration) that can be used with the HF Trainer out of the box! Will update this thread as soon as I do so.
Thank you for the detailed instructions! Yes, after some investigation I also believe using accelerate + DeepSpeed is the direction to go.
Hello! When I was using your code for training, I found that the metrics for RefCOCO are always very low, even when reproducing your DINOv2 + SigLIP 384px (Naive Resize) configuration. During finetuning, I used Vicuna v1.5 7B, training for two epochs on llava_v1_5_mix665k.json, but the metrics I got are as follows:
=> RefCOCO Accuracy (Official): 0.566
=> RefCOCO+ Accuracy (Official): 0.499
=> RefCOCOg Accuracy (Official): 0.521
I saw that the metrics you reported are 73.86, 67.29, and 67.85. I would like to know what your training configuration is. Did you finetune using llava_v1_5_lvis4v_lrv_mix1231k? If so, would these two extra datasets cause such a big difference?
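(For context on the numbers above: RefCOCO-style accuracy is typically scored as the fraction of predicted boxes whose IoU with the ground-truth box is at least 0.5. A minimal sketch, assuming [x1, y1, x2, y2] boxes; this is not the repo's evaluator:)

```python
# Acc@0.5-IoU, the usual RefCOCO referring-expression metric (illustrative sketch).
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def refcoco_accuracy(pred_boxes, gt_boxes, iou_threshold=0.5):
    hits = sum(box_iou(p, g) >= iou_threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)
```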