TRI-ML / prismatic-vlms

A flexible and efficient codebase for training visually-conditioned language models (VLMs)
MIT License

Reproduction issue with DINOv2 + SigLIP 384px (Naive Resize) #32

Closed · tayton42 closed this issue 1 month ago

tayton42 commented 1 month ago

Hello! When training with your code, I found that the RefCOCO metrics come out very low, even when reproducing your DINOv2 + SigLIP 384px (Naive Resize) configuration. For finetuning I used Vicuna v1.5 7B, training for two epochs on llava_v1_5_mix665k.json, and got the following metrics:

=> RefCOCO Accuracy (Official): 0.566
=> RefCOCO+ Accuracy (Official): 0.499
=> RefCOCOg Accuracy (Official): 0.521

The metrics you report are 73.86, 67.29, and 67.85. Could you share your training configuration? Do you finetune on llava_v1_5_lvis4v_lrv_mix1231k? If so, would those two extra datasets cause such a big difference?

siddk commented 1 month ago

Couple of questions:

  • Can you point to the exact config you’re using to train (in the conf/ registry)? For the numbers you cite, we’re actually only running one epoch by default.
  • Can you let me know what your training setup is (# of GPUs, batch size, gradient accumulation if any)? We’ve noticed that training with smaller batches really hurts performance.
  • Finally, can you dump the versions of PyTorch, Transformers, and Tokenizers that you’re using?

Thanks again for using our code — hopefully we can resolve this fairly quickly!

tayton42 commented 1 month ago

> Couple of questions:
>
> • Can you point to the exact config you’re using to train (in the conf/ registry)? For the numbers you cite, we’re actually only running one epoch by default.
> • Can you let me know what your training setup is (# of GPUs, batch size, gradient accumulation if any)? We’ve noticed that training with smaller batches really hurts performance.
> • Finally, can you dump the versions of PyTorch, Transformers, and Tokenizers that you’re using?
>
> Thanks again for using our code — hopefully we can resolve this fairly quickly!

Thank you for your help! Actually, I just used the training script configuration directly; below are my training scripts.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --standalone --nnodes 1 --nproc-per-node 8 scripts/pretrain.py \
  --model.type "one-stage+7b" \
  --model.model_id "dinov2siglip-patch14-rep" \
  --model.arch_specifier "fused-gelu-mlp" \
  --model.vision_backbone_id "dinosiglip-vit-patch14-384px" \
  --model.image_resize_strategy "resize-naive" \
  --model.llm_backbone_id "vicuna-v15-7b" \
  --model.enable_mixed_precision_training True \
  --model.align_epochs 1 \
  --model.finetune_epochs 1

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --standalone --nnodes 1 --nproc-per-node 8 scripts/pretrain.py \
  --stage "finetune" \
  --pretrained_checkpoint "/xxx/prismatic-vlms/runs/dinov2siglip-patch14-rep+stage-align+x7/checkpoints/latest-checkpoint.pt" \
  --model.finetune_per_device_batch_size 8 \
  --model.type "one-stage+7b" \
  --model.model_id "dinov2siglip-patch14-rep" \
  --model.arch_specifier "fused-gelu-mlp" \
  --model.vision_backbone_id "dinosiglip-vit-patch14-384px" \
  --model.image_resize_strategy "resize-naive" \
  --model.llm_backbone_id "vicuna-v15-7b" \
  --model.enable_mixed_precision_training True \
  --model.align_epochs 1 \
  --model.finetune_epochs 2 

And this is the config file I generated.

dataset:
  align_stage_components:
  - download/llava-laion-cc-sbu-558k/chat.json
  - download/llava-laion-cc-sbu-558k
  dataset_id: llava-v15
  dataset_root_dir: data
  finetune_stage_components:
  - download/llava-v1.5-instruct/llava_v1_5_mix665k.json
  - download/llava-v1.5-instruct
  type: llava-v15
hf_token: /opt/cv/tianyutong/prismatic-vlms/hf_token
model:
  align_epochs: 1
  align_global_batch_size: 256
  align_learning_rate: 0.001
  align_lr_scheduler_type: linear-warmup+cosine-decay
  align_max_grad_norm: 1.0
  align_max_steps: null
  align_per_device_batch_size: 16
  align_train_strategy: fsdp-shard-grad-op
  align_warmup_ratio: 0.03
  align_weight_decay: 0.0
  arch_specifier: fused-gelu-mlp
  enable_gradient_checkpointing: true
  enable_mixed_precision_training: true
  finetune_epochs: 2
  finetune_global_batch_size: 128
  finetune_learning_rate: 2.0e-05
  finetune_lr_scheduler_type: linear-warmup+cosine-decay
  finetune_max_grad_norm: 1.0
  finetune_max_steps: null
  finetune_per_device_batch_size: 8
  finetune_train_strategy: fsdp-full-shard
  finetune_warmup_ratio: 0.03
  finetune_weight_decay: 0.1
  image_resize_strategy: resize-naive
  llm_backbone_id: vicuna-v15-7b
  llm_max_length: 2048
  model_id: dinov2siglip-patch14-rep
  reduce_in_full_precision: false
  type: one-stage+7b
  vision_backbone_id: dinosiglip-vit-patch14-384px
pretrained_checkpoint: /opt/cv/tianyutong/prismatic-vlms/runstemp/dinov2siglip-patch14-rep+stage-align+x7/checkpoints/latest-checkpoint.pt
run_id: dinov2siglip-patch14-rep+stage-finetune+x7
run_root_dir: runstemp
seed: 7
stage: finetune
trackers:
- jsonl
- wandb
wandb_entity: tayton
wandb_project: onyx-vlms

My environment is torch==2.1.2, transformers==4.34.1, and tokenizers==0.14.1, and I trained on 8 A800 cards. In the final metrics, only the localization benchmarks and VizWiz are much lower than those reported in your paper; the other metrics are normal. Could this be caused by the batch size? I set finetune_per_device_batch_size to 8.

siddk commented 1 month ago

Oh - training is super sensitive to the choice of batch size; we found in early experiments that a per-device batch size of 16 is the minimum necessary to get stable (and good) performance. We never really dug into why, beyond speculating that a batch size of 64 might just be too small... but that could definitely explain the discrepancy here.

tayton42 commented 1 month ago

> per-device batch size

Even though I set the per-device batch size to 8, the global batch size is still 128. Would this still cause such a difference?

tayton42 commented 1 month ago

> Oh - training is super sensitive to the choice of batch size; we found in early experiments that a per-device batch size of 16 is the minimum necessary to get stable (and good) performance. We never really dug into why, beyond speculating that a batch size of 64 might just be too small... but that could definitely explain the discrepancy here.

I also noticed that your experiment is based on training with Llama-2, while mine is based on vicuna-v15-7b. Which of these two LLMs would be better?

sahilqure commented 1 month ago

@siddk You should provide all the training configs properly.

sahilqure commented 1 month ago

@siddk-tri A difference this large won't come from batch size alone. I've done multiple experiments on LLaVA with different batch sizes, and that hasn't been the case.

siddk commented 1 month ago

@tayton42 - so even though the global batch size is set to 128, you'll be doing 2 steps of gradient accumulation; under FSDP mixed precision, we've noticed that this leads to degraded performance, as I mentioned (I do think it's a difference due to FSDP vs. the DeepSpeed engine that LLaVA uses for training).
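
To make that arithmetic concrete, here is a small illustrative sketch (the variable names are just for exposition; the numbers come from the config and hardware described above):

```python
# Illustrative only: how the gradient-accumulation steps fall out of the batch settings above.
global_batch_size = 128     # finetune_global_batch_size in the config
per_device_batch_size = 8   # finetune_per_device_batch_size in the config
num_gpus = 8                # one node with 8x A800

per_step_batch = per_device_batch_size * num_gpus   # 8 * 8 = 64 samples per forward/backward
accum_steps = global_batch_size // per_step_batch   # 128 // 64 = 2 accumulation steps

# With per_device_batch_size = 16, 16 * 8 = 128, so the global batch is
# reached in a single step with no gradient accumulation.
print(per_step_batch, accum_steps)  # -> 64 2
```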

Our final "Prism" configs use Llama-2 as the backbone; for the experiment you're trying to reproduce with just DINO-SigLIP, we used the original Vicuña v1.5 model.

You can see the original config (and training configs for all models in general) here: https://github.com/TRI-ML/prismatic-vlms/blob/main/prismatic/conf/models.py#L239

zjysteven commented 1 month ago

@siddk Hi Sidd, thanks for putting up this great repo; it's super helpful. I have a follow-up question on this and would appreciate your thoughts.

Based on the discussion here, my impression is that DeepSpeed might be a better fit if we need gradient accumulation (e.g., I have 46GB GPUs, which can only fit a per-device batch size of 8), right? If I'd like to switch from FSDP to DeepSpeed, would you mind listing a few high-level things I would need to change within the structure of your code?

zjysteven commented 1 month ago

@tayton42 Hello there, just curious: have you managed to solve the issue here?

siddk commented 1 month ago

Hey @zjysteven; thanks so much for your interest in using our codebase. To add a DeepSpeed integration, the right places to poke around/change would be the implementation of the training strategies in [prismatic/training/strategies](https://github.com/TRI-ML/prismatic-vlms/tree/main/prismatic/training/strategies) -- I'd look at the fsdp strategy first, and adapt to DeepSpeed as needed.

You might also consider using HF Accelerate to rewrite the training logic so you don't have to deal with DeepSpeed directly yourself; in that case, you may also need to rewrite the core training loop in base_strategy.py.
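
For anyone going that route, here is a rough, hypothetical sketch of the general shape of an Accelerate + DeepSpeed loop; none of this is existing code in this repo, and build_model_optimizer_dataloader is a placeholder for however you construct those objects from the existing strategy/config code:

```python
# Hypothetical sketch, not code from prismatic-vlms: the general shape of a
# training loop driven by HF Accelerate with its DeepSpeed plugin.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=2)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=deepspeed_plugin)

# Placeholder: build these however the existing strategies / configs do.
model, optimizer, dataloader = build_model_optimizer_dataloader()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for batch in dataloader:
    loss = model(**batch).loss
    accelerator.backward(loss)  # the DeepSpeed engine handles scaling/accumulation
    optimizer.step()
    optimizer.zero_grad()
```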


Separately, I'm a bit swamped right now with the CoRL deadline next week. However, afterwards I'm planning on pushing some HF compatible versions of the Prismatic models (e.g., PrismaticForConditionalGeneration) that can be used with the HF Trainer out of the box! Will update this thread as soon as I do so.

zjysteven commented 1 month ago

Thank you for the detailed instructions! Yes, after some investigation I also believe accelerate + DeepSpeed is the direction to go.