TIGER-AI-Lab / VLM2Vec

This repo contains the code and data for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks"
https://tiger-ai-lab.github.io/VLM2Vec/
Apache License 2.0

Hidden size mismatch #6

Closed marcobellagente93 closed 2 weeks ago

marcobellagente93 commented 2 weeks ago

I re-trained the model as per the README by running:

torchrun --nproc_per_node=8 --master_port=22447 --max_restarts=0 train.py \
  --model_name microsoft/Phi-3.5-vision-instruct --bf16 --pooling last \
  --dataset_name TIGER-Lab/MMEB-train \
  --subset_name A-OKVQA CIRR DocVQA ImageNet-A ImageNet_1K MSCOCO MSCOCO_t2i OK-VQA VisDial Visual7W-pointing CIFAR_100 ChartQA FashionIQ ImageNet-R InfographicsVQA MSCOCO_i2t NIGHTS VOC2007 Visual7W WebQA \
  --num_sample_per_subset 50000 \
  --image_dir MMEB-train \
  --max_len 256 --num_crops 16 --output_dir outputs_bs_64_c_16 --logging_steps 10 \
  --lr_scheduler_type linear --learning_rate 2e-5 --max_steps 2000 \
  --warmup_steps 200 --save_steps 1000 --normalize True \
  --temperature 0.02 --per_device_train_batch_size 8 \
  --grad_cache True --gc_q_chunk_size 1 --gc_p_chunk_size 1

However, I noticed that the produced checkpoints have an incorrect "hidden_size": 4096 in their config.json. Manually correcting it solves the problem, and I could reproduce numbers similar to those reported in the paper. Still, I wonder if you have an idea of what might cause it.
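For anyone hitting the same issue before pulling the fix, the manual correction can be scripted. This is a minimal sketch, assuming Phi-3.5-vision-instruct's actual hidden size of 3072 and a hypothetical checkpoint path (adjust to your own output directory):

```python
import json

def fix_hidden_size(config_path, expected=3072):
    """Overwrite an incorrect hidden_size in a checkpoint's config.json.

    Checkpoints affected by this issue were written with hidden_size 4096;
    Phi-3.5-vision-instruct's hidden size is 3072. Returns True if the file
    was rewritten, False if it was already correct.
    """
    with open(config_path) as f:
        config = json.load(f)
    if config.get("hidden_size") == expected:
        return False
    config["hidden_size"] = expected
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return True

# Example (checkpoint path is hypothetical):
# fix_hidden_size("outputs_bs_64_c_16/checkpoint-2000/config.json")
```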

wenhuchen commented 2 weeks ago

We already fixed this issue in our latest commit. Can you pull again?

marcobellagente93 commented 2 weeks ago

Thanks, I was indeed on an old commit