FoundationVision / Groma

[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
https://groma-mllm.github.io/
Apache License 2.0

Finetuning and dataset formatting guidelines #2

Closed hinsonan closed 1 month ago

hinsonan commented 2 months ago

Very cool work, and congrats on what you have accomplished. I wanted to know if you have plans to release a finetuning guide and guidelines on how to format datasets.

machuofan commented 2 months ago

Hi there, thanks for your interest in our work. Here are some tips you may follow to finetune the model on customized datasets:

  1. Format your data. There are various dataset templates under groma/data/datasets. For example, you can refer to refcoco_rec.py to format REC data, visual_genome.py for region captioning, llava.py for conversation, and so on. BTW, don't forget to register the new dataset in groma/data/build.py.
  2. Download the pretrained checkpoint groma-7b-pretrain.
  3. Configure groma/data/configs/vl_finetune.py and scripts/vl_finetune.sh, then run bash scripts/vl_finetune.sh {path_to_groma_7b_pretrain_ckpt} {output_dir}.
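
To make step 1 concrete, here is a minimal sketch of turning a custom referring-expression record into a conversation-style sample, loosely following the pattern of the templates under groma/data/datasets. All field names and the prompt/answer wording below are illustrative assumptions, not Groma's actual schema; check refcoco_rec.py for the real format.

```python
# Hypothetical sketch of REC-style data formatting.
# Field names ("image", "conversations", "from", "value") and the
# <box>...</box> answer format are assumptions for illustration only;
# follow groma/data/datasets/refcoco_rec.py for the actual template.

def format_rec_sample(image_path, phrase, bbox):
    """Turn one referring-expression record into a chat-style sample.

    bbox is assumed to be [x1, y1, x2, y2] in pixel coordinates.
    """
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": f"Locate: {phrase}"},
            {"from": "gpt", "value": f"<box>{bbox}</box>"},
        ],
    }

sample = format_rec_sample("coco/000000012345.jpg", "the red cup",
                           [10, 20, 110, 140])
print(sample["conversations"][0]["value"])  # -> Locate: the red cup
```

A converter like this would run over your raw annotations once, producing a list of samples that the dataset class can load; the new dataset then still needs to be registered in groma/data/build.py as noted above.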
hinsonan commented 2 months ago

Thank you for your response. Perhaps if I have some time, I can update the documentation and add a fine-tuning section. Someone else may be able to get to it sooner than me.