This repository demonstrates data preparation for, and fine-tuning of, the Idefics2-8B Vision Language Model.
Vision Language Models are multimodal models that learn from images and text, generating text outputs from image and text inputs. They excel in zero-shot capabilities, generalization, and various tasks like image recognition, question answering, and document understanding.
An example question–answer pair from the dataset:

Question: What is the location address of NSDA?
Answer: ['1128 SIXTEENTH ST., N. W., WASHINGTON, D. C. 20036', '1128 sixteenth st., N. W., washington, D. C. 20036']
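As a sketch of the data-preparation step, a question–answer pair like the one above can be converted into the chat-style message list that the Idefics2 processor's `apply_chat_template` consumes (an image placeholder plus the question from the user, and one reference answer from the assistant). `build_sample` is a hypothetical helper name, not part of this repository:

```python
def build_sample(question: str, answers: list[str]) -> list[dict]:
    """Format one VQA record into Idefics2 chat-template messages.

    Datasets such as DocVQA list several acceptable answer variants
    per question; here we train on the first one (an assumption).
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},                    # placeholder; the actual image is passed to the processor
                {"type": "text", "text": question},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": answers[0]}],
        },
    ]


sample = build_sample(
    "What is the location address of NSDA?",
    [
        "1128 SIXTEENTH ST., N. W., WASHINGTON, D. C. 20036",
        "1128 sixteenth st., N. W., washington, D. C. 20036",
    ],
)
```

The resulting `sample` can then be rendered to a prompt string with the processor's `apply_chat_template` and tokenized together with the page image.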