NSTiwari / Fine-tune-IDEFICS-Vision-Language-Model

This repository demonstrates the data preparation and fine-tuning the IDEFICS Vision Language Model.
MIT License
17 stars 1 forks source link

Fine tune Idefics2-8B Vision Language Model

This repository demonstrates the data preparation and fine-tuning the Idefics2-8B Vision Language Model.

Vision Language Model

Vision Language Models are multimodal models that learn from images and text, generating text outputs from image and text inputs. They excel in zero-shot capabilities, generalization, and various tasks like image recognition, question answering, and document understanding.

Dataset

Inference

Question: What the location address of NSDA?

Answer: ['1128 SIXTEENTH ST., N. W., WASHINGTON, D. C. 20036', '1128 sixteenth st., N. W., washington, D. C. 20036']

References & Resources: