This repository demonstrates data preparation for, and fine-tuning of, the Idefics2-8B Vision Language Model.
Vision Language Models are multimodal models that learn from images and text, generating text outputs from image and text inputs. They excel in zero-shot capabilities, generalization, and various tasks like image recognition, question answering, and document understanding.
An example question–answer pair from the dataset:

Question: What is the location address of NSDA?
Answer: ['1128 SIXTEENTH ST., N. W., WASHINGTON, D. C. 20036', '1128 sixteenth st., N. W., washington, D. C. 20036']
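As a sketch of the data-preparation step, a question–answer pair like the one above can be converted into the chat-style message list that the Idefics2 processor's `apply_chat_template` consumes (an image placeholder plus the question from the user, and one reference answer from the assistant). `build_sample` is a hypothetical helper name, not part of this repository:

```python
def build_sample(question: str, answers: list[str]) -> list[dict]:
    """Format one VQA record into Idefics2 chat-template messages.

    Datasets such as DocVQA list several acceptable answer variants
    per question; here we train on the first one (an assumption).
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},                    # placeholder; the actual image is passed to the processor
                {"type": "text", "text": question},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": answers[0]}],
        },
    ]


sample = build_sample(
    "What is the location address of NSDA?",
    [
        "1128 SIXTEENTH ST., N. W., WASHINGTON, D. C. 20036",
        "1128 sixteenth st., N. W., washington, D. C. 20036",
    ],
)
```

The resulting `sample` can then be rendered to a prompt string with the processor's `apply_chat_template` and tokenized together with the page image.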