After a month's relentless efforts, today we are thrilled to release EraX-VL-7B-V1!
NOTA BENE: EraX-VL-7B-V1 is NOT a typical OCR-only tool like Tesseract; it is a multimodal LLM-based model. To use it effectively, you may need to tune your prompt carefully for each task.
EraX-VL-7B-V1 is the latest version of the vision-language models in the EraX model family.
Below is the evaluation benchmark of global open-source and proprietary Multimodal Models on the MTVQA Vietnamese test set conducted by VinBigdata. We plan to conduct more detailed and diverse evaluations in the near future.
Below, we provide simple examples showing how to use EraX-VL-7B-V1 with 🤗 Transformers.
The code for EraX-VL-7B-V1 is included in the latest Hugging Face Transformers, and we advise you to build it from source at the commit pinned below.
Install the necessary packages:
python -m pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830 accelerate
python -m pip install qwen-vl-utils
pip install flash-attn --no-build-isolation
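The loading code that follows passes `attn_implementation="eager"` and suggests `"flash_attention_2"` for Ampere GPUs. A minimal sketch of how that choice could be automated is shown here. The helper names `pick_attn_implementation` and `detect_attn_implementation` are hypothetical and not part of any library, and the rule (compute capability 8.0 or higher plus an importable `flash_attn` package) is an assumption about when FlashAttention-2 is usable.

```python
import importlib.util


def pick_attn_implementation(cuda_available, capability, has_flash_attn):
    """Return "flash_attention_2" only when it is likely to work, else "eager".

    Hypothetical helper: `capability` is a (major, minor) tuple such as (8, 0).
    """
    # Assumption: FlashAttention-2 needs an Ampere-class GPU (8.x) or newer.
    if cuda_available and has_flash_attn and capability[0] >= 8:
        return "flash_attention_2"
    return "eager"


def detect_attn_implementation():
    """Inspect the local environment; torch is imported lazily on purpose."""
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    try:
        import torch
    except ImportError:
        return pick_attn_implementation(False, (0, 0), has_flash_attn)
    if not torch.cuda.is_available():
        return pick_attn_implementation(False, (0, 0), has_flash_attn)
    # get_device_capability returns the (major, minor) compute capability.
    capability = torch.cuda.get_device_capability(0)
    return pick_attn_implementation(True, capability, has_flash_attn)
```

The value returned by `detect_attn_implementation()` could then be passed as `attn_implementation` when calling `from_pretrained`. If in doubt, `"eager"` remains the safe choice.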
import base64

import torch
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
model_path = "erax-ai/EraX-VL-7B-V1"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # replace with "flash_attention_2" if your GPU is Ampere or newer
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Default processor, without an explicit pixel budget:
# processor = AutoProcessor.from_pretrained(model_path)
# Bound the image resolution passed to the vision encoder.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
image_path = "image.jpg"
# Encode the image as a base64 data URI.
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
decoded_image_text = encoded_image.decode("utf-8")
base64_data = f"data:image;base64,{decoded_image_text}"
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": base64_data,
            },
            {
                "type": "text",
                # Vietnamese prompt: "Describe the content of this image in JSON format."
                "text": "Diễn tả nội dung bức ảnh này bằng định dạng json.",
            },
        ],
    }
]
# Prepare prompt (tokenize=False returns the formatted prompt string, not token ids)
tokenized_text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[tokenized_text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
# Generation configs
# With top_k = 1, sampling always picks the most likely token, so decoding is effectively greedy.
generation_config = model.generation_config
generation_config.do_sample = True
generation_config.temperature = 0.2
generation_config.top_k = 1
generation_config.top_p = 0.001
generation_config.max_new_tokens = 2048
generation_config.repetition_penalty = 1.1
# Inference
generated_ids = model.generate(**inputs, generation_config=generation_config)
# Strip the prompt tokens so that only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
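The `min_pixels` and `max_pixels` values above are written as multiples of `28 * 28`, which suggests a budget of roughly 256 to 1280 visual tokens per image if each token covers a 28 x 28 pixel area. The sketch below estimates how an image of a given size would be resized under that assumption. The functions `estimate_resize` and `estimate_visual_tokens` are hypothetical approximations for illustration; the actual resizing is done inside the processor and `qwen_vl_utils`, and may differ in detail.

```python
import math

PATCH = 28  # assumed pixels per side covered by one visual token


def estimate_resize(height, width, min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28):
    """Approximate the resized (height, width) under the pixel budget."""
    # Snap each side to the nearest multiple of PATCH (at least one patch).
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return h, w


def estimate_visual_tokens(height, width, **budget):
    """Approximate number of visual tokens for an image of the given size."""
    h, w = estimate_resize(height, width, **budget)
    return (h // PATCH) * (w // PATCH)
```

For example, `estimate_visual_tokens(2000, 3000)` stays within the 1280-token budget, while a small 100 x 100 crop is scaled up to at least 256 tokens. Raising `max_pixels` may help with dense documents at the cost of more GPU memory.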
Install the erax-vl-7b-v1 package:
pip install erax-vl-7b-v1==0.1.0
You can then use the library for image extraction tasks as follows:
import os
from erax_vl_7b_v1.utils import (
    process_lr,
    get_json,
    openBase64_Image,
    add_img_content,
    add_pdf_content,
    add_pdf_content_json,
)
from erax_vl_7b_v1.erax_api_lib import (
    API_Image_OCR_EraX_VL_7B_vLLM,
    API_PDF_OCR_EraX_VL_7B_vLLM,
    API_Chat_OCR_EraX_VL_7B_vLLM,
    API_Multiple_Images_OCR_EraX_VL_7B_vLLM,
    API_PDF_Full_OCR_EraX_VL_7B_vLLM,
)
ERAX_URL_ID = "EraX's URL ID"
API_KEY = "EraX's API Key"
image_path = "image.jpg"
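# Vietnamese prompt: "Extract all details of these images, following the order of the content, in JSON format and without any further commentary."
prompt = """Hãy trích xuất toàn bộ chi tiết của các bức ảnh này theo đúng thứ tự của nội dung bằng định dạng json và không bình luận gì thêm."""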
result, history = API_Image_OCR_EraX_VL_7B_vLLM(
    image_paths=image_path,
    is_base64=False,
    prompt=prompt,
    erax_url_id=ERAX_URL_ID,
    API_key=API_KEY,
)
# Parse the JSON string returned by the API into a JSON object.
json_result = get_json(result)
print(json_result)
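When running the model locally with Transformers instead of the API, the reply is a plain string that may wrap the JSON in a Markdown fence or surround it with extra text. The following is a minimal sketch of a best-effort parser for such replies. The helper `parse_json_output` is hypothetical and is not the implementation of `get_json`; it only assumes that the reply contains at most one JSON object or array.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built here to avoid writing it literally


def parse_json_output(text):
    """Best-effort conversion of a model reply into a Python object, or None."""
    # Keep only the body of a fenced block (optionally tagged "json") if present.
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, text, flags=re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} or [...] span in the reply.
    for opener, closer in (("{", "}"), ("[", "]")):
        start, end = candidate.find(opener), candidate.rfind(closer)
        if start != -1 and end > start:
            try:
                return json.loads(candidate[start:end + 1])
            except json.JSONDecodeError:
                continue
    return None
```

With the Transformers example above, this could be called as `parse_json_output(output_text[0])`. A `None` result means no valid JSON was found, in which case adjusting the prompt is usually the first thing to try.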
If you find our project useful, we would appreciate it if you could star our repository and cite our work as follows:
@article{EraX-VL-7B-V1,
  title={EraX-VL-7B-V1: A Highly Efficient Multimodal LLM for Vietnamese, especially for medical forms and bills},
  author={Nguyễn Anh Nguyên and Nguyễn Hồ Nam (BCG) and Hoàng Tiến Dũng and Phạm Đình Thục and Phạm Huỳnh Nhật},
  organization={EraX},
  year={2024},
  url={https://huggingface.co/erax-ai/EraX-VL-7B-V1}
}
EraX-VL-7B-V1 is built with reference to the code of the following projects: Qwen2-VL, InternVL, and Khang Đoàn (5CD-AI). Thanks for their awesome work!