Add support for Qwen2-VL as an OCR engine

Gene-Weaver / VoucherVision

Initiated by the University of Michigan Herbarium, VoucherVision harnesses the power of large language models (LLMs) to transform the transcription process of natural history specimen labels.

https://huggingface.co/spaces/phyloforfun/VoucherVision

GNU General Public License v3.0


Add support for Qwen2-VL as an OCR engine #36

Open nickynicolson opened 2 months ago

nickynicolson commented 2 months ago

Seems to have excellent performance on handwritten text: https://simonwillison.net/2024/Sep/4/

Gene-Weaver commented 2 months ago

Trying it now, and I agree that it is excellent. I have tried it for OCR alone (works great) and as an all-in-one prompt (works, but misses some very easy fields). Next I am going to try running OCR and then immediately feeding the OCR text back to it for parsing, to see if that helps. Even if it does not, this will be a great OCR model. It requires 22GB of VRAM for our image sizes, so it is not as small as Florence-2, but it does seem to give better results.

Gene-Weaver commented 2 months ago

Qwen/Qwen2-VL-7B-Instruct-AWQ will run on a 16GB card, but it seems better suited to OCR only; I had several failures when asking it to transcribe and parse into JSON in the same call.

Gene-Weaver commented 2 months ago

Tests today showed that having Qwen extract the text first and then sending the extracted text to be parsed is the most accurate approach, so 2 calls total per label. It also seems to work very well with non-English text.
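
For anyone following along, here is a rough sketch of the two-call flow described above: one call to transcribe the label, then a second call that receives only the transcribed text and returns JSON. This is not VoucherVision's actual code. The function names, the prompts, and the field names (`collector`, `date`, `locality`) are all hypothetical, and the two model calls are passed in as plain callables so the sketch does not assume any particular Qwen2-VL or LLM API. The thread does not say which model handles the parsing step, so that is left open too.

```python
import json
from typing import Callable, Dict, List, Tuple

# Hypothetical OCR-only prompt; the prompts used in the thread are not shown.
OCR_PROMPT = (
    "Transcribe all text on this specimen label exactly as written. "
    "Return plain text only."
)


def build_parse_prompt(ocr_text: str, fields: List[str]) -> str:
    """Build the prompt for the second call, which sees text only (no image)."""
    keys = ", ".join(fields)
    return (
        "Fill in a JSON object with exactly these keys: " + keys + ".\n"
        "Use an empty string for anything not present in the text.\n"
        "Label text:\n" + ocr_text
    )


def extract_json(reply: str, fields: List[str]) -> Dict[str, str]:
    """Pull the outermost {...} span from a model reply and keep only known fields.

    Missing fields, or a reply with no valid JSON, yield empty strings.
    """
    result = {name: "" for name in fields}
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return result
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return result
    if isinstance(data, dict):
        for name in fields:
            if data.get(name) is not None:
                result[name] = str(data[name])
    return result


def transcribe_then_parse(
    image_path: str,
    run_vision_model: Callable[[str, str], str],  # (image_path, prompt) -> text
    run_text_model: Callable[[str], str],         # (prompt) -> reply
    fields: List[str],
) -> Tuple[str, Dict[str, str]]:
    """Two calls per label: OCR first, then parse the OCR text into fields."""
    ocr_text = run_vision_model(image_path, OCR_PROMPT)            # call 1
    reply = run_text_model(build_parse_prompt(ocr_text, fields))   # call 2
    return ocr_text, extract_json(reply, fields)
```

Keeping the raw OCR text alongside the parsed fields makes it easy to check whether a bad field came from the transcription or from the parsing step.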