Open nickynicolson opened 2 months ago
Trying it now and I agree that it is excellent. I have tried it for just OCR (works great) and as an all-in-one prompt (works, but misses some very easy fields). Next I am going to try running OCR and then immediately feeding the OCR output back to it for parsing, to see if that helps. If not, it will still be a great OCR model. It requires 22GB of VRAM for our image sizes, so it is not as small as Florence-2, but it does seem to provide better results.
Qwen/Qwen2-VL-7B-Instruct-AWQ will run on a 16GB card, but it seems better suited to OCR only: I had several failures when asking it to parse the label into JSON in the same call.
Tests today showed that having Qwen extract the text first and then sending the extracted text back to be parsed is the most accurate approach, so 2 calls total per label. It also seems to work very well with non-English text.
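For anyone wanting to try the two-call approach described above, here is a rough, model-agnostic sketch. Everything named in it is an assumption for illustration: `generate` is a hypothetical callable you would wrap around whatever Qwen2-VL inference setup you use, the prompts are placeholders, and `FIELDS` is a made-up field list, since the actual label fields are not given in this thread.

```python
import json
from typing import Any, Callable, Optional

# Hypothetical field names; substitute the fields your labels actually carry.
FIELDS = ["collector", "date", "locality"]

# Placeholder prompts, not the ones used in the tests above.
OCR_PROMPT = "Transcribe all text on this label exactly as written."
PARSE_PROMPT = (
    "Extract these fields from the label text and reply with one JSON object only, "
    "using null for anything missing. Fields: {fields}\n\nLabel text:\n{text}"
)

# generate(prompt, image) -> model reply; image is None for a text-only call.
Generate = Callable[[str, Optional[Any]], str]


def extract_json(reply: str) -> dict:
    """Pull the outermost {...} object out of a reply, ignoring code fences or chatter."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    return json.loads(reply[start:end + 1])


def transcribe_then_parse(image: Any, generate: Generate) -> dict:
    # Call 1: OCR only, so the model is not also asked to structure anything.
    text = generate(OCR_PROMPT, image)
    # Call 2: text-only parse of the transcription from call 1.
    reply = generate(PARSE_PROMPT.format(fields=", ".join(FIELDS), text=text), None)
    record = extract_json(reply)
    # Keep the raw transcription next to the parsed fields for later checking.
    return {"text": text, "fields": {name: record.get(name) for name in FIELDS}}
```

Keeping the raw transcription in the result makes it easier to tell whether a missing field was an OCR miss or a parsing miss.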
Seems to have excellent performance on handwritten text: https://simonwillison.net/2024/Sep/4/