Gene-Weaver / VoucherVision

Initiated by the University of Michigan Herbarium, VoucherVision harnesses the power of large language models (LLMs) to transform the transcription process of natural history specimen labels.
https://huggingface.co/spaces/phyloforfun/VoucherVision
GNU General Public License v3.0

Document process for addition of new OCR engine / model #37

Open nickynicolson opened 2 months ago

nickynicolson commented 2 months ago

I can see that there are multiple issues of the form "add X as a new OCR engine":

Gene-Weaver commented 2 months ago

Please let me know if I am misinterpreting/misunderstanding your comment.

I believe I can simplify the OCR method a bit. If a VLM is available through Hugging Face and works with the basic transformers workflow, I can refactor the implementation so that the user only has to provide a model ID. That would work for Phi-3-vision and Qwen2-VL.

Since 4o-mini is accessed through an API, its implementation is a bit different. I think the 4o-mini implementation currently also works interchangeably with 4o (I just didn't advertise that because 4o is extremely expensive for high-resolution images).

Florence-2 also requires a unique implementation.
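
To make the three cases above concrete (a generic Hugging Face path driven by a model ID, an API-backed path, and a Florence-2 special case), here is a minimal dispatch sketch. It is not VoucherVision's actual code: the handler names, the SPECIAL_CASES table, and the stub return values are all hypothetical, and the model IDs are only illustrative strings. The stubs stand in for real OCR calls so the routing logic can be read on its own.

```python
from typing import Callable, Dict

# Hypothetical handlers: each stub stands in for a real OCR call and only
# reports which code path was taken.
def run_generic_hf(model_id: str, image_path: str) -> str:
    return f"[generic HF path: {model_id}] text from {image_path}"

def run_api_model(model_id: str, image_path: str) -> str:
    return f"[API path: {model_id}] text from {image_path}"

def run_florence2(model_id: str, image_path: str) -> str:
    return f"[Florence-2 path: {model_id}] text from {image_path}"

# Assumed table of engines that need their own implementation.
SPECIAL_CASES: Dict[str, Callable[[str, str], str]] = {
    "gpt-4o-mini": run_api_model,
    "gpt-4o": run_api_model,
    "microsoft/Florence-2-large": run_florence2,
}

def run_ocr(model_id: str, image_path: str) -> str:
    """Route to a special-case handler, else fall back to the generic path."""
    handler = SPECIAL_CASES.get(model_id, run_generic_hf)
    return handler(model_id, image_path)

if __name__ == "__main__":
    for mid in ("Qwen/Qwen2-VL-7B-Instruct", "gpt-4o-mini", "microsoft/Florence-2-large"):
        print(run_ocr(mid, "specimen.jpg"))
```

With this shape, adding a model that fits the generic workflow would need no code change at all, only a new model ID; only engines with unusual requirements would get an entry in the table.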

In VoucherVision.yaml, each of these models/engines can be specified as an item in a list. The list determines which OCR engines are used; if more than one is provided, their OCR outputs are stacked in the LLM prompt.
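
As a rough illustration of the stacking described above, the sketch below takes a list of engine names (as it might look after the YAML file is parsed) and concatenates each engine's output into one labelled block of text for the LLM prompt. The config key "ocr_engines", the engine names, the function build_ocr_section, and the label format are assumptions made for this example; the real keys in VoucherVision.yaml and the real prompt layout are not shown in this thread.

```python
from typing import Callable, Dict, List

def build_ocr_section(
    engines: List[str],
    ocr_fns: Dict[str, Callable[[str], str]],
    image_path: str,
) -> str:
    """Run each listed engine in order and stack the labelled outputs."""
    blocks = []
    for name in engines:
        text = ocr_fns[name](image_path)  # KeyError if the engine is unknown
        blocks.append(f"OCR from {name}:\n{text}")
    return "\n\n".join(blocks)

if __name__ == "__main__":
    # Hypothetical parsed config; the key and engine names are placeholders.
    config = {"ocr_engines": ["engine_a", "engine_b"]}

    # Stub engines standing in for real OCR calls.
    ocr_fns = {
        "engine_a": lambda path: "Herbarium of Example University",
        "engine_b": lambda path: "Herbarium of Exarnple University",
    }

    print(build_ocr_section(config["ocr_engines"], ocr_fns, "specimen.jpg"))
```

Keeping the engine label next to each block lets the LLM see where two OCR readings disagree, which is the main reason to stack more than one.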

I will do the following: