nickynicolson opened 2 months ago
Please let me know if I am misinterpreting/misunderstanding your comment.
I believe I can simplify the OCR method a bit. If the VLM is available through Hugging Face and can be implemented with the standard transformers workflow, then I can refactor the implementation so that the user can just provide a model id. That would work for Phi-3-vision and Qwen2-VL.
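A minimal sketch of what that model-id-driven refactor could look like, assuming the standard transformers `AutoProcessor`/`AutoModelForCausalLM` workflow. The class and method names are hypothetical, not VoucherVision's actual API, and each model's own chat/prompt template would still have to be applied on top of this:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

class HFVisionOCR:
    """Hypothetical generic OCR engine driven only by a Hugging Face model id."""

    def __init__(self, model_id: str):
        # trust_remote_code is required for models like Phi-3-vision; note that
        # the exact Auto class can vary by model (e.g. Qwen2-VL registers its
        # own conditional-generation class), which the refactor would need to hide.
        self.processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id, trust_remote_code=True, device_map="auto"
        )

    def transcribe(self, image_path: str, prompt: str) -> str:
        image = Image.open(image_path)
        inputs = self.processor(
            text=prompt, images=image, return_tensors="pt"
        ).to(self.model.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=1024)
        # Drop the prompt tokens so only the newly generated text is decoded
        new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
        return self.processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```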
Since 4o-mini is accessed through an API, its implementation is a bit different. The 4o-mini implementation does currently work interchangeably with 4o (I just didn't broadcast that, because 4o is extremely expensive for high-res images).
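That interchangeability follows from the model name being just a parameter in the API call. A hedged sketch (the function name and prompt handling are illustrative, assuming the official `openai` Python client v1 and an `OPENAI_API_KEY` in the environment):

```python
import base64
from openai import OpenAI

def openai_ocr(image_path: str, prompt: str, model: str = "gpt-4o-mini") -> str:
    """Hypothetical API-backed OCR engine; swapping model="gpt-4o" is the
    only change needed to use the more expensive model."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```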
Florence-2 also requires a unique implementation.
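Florence-2 is steered by task tokens rather than free-form chat prompts, and ships its own post-processing step, which is why it can't share the generic workflow above. A sketch following the pattern on the model card (the input path and `max_new_tokens` are assumptions):

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("specimen_label.jpg")  # hypothetical input image
# Florence-2 uses task tokens such as "<OCR>" instead of a natural-language prompt
inputs = processor(text="<OCR>", images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# Florence-2's processor parses the raw task output into structured results
result = processor.post_process_generation(
    raw, task="<OCR>", image_size=(image.width, image.height)
)
print(result["<OCR>"])
```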
In VoucherVision.yaml, each of these models/engines can be specified as an item in a list. The list determines which OCR engines are used; if more than one is provided, their outputs are stacked in the LLM prompt.
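A hypothetical illustration of that list (the key and engine names are assumptions, not the actual VoucherVision.yaml schema):

```yaml
# Hypothetical: key and engine names are illustrative only
OCR_option:
  - GPT-4o-mini   # API-backed engine
  - Florence-2    # both outputs are stacked in the LLM prompt
```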
I will do the following:
I can see that there are multiple issues of the form "add X as a new OCR engine": #17, #18, #19, #36, ... Would it therefore be sensible to document the steps and/or rearchitect such that these could be added via configuration?
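One way "add an engine via configuration" could work, purely as a sketch of the idea rather than anything VoucherVision implements today, is a small registry that maps the engine names listed in VoucherVision.yaml to their implementations, so a new Hugging Face VLM would need only a config entry:

```python
from typing import Callable, Dict

# Hypothetical registry: maps an engine name (as it might appear in the
# VoucherVision.yaml list) to a factory that builds a callable OCR engine.
OCR_ENGINES: Dict[str, Callable[..., Callable[[str], str]]] = {}

def register_engine(name: str):
    """Decorator that adds an engine factory to the registry."""
    def wrapper(factory):
        OCR_ENGINES[name] = factory
        return factory
    return wrapper

@register_engine("hf-vlm")
def make_hf_engine(model_id: str):
    """Would wrap the generic transformers workflow sketched earlier."""
    def run(image_path: str) -> str:
        # Placeholder body; a real engine would load and run the model here
        return f"[OCR of {image_path} via {model_id}]"
    return run

def build_engines(config: list) -> list:
    """Instantiate every engine named in the config list, in order."""
    return [
        OCR_ENGINES[item["engine"]](**item.get("params", {}))
        for item in config
    ]
```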