nickynicolson opened 2 months ago
Please let me know if I am misinterpreting/misunderstanding your comment.
I believe I can simplify the OCR method a bit. If the VLM is available through Hugging Face and can be implemented with the standard transformers workflow, then I can refactor the implementation so that the user can just provide a model id. That would work for Phi-3-vision and Qwen2-VL.
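A minimal sketch of what that model-id-driven refactor could look like, assuming the standard transformers `AutoProcessor`/`AutoModelForCausalLM` workflow. The class and method names are hypothetical, not VoucherVision's actual API, and each model's own chat/prompt template would still have to be applied on top of this:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

class HFVisionOCR:
    """Hypothetical generic OCR engine driven only by a Hugging Face model id."""

    def __init__(self, model_id: str):
        # trust_remote_code is required for models like Phi-3-vision; note that
        # the exact Auto class can vary by model (e.g. Qwen2-VL registers its
        # own conditional-generation class), which the refactor would need to hide.
        self.processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id, trust_remote_code=True, device_map="auto"
        )

    def transcribe(self, image_path: str, prompt: str) -> str:
        image = Image.open(image_path)
        inputs = self.processor(
            text=prompt, images=image, return_tensors="pt"
        ).to(self.model.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=1024)
        # Drop the prompt tokens so only the newly generated text is decoded
        new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
        return self.processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```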
Since 4o-mini is accessed through an API, its implementation is a bit different. The 4o-mini implementation does currently work interchangeably with 4o (I just didn't broadcast that, because 4o is extremely expensive for high-res images).
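That interchangeability follows from the model name being just a parameter in the API call. A hedged sketch (the function name and prompt handling are illustrative, assuming the official `openai` Python client v1 and an `OPENAI_API_KEY` in the environment):

```python
import base64
from openai import OpenAI

def openai_ocr(image_path: str, prompt: str, model: str = "gpt-4o-mini") -> str:
    """Hypothetical API-backed OCR engine; swapping model="gpt-4o" is the
    only change needed to use the more expensive model."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```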
Florence-2 also requires a unique implementation.
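Florence-2 is steered by task tokens rather than free-form chat prompts, and ships its own post-processing step, which is why it can't share the generic workflow above. A sketch following the pattern on the model card (the input path and `max_new_tokens` are assumptions):

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("specimen_label.jpg")  # hypothetical input image
# Florence-2 uses task tokens such as "<OCR>" instead of a natural-language prompt
inputs = processor(text="<OCR>", images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# Florence-2's processor parses the raw task output into structured results
result = processor.post_process_generation(
    raw, task="<OCR>", image_size=(image.width, image.height)
)
print(result["<OCR>"])
```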
In VoucherVision.yaml, each of these models/engines can be specified as an item in a list. The list determines which OCR engines are used; if more than one is provided, their outputs are stacked in the LLM prompt.
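A hypothetical illustration of that list (the key and engine names are assumptions, not the actual VoucherVision.yaml schema):

```yaml
# Hypothetical: key and engine names are illustrative only
OCR_option:
  - GPT-4o-mini   # API-backed engine
  - Florence-2    # both outputs are stacked in the LLM prompt
```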
I will do the following:
I can see that there are multiple issues of the form "add X as a new OCR engine": #17, #18, #19, #36, ... Would it therefore be sensible to document the steps and/or rearchitect such that these could be added via configuration?
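One way "add an engine via configuration" could work, purely as a sketch of the idea rather than anything VoucherVision implements today, is a small registry that maps the engine names listed in VoucherVision.yaml to their implementations, so a new Hugging Face VLM would need only a config entry:

```python
from typing import Callable, Dict

# Hypothetical registry: maps an engine name (as it might appear in the
# VoucherVision.yaml list) to a factory that builds a callable OCR engine.
OCR_ENGINES: Dict[str, Callable[..., Callable[[str], str]]] = {}

def register_engine(name: str):
    """Decorator that adds an engine factory to the registry."""
    def wrapper(factory):
        OCR_ENGINES[name] = factory
        return factory
    return wrapper

@register_engine("hf-vlm")
def make_hf_engine(model_id: str):
    """Would wrap the generic transformers workflow sketched earlier."""
    def run(image_path: str) -> str:
        # Placeholder body; a real engine would load and run the model here
        return f"[OCR of {image_path} via {model_id}]"
    return run

def build_engines(config: list) -> list:
    """Instantiate every engine named in the config list, in order."""
    return [
        OCR_ENGINES[item["engine"]](**item.get("params", {}))
        for item in config
    ]
```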