Gene-Weaver / VoucherVision

Initiated by the University of Michigan Herbarium, VoucherVision harnesses the power of large language models (LLMs) to transform the transcription process of natural history specimen labels.
https://huggingface.co/spaces/phyloforfun/VoucherVision
GNU General Public License v3.0

Add support for Phi-3-vision as an OCR engine #18

Closed. Gene-Weaver closed 2 months ago

Gene-Weaver commented 4 months ago
Gene-Weaver commented 2 months ago

TL;DR - microsoft/Phi-3.5-vision-instruct performs too poorly in experiments to be included in VoucherVision right now. Will revisit later.

Experiments with microsoft/Phi-3.5-vision-instruct showed that it's difficult to engineer a prompt that returns only the OCR text without additional commentary. Instead of a plain transcription, we usually get something like:

"The image contains several pieces of text, which appear to be from a scientific document or label related to botany or herbology. The text includes a collection number (AM5N1), a date (24 November 1994), and a description that mentions 'Pitts arbor and glossy fruits, feathery dents.' The common name of the plant is listed as 'Ammannia bambusoides' (Bambus Forest), and the family is identified as 'Ochnaceae.' The determination is given as 'Campylospermum reticulatum (P. Beau) Farron.' There is also a national herbarium number (NHB 1942), a herbarium number (682 074), and a herbarium location (HERB. HORTI BOT. NAT. BELG. (BR)). The document is dated 03. XII 2014 and has a barcode with the number 682 074. There is a handwritten note that says 'Budotowag' and a signature that reads 'W.R. 1997.' The image also includes a logo of the New York Zoological Society of New York and the address of the Zoo, Zaire, Zaire, Zaire."

This is much more challenging for the LLM to parse than a plain transcription. Because it is a small model (about 4.2B parameters), all-in-one processing is also not working. You can find examples of all-in-one processing working for simpler scenarios like this, where the JSON fields are more obvious. But for our transcription needs it's not working well enough.
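For anyone who wants to measure how often the model drifts into description rather than transcription, here is a minimal sketch of a phrase-based check. It is not part of VoucherVision: the pattern list, the function names, and the threshold are all assumptions for illustration, picked from the phrasing in the sample output above, and would need tuning against real responses.

```python
import re

# Hypothetical marker phrases, taken from the kind of wording seen in the
# sample response above. This list is an assumption, not a tested classifier.
COMMENTARY_PATTERNS = [
    r"\bthe image (also )?(contains|shows|includes)\b",
    r"\bthe (text|document|label) (includes|contains|is dated)\b",
    r"\bappears? to be\b",
    r"\bthere is (also )?an?\b",
    r"\bis (listed|identified|given) as\b",
]


def commentary_score(response: str) -> int:
    """Count how many distinct commentary patterns occur in the response."""
    lowered = response.lower()
    return sum(1 for pattern in COMMENTARY_PATTERNS if re.search(pattern, lowered))


def looks_like_commentary(response: str, threshold: int = 2) -> bool:
    """Flag a response as descriptive commentary rather than raw OCR text.

    The threshold of 2 is an arbitrary starting point (assumption).
    """
    return commentary_score(response) >= threshold


if __name__ == "__main__":
    described = (
        "The image contains several pieces of text, which appear to be from "
        "a scientific document. The family is identified as 'Ochnaceae.'"
    )
    raw = (
        "HERB. HORTI BOT. NAT. BELG. (BR)\n"
        "Campylospermum reticulatum (P. Beau) Farron\n"
        "24 November 1994"
    )
    print(looks_like_commentary(described))
    print(looks_like_commentary(raw))
```

A check like this only flags bad responses so they can be counted or retried; it does not recover the underlying label text from the commentary.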