microsoft/Phi-3.5-vision-instruct performs too poorly in experiments to be included in VoucherVision right now. Will revisit later.

Experiments with microsoft/Phi-3.5-vision-instruct showed that it is difficult to engineer a prompt that returns only the OCR text without additional commentary. Instead of plain text, we usually get something like:
"The image contains several pieces of text, which appear to be from a scientific document or label related to botany or herbology. The text includes a collection number (AM5N1), a date (24 November 1994), and a description that mentions 'Pitts arbor and glossy fruits, feathery dents.' The common name of the plant is listed as 'Ammannia bambusoides' (Bambus Forest), and the family is identified as 'Ochnaceae.' The determination is given as 'Campylospermum reticulatum (P. Beau) Farron.' There is also a national herbarium number (NHB 1942), a herbarium number (682 074), and a herbarium location (HERB. HORTI BOT. NAT. BELG. (BR)). The document is dated 03. XII 2014 and has a barcode with the number 682 074. There is a handwritten note that says 'Budotowag' and a signature that reads 'W.R. 1997.' The image also includes a logo of the New York Zoological Society of New York and the address of the Zoo, Zaire, Zaire, Zaire."
This verbose output is much more challenging for the downstream LLM to parse. All-in-one processing (OCR plus structured JSON extraction in a single pass) also does not work with such a small (~4B-parameter) model. The all-in-one approach does work for simpler scenarios where the JSON fields are more obvious on the label, but for our transcription needs it is not reliable enough.
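For reference, a minimal sketch of the kind of OCR-only prompt we experimented with, using the standard Hugging Face transformers loading pattern from the model card. The prompt wording, image filename, and generation settings are illustrative assumptions, not the exact configuration used in the experiments:

```python
# Minimal sketch: asking Phi-3.5-vision-instruct for a verbatim transcription.
# Prompt wording and generation settings are illustrative, not the exact
# configuration used in the VoucherVision experiments.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="eager",  # or "flash_attention_2" if available
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, num_crops=4)

image = Image.open("herbarium_label.jpg")  # hypothetical example image

# Phi-3 vision models expect <|image_1|>-style placeholders in the chat prompt.
messages = [{
    "role": "user",
    "content": (
        "<|image_1|>\n"
        "Transcribe all text visible in this image verbatim. "
        "Return only the transcribed text, with no commentary or description."
    ),
}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

generate_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=False,  # greedy decoding for reproducible transcription
    eos_token_id=processor.tokenizer.eos_token_id,
)
# Strip the prompt tokens, keeping only the generated response.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```

Even with explicit "return only the text" instructions like the one above, the model tends to wrap the transcription in descriptive prose like the example quoted earlier, which is what motivated dropping it for now.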