Finetune input prompts through prompt engineering

ipiriyan2002 commented 1 month ago

Intro:

It can be noted that several LLMs including the QWEN model struggle with understanding prompts like humans do and thus output structure is often unstructured. So it is important to give in a proper prompt to the model to return expected structured text.

This is true with the model (QWEN2_VL_7B) that is being used for text recognition in this project. It is a model that excels in reading text off images so a proper prompt with examples of the structure is needed.

Initial Prompt:

"Process this image in two steps:

Step 1: Turn the text in this picture into markdown. Indicate italics. Indent lines which discuss folder contents as unordered lists

Step 2: Convert the markdown text that you created in step 1 into JSON. Use the heading texts as keys, and the folder details as nested numbered lists

Your output should consist of the markdown text, then a fenced JSON code block"

==> The initial prompt was unsuccessful given no examples and it is multi-stepped (QWEN model in particular struggles with multi-step instructions according to manual)

New prompt draft 1:

"The image consists of a catalogue of plant collection names from the Lightfoot collection. It follows the hierarchy of Taxon name, species name and the collection inside the folder consisting of species name, where it was collected, who collected it (citation) and other meta data.

I want you to extract the information from the catalogue and prepare a structured json output. If a text does not have taxon name or species name either at the start or at the very end, please mark that down as extra_COUNTER.

Example:

{"CAPRIFOLIACEAE" : {"Linnaea borealis L.": [ {"folder": 1, "content": "Linnea borealis [JL]. i. Cites Linn. Sp. Pl. 631; Bauh. Pin. 93"} ] } }

Example (extra data at the start or end of image without taxon):

{"extra_data_1":{"content": "Folder 2. Campanula hederacea [G]. i. "Devon & Cornwal" [JL]"}, "CAPRIFOLIACEAE" : {"Linnaea borealis L.": [ {"folder": 1, "content": "Linnea borealis [JL]. i. Cites Linn. Sp. Pl. 631; Bauh. Pin. 93"} ] } }

Your output should be a structures json format as shown above in the examples."

==> This prompt seems to show an improvement to the output with consistent json format for input images. (Need further testing and finetuning)

Conclusion: Further prompting would reduce the time for post-processing and keep the output structured for easy information retrieval. Test out new prompts and post them here.

ipiriyan2002 commented 3 weeks ago

After further experimenting, setting context through system prompts helps in generating the desired JSON structure through the model. More finetuning is needed to get all data from the pages with correct structure where appropriate.

System prompts added can be found in lib/config.py

ipiriyan2002 commented 3 weeks ago

Added PromptLoader class now and prompts are stored as YAML files in the prompts directory. A default prompt file is available from which fields can be inherited from.

Both system and user prompts can be added to these YAML files.

KewBridge / LightfootCatalogue

Finetune input prompts through prompt engineering #6