bdzyubak / torch-control

A top-level repo for evaluating natively available models
MIT License

Build pipeline to extract structured text from images #30

Closed bdzyubak closed 1 month ago

bdzyubak commented 1 month ago

The goal of this project is to build a pipeline that extracts desired fields from an image, as practice for LLM fine-tuning and prompt engineering.

Approach: Use image OCR (optical character recognition) to extract unstructured text, then use an LLM to pull the desired fields out of that unstructured text.
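A minimal sketch of how the two stages could be chained. The stage signatures, the `run_pipeline` name, and the stand-in `fake_ocr` / `fake_extract` functions are all assumptions for illustration; they are not taken from the repo, and the real stages would call an OCR model and an LLM.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stage signatures; the real OCR and LLM calls are not defined in this issue.
OcrFn = Callable[[str], str]  # image path -> unstructured text
ExtractFn = Callable[[str, List[str]], Dict[str, Optional[str]]]  # text, fields -> values


def run_pipeline(
    image_path: str, fields: List[str], ocr: OcrFn, extract: ExtractFn
) -> Dict[str, Optional[str]]:
    """Run OCR, then field extraction, and return one value per requested field."""
    raw_text = ocr(image_path)
    found = extract(raw_text, fields)
    # Keep only the requested fields; anything the extractor did not return is None.
    return {name: found.get(name) for name in fields}


# Stand-in stages so the sketch runs without any model (illustrative only).
def fake_ocr(image_path: str) -> str:
    return "ACME MART\nTOTAL 12.50"


def fake_extract(text: str, fields: List[str]) -> Dict[str, Optional[str]]:
    lines = text.splitlines()
    out: Dict[str, Optional[str]] = {}
    if "store" in fields and lines:
        out["store"] = lines[0]
    for line in lines:
        if "total" in fields and line.upper().startswith("TOTAL"):
            out["total"] = line.split()[-1]
    return out


result = run_pipeline("receipt.jpg", ["store", "total", "date"], fake_ocr, fake_extract)
print(result)
```

Passing the stages in as callables keeps the OCR and LLM parts swappable, so each one can be tested or replaced independently.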

Dataset: Found a receipts dataset on Kaggle (https://www.kaggle.com/datasets/trainingdatapro/ocr-receipts-text-detection) which has the desired characteristics: 1) Some fields are nearly always present, e.g. store. 2) Such a field may still sometimes be absent, which needs to be handled by the pipeline. 3) There are many fields with varying frequency of occurrence, so the list of fields to look for can be varied to produce problems of different complexities.
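One way the absent-field case could be handled is to normalize the LLM response so every requested field is always present, with `None` for anything missing. This sketch assumes the LLM is prompted to answer with a JSON object; the `parse_fields` name and the list of "absent" markers are hypothetical.

```python
import json
from typing import Dict, List, Optional

# Strings treated as "field not present"; this list is an assumption.
ABSENT_MARKERS = {"", "n/a", "none", "null", "unknown"}


def parse_fields(response: str, fields: List[str]) -> Dict[str, Optional[str]]:
    """Map an LLM JSON response onto the requested fields, using None when absent."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}

    parsed: Dict[str, Optional[str]] = {}
    for name in fields:
        value = data.get(name)
        if value is None or str(value).strip().lower() in ABSENT_MARKERS:
            parsed[name] = None
        else:
            parsed[name] = str(value).strip()
    return parsed


print(parse_fields('{"store": "ACME", "date": "N/A"}', ["store", "date", "total"]))
```

Because the field list is an argument, the same function works for easier or harder variants of the problem by passing fewer or more field names.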

bdzyubak commented 1 month ago

Splitting the text extraction pipeline into separate issues:
1) Bounding box detection. TrOCR is a textline tool and requires the text lines to be extracted first. Whole images yield very poor predictions with few tokens.
2) Text inference on bounding boxes to extract all text in the image.
3) Text summarization to extract important information. https://github.com/bdzyubak/torch-control/issues/32
4) Ability to fine-tune text extraction (requires bounding box detection).
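A rough sketch of how steps 1 and 2 could hand off to each other: order the detected boxes in reading order, then run a textline recognizer on each one. The `(left, top, right, bottom)` box format, the `line_tol` heuristic, and the `recognize` callable are assumptions; in practice `recognize` would crop the image and call a model such as TrOCR.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels; format is an assumption


def reading_order(boxes: List[Box], line_tol: float = 0.5) -> List[Box]:
    """Sort boxes top-to-bottom, then left-to-right within each visual row."""
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        center = (box[1] + box[3]) / 2
        height = box[3] - box[1]
        if rows:
            last = rows[-1][-1]
            last_center = (last[1] + last[3]) / 2
            # Same row if the vertical centers are within a fraction of the box height.
            if abs(center - last_center) <= line_tol * height:
                rows[-1].append(box)
                continue
        rows.append([box])
    return [b for row in rows for b in sorted(row, key=lambda b: b[0])]


def read_image_text(boxes: List[Box], recognize: Callable[[Box], str]) -> str:
    """Recognize each box in reading order and join the results with newlines."""
    return "\n".join(recognize(b) for b in reading_order(boxes))


# Stand-in recognizer backed by a lookup table (illustrative only).
boxes = [(100, 10, 180, 30), (10, 12, 90, 32), (10, 50, 90, 70)]
texts = {boxes[0]: "MART", boxes[1]: "ACME", boxes[2]: "TOTAL 12.50"}
print(read_image_text(boxes, texts.__getitem__))
```

The row grouping is a simple heuristic and would likely need tuning for skewed or crumpled receipts.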

bdzyubak commented 1 month ago

Merging the branch for now, with the ability to run inference on textlines, assuming they are extracted correctly: projects/ComputerVision/ocr_receipts_sroie/trocr_inference.py

Work-in-progress training scripts are also available. Ideally, these would be backed up as tags rather than merged into the main branch, but they are standalone scripts and therefore low risk:
projects/ComputerVision/ocr_receipts_sroie
projects/ComputerVision/kaggle_ocr_receipts