Closed bdzyubak closed 1 month ago
Splitting the text extraction pipeline into separate issues:
1) Bounding box detection. TrOCR is a text-line model, so text lines must be extracted before recognition. Running it on whole images yields very poor predictions with only a few tokens.
2) Text inference on the bounding boxes to extract all text in the image.
3) Text summarization to extract the important information. https://github.com/bdzyubak/torch-control/issues/32
4) Ability to fine-tune text extraction (requires bounding box detection).
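The split above can be sketched as a small pipeline object with one pluggable stage per issue. This is only an illustration of how the stages might compose: `ReceiptPipeline`, `detect_boxes`, `recognize_line`, and `summarize` are hypothetical names, not code from the repository, and the top-to-bottom reading order is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Assumed box layout: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class ReceiptPipeline:
    """Hypothetical wiring of the stages; each stage is supplied by the caller."""

    detect_boxes: Callable[[object], Sequence[Box]]  # stage 1: text-line detection
    recognize_line: Callable[[object, Box], str]     # stage 2: OCR on one text line
    summarize: Callable[[str], dict]                 # stage 3: field extraction

    def run(self, image: object) -> dict:
        # Read lines top to bottom, then left to right (an assumed reading order).
        boxes = sorted(self.detect_boxes(image), key=lambda b: (b[1], b[0]))
        lines = [self.recognize_line(image, box) for box in boxes]
        # Drop empty predictions before handing the text to the summarizer.
        text = "\n".join(line for line in lines if line)
        return self.summarize(text)
```

Keeping each stage behind a plain callable means a stage can be replaced (for example, swapping the detector) without touching the others, which matches the idea of tracking them as separate issues.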
Merging the branch for now with the ability to run inference on text lines, assuming they have been extracted correctly: projects/ComputerVision/ocr_receipts_sroie/trocr_inference.py
WIP training scripts are also available. Ideally, these would be backed up as tags rather than merged into the main branch, but they are scripts and therefore low risk.
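For reference, text-line inference with TrOCR through Hugging Face `transformers` might look roughly like the sketch below. It is not the content of `trocr_inference.py`; the function names and the `microsoft/trocr-base-printed` checkpoint are assumptions chosen for illustration, and the inputs are assumed to be pre-cropped RGB text-line images.

```python
from typing import List, Sequence


def load_trocr(checkpoint: str = "microsoft/trocr-base-printed"):
    """Load a TrOCR processor and model (the checkpoint name is an assumption)."""
    # Imported lazily so that the helper below can be used without transformers.
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
    return processor, model


def recognize_lines(line_images: Sequence, processor, model) -> List[str]:
    """Run OCR on pre-cropped text-line images, one string per image."""
    if not line_images:
        return []
    # The processor resizes and normalizes the crops into a single batch tensor.
    pixel_values = processor(images=list(line_images), return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
    return [text.strip() for text in texts]
```

A typical call would be `processor, model = load_trocr()` followed by `recognize_lines(crops, processor, model)`, where `crops` are PIL images converted with `.convert("RGB")`. Passing a whole receipt image instead of line crops is expected to give the poor results described above.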
projects/ComputerVision/ocr_receipts_sroie
projects/ComputerVision/kaggle_ocr_receipts
The goal of this project is to build a pipeline that extracts desired fields from an image, while practicing LLM fine-tuning and prompt engineering.
Approach: Use image OCR (optical character recognition) to extract unstructured text, then use an LLM to summarize the desired fields from that text.
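As a rough illustration of the second step, the OCR output can be wrapped in a prompt that names the desired fields and asks for JSON. The wording of the prompt and the `build_extraction_prompt` helper are hypothetical, and no particular LLM or client library is assumed.

```python
import json
from typing import Sequence


def build_extraction_prompt(ocr_text: str, fields: Sequence[str]) -> str:
    """Build a field-extraction prompt (hypothetical wording, to be tuned)."""
    # Show the expected keys as a JSON object with null placeholders.
    schema = {name: None for name in fields}
    return (
        "Extract the following fields from the receipt text below.\n"
        "Answer with a single JSON object using exactly these keys; "
        "use null for any field that is not present.\n"
        f"Keys: {json.dumps(schema)}\n"
        "Receipt text:\n"
        f"{ocr_text.strip()}\n"
    )
```

Asking for a fixed set of keys with explicit nulls keeps the reply machine-readable, and the field list can be shortened or extended to change the difficulty of the task.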
Dataset: Found a receipts dataset on Kaggle (https://www.kaggle.com/datasets/trainingdatapro/ocr-receipts-text-detection) which has the desired characteristics:
1) Some fields are nearly always present, e.g. store.
2) Even such a field may sometimes be absent, which the pipeline needs to handle.
3) There are many fields with varying frequency of occurrence, so the list of fields to look for can be varied to produce problems of different complexity.
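One way to handle absent fields is to normalize whatever the LLM returns into a fixed set of keys, with `None` for anything missing or empty. The sketch below assumes the reply contains a JSON object somewhere in its text; `parse_fields` is a hypothetical helper, not part of the repository.

```python
import json
from typing import Dict, Optional, Sequence


def parse_fields(llm_reply: str, fields: Sequence[str]) -> Dict[str, Optional[str]]:
    """Map an LLM reply onto the requested fields, using None when absent."""
    result: Dict[str, Optional[str]] = {name: None for name in fields}
    # Take the outermost {...} span, since replies often wrap JSON in prose.
    start, end = llm_reply.find("{"), llm_reply.rfind("}")
    if start == -1 or end <= start:
        return result
    try:
        data = json.loads(llm_reply[start : end + 1])
    except json.JSONDecodeError:
        return result
    if not isinstance(data, dict):
        return result
    for name in fields:
        value = data.get(name)
        # Treat null and blank strings as "field not present".
        if value is not None and str(value).strip():
            result[name] = str(value).strip()
    return result
```

For example, `parse_fields('Sure! {"store": "ACME", "total": ""}', ["store", "total", "date"])` yields `"ACME"` for `store` and `None` for both `total` and `date`, so downstream code always sees the same keys regardless of which fields a receipt contains.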