curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
MIT License
699 stars 71 forks

Collect the important details from an invoice document (PDF) #89

Open shq251 opened 1 year ago

shq251 commented 1 year ago

Hi all,

I want to build a project that extracts the important details from invoice PDF documents (such as Invoice Number, Date, Total Due, and Seller Name) as key-value pairs. We generate an hOCR file from the PDF file using an OCR engine (Tesseract). Could you please advise how to proceed from the hOCR file to extract key-value pairs using "catalyst"?

Alternatively, is there another approach to producing key-value pairs using "catalyst"?

Thanks in advance.