Open LWaetzig opened 10 months ago
Instead of Tesseract, explore the viability of LayoutParser: https://layout-parser.github.io/
Will take a deeper look at that next week; other features would remain unchanged.
The LayoutParser base model performs worse than standard PDF extraction. Now exploring the creation of our own model by training a custom Detectron2 model, like this: https://www.youtube.com/watch?v=puOKTFXRyr4
Objective
Create our own Tesseract model, used through pytesseract, to improve text extraction from PDF files. Compare the results with basic extraction using PyMuPDF or PyPDF2.
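A rough sketch of how that comparison could be set up, not a settled design. It assumes PyMuPDF (`fitz`), `pytesseract`, and Pillow are installed, and that a hand-checked reference transcript exists for each test PDF. The function names, the 300 dpi render setting, and the character-level similarity score are all illustrative choices, not part of the issue.

```python
from difflib import SequenceMatcher


def extract_with_pymupdf(pdf_path: str) -> str:
    """Baseline: read the embedded text layer with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the scoring helpers work without it

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_tesseract(pdf_path: str, dpi: int = 300) -> str:
    """OCR path: render each page to an image, then run Tesseract on it."""
    import fitz
    import pytesseract
    from PIL import Image

    texts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # RGB without alpha by default
            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            texts.append(pytesseract.image_to_string(img))
    return "\n".join(texts)


def normalize(text: str) -> str:
    """Collapse whitespace runs so layout differences do not dominate the score."""
    return " ".join(text.split())


def similarity(candidate: str, reference: str) -> float:
    """Character-level similarity in [0, 1] between extracted and reference text."""
    matcher = SequenceMatcher(None, normalize(candidate), normalize(reference), autojunk=False)
    return matcher.ratio()


def compare(pdf_path: str, reference: str) -> dict:
    """Score both extraction paths against the same reference transcript."""
    return {
        "pymupdf": similarity(extract_with_pymupdf(pdf_path), reference),
        "tesseract": similarity(extract_with_tesseract(pdf_path), reference),
    }
```

Calling `compare("sample.pdf", reference_text)` (both arguments hypothetical) would give one score per method. A custom-trained Tesseract model could later be selected through the `lang` argument of `pytesseract.image_to_string` and scored the same way.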
Key Features