create own tesseract model

LWaetzig / StudentChatbot


create own tesseract model #12

Open LWaetzig opened 10 months ago

LWaetzig commented 10 months ago

Objective

Create a custom Tesseract model, used via pytesseract, to improve text extraction from PDF files. Compare the results with baseline extraction using PyMuPDF or PyPDF2.
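A rough sketch of the two extraction paths to compare, assuming PyMuPDF (`fitz`), Pillow and pytesseract are installed. The function names, the 300 dpi default and the `lang` default are illustrative assumptions, not project code. A custom model would be selected by passing the name of its `.traineddata` file as `lang`.

```python
def extract_text_layer(pdf_path: str) -> list[str]:
    """Baseline: return the embedded text of each page via PyMuPDF (no OCR)."""
    import fitz  # PyMuPDF, imported lazily so this module loads without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def extract_ocr(pdf_path: str, lang: str = "eng", dpi: int = 300) -> list[str]:
    """OCR path: render each page to an image and run Tesseract on it.

    `lang` names the Tesseract model to use. "eng" is the stock English
    model; a custom model name here is an assumption about the setup.
    """
    import fitz
    import pytesseract
    from PIL import Image

    texts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Render the page as an RGB bitmap (no alpha channel by default).
            pix = page.get_pixmap(dpi=dpi)
            image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            texts.append(pytesseract.image_to_string(image, lang=lang))
    return texts
```

Both functions return one string per page, so the outputs can be compared page by page.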

Key Features

[ ] Own model is trained and evaluated
[ ] Results are documented and compared
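One possible way to make the comparison measurable is to score each extractor's output against a hand-checked reference text. This sketch uses only the standard library (`difflib`). The helper names and the choice of a character-level similarity ratio are assumptions, not an agreed evaluation method.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse all whitespace runs so layout differences do not affect the score."""
    return " ".join(text.split())


def similarity(reference: str, candidate: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical after normalization."""
    matcher = SequenceMatcher(None, normalize(reference), normalize(candidate), autojunk=False)
    return matcher.ratio()


def rank_extractors(reference: str, outputs: dict[str, str]) -> list[tuple[str, float]]:
    """Return (extractor name, score) pairs sorted from best to worst."""
    scores = [(name, similarity(reference, text)) for name, text in outputs.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    reference = "hello world"
    outputs = {"ocr": "hello wrld", "text_layer": "hello  world"}
    for name, score in rank_extractors(reference, outputs):
        print(f"{name}: {score:.3f}")
```

The ratio is a coarse measure. A proper character or word error rate could replace it later without changing the ranking helper.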

lanteanair commented 10 months ago

Instead of Tesseract, explore the viability of LayoutParser (https://layout-parser.github.io/):

includes pretrained model zoo
includes possibility to train custom models
identifies layout elements in documents, so filtering out headers / footers might be possible for better data quality
downside: technical difficulties, no working prototype yet
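The header / footer filtering idea could look roughly like this once a layout model has labelled the blocks on a page. The `Block` class and the label names `"header"` and `"footer"` are hypothetical stand-ins. The actual labels depend on the model and its training data, and this is not the LayoutParser API.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical detected layout element: a label plus its extracted text."""
    label: str
    text: str


def body_text(blocks: list[Block], drop: frozenset[str] = frozenset({"header", "footer"})) -> str:
    """Join the text of all blocks whose label is not in `drop`, keeping their order."""
    return "\n".join(b.text for b in blocks if b.label.lower() not in drop)


if __name__ == "__main__":
    page = [
        Block("Header", "Lecture 3"),
        Block("Title", "Intro"),
        Block("Text", "Body"),
        Block("Footer", "Page 7"),
    ]
    print(body_text(page))
```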

Will take a deeper look at that next week; other features would remain unchanged.

lanteanair commented 10 months ago

The LayoutParser base model performs worse than standard PDF extraction. Next step: explore creating our own model by training a custom Detectron2 model, as in this video: https://www.youtube.com/watch?v=puOKTFXRyr4