LWaetzig / StudentChatbot


create own tesseract model #12

Open LWaetzig opened 10 months ago

LWaetzig commented 10 months ago

Objective

Create our own Tesseract model, used through pytesseract, to improve text extraction from PDF files. Compare the results with basic extraction using PyMuPDF or PyPDF2.
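A rough sketch of how the comparison could be set up, assuming PyMuPDF (`fitz`), pytesseract, and Pillow are installed. The helper names (`extract_with_pymupdf`, `extract_with_tesseract`, `similarity`) and the idea of scoring against a hand-typed reference text are assumptions for illustration, not something already in the repo. A custom model would be selected through the `lang` argument once its `.traineddata` file is in the tessdata directory.

```python
from difflib import SequenceMatcher


def extract_with_pymupdf(pdf_path: str) -> str:
    """Baseline: read the embedded text layer with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the scoring helpers work without it

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_tesseract(pdf_path: str, lang: str = "eng", dpi: int = 300) -> str:
    """OCR path: render each page to an image and run Tesseract on it.

    `lang` is where a custom model name would go (assumption: its
    .traineddata file is already in the tessdata directory).
    """
    import fitz
    import pytesseract
    from PIL import Image

    texts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # RGB, no alpha by default
            image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            texts.append(pytesseract.image_to_string(image, lang=lang))
    return "\n".join(texts)


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences are not penalized."""
    return " ".join(text.split()).lower()


def similarity(candidate: str, reference: str) -> float:
    """Character-level similarity in [0, 1] between an extraction and a reference."""
    return SequenceMatcher(None, normalize(candidate), normalize(reference)).ratio()
```

With a manually transcribed reference for a few sample pages, `similarity(extract_with_pymupdf(path), reference)` and `similarity(extract_with_tesseract(path, lang=...), reference)` would give two comparable scores per document. `SequenceMatcher` is only a simple stand-in here; a proper character or word error rate could replace it later.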

Key Features

lanteanair commented 10 months ago

Instead of Tesseract, I am exploring the viability of LayoutParser: https://layout-parser.github.io/

I will take a deeper look at that next week. Other features would remain unchanged.

lanteanair commented 10 months ago

The LayoutParser base model performs worse than standard PDF extraction. Next, I am exploring creating our own model by training a custom Detectron2 model, like this: https://www.youtube.com/watch?v=puOKTFXRyr4