Extracting section headings and chunks from PDFs

UP-LIFT / .github

Closed GautamR-Samagra closed 10 months ago

GautamR-Samagra commented 11 months ago

Sample PDF: here

We need to extract the text from it and chunk it into headings, each paired with its related content.

We have tried 2 different approaches:

Using Deepdoc detection to extract the text headings and structure of each page and convert them to JSON: here
Using PyMuPDF to get the boundaries of the text in the PDF, then using those to identify the headings and the content pieces: here
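
For readers who can't open the links, here is a minimal sketch of what approach 2 might look like. It is not the code from the linked notebook. It assumes that headings are set in a larger font than the body text, which may not hold for every PDF. The function names `iter_lines` and `chunk_by_headings` and the `ratio` threshold are hypothetical and chosen for illustration only. The PyMuPDF part relies on `page.get_text("dict")`, which returns blocks, lines, and spans with their font sizes and bounding boxes.

```python
from collections import Counter


def iter_lines(pdf_path):
    """Yield (text, font_size) for each non-empty text line in the PDF."""
    import fitz  # PyMuPDF; imported here so chunk_by_headings works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                if block["type"] != 0:  # 0 = text block, 1 = image block
                    continue
                for line in block["lines"]:
                    text = "".join(s["text"] for s in line["spans"]).strip()
                    if text:
                        # Use the largest span size as the size of the line.
                        size = max(s["size"] for s in line["spans"])
                        yield text, round(size, 1)


def chunk_by_headings(lines, ratio=1.15):
    """Group (text, font_size) lines into heading/content chunks.

    Assumption: the font size that covers the most characters is the body
    size, and any line at least `ratio` times larger is a heading.
    """
    lines = list(lines)
    if not lines:
        return []

    chars_per_size = Counter()
    for text, size in lines:
        chars_per_size[size] += len(text)
    body_size = chars_per_size.most_common(1)[0][0]

    chunks = []
    current = {"heading": None, "text": []}
    for text, size in lines:
        if size >= body_size * ratio:
            # A new heading closes the chunk collected so far.
            if current["heading"] is not None or current["text"]:
                chunks.append(current)
            current = {"heading": text, "text": []}
        else:
            current["text"].append(text)
    if current["heading"] is not None or current["text"]:
        chunks.append(current)

    return [{"heading": c["heading"], "text": " ".join(c["text"])} for c in chunks]
```

Usage would be something like `chunk_by_headings(iter_lines("sample.pdf"))`, where `sample.pdf` is a placeholder path. Text that appears before the first heading ends up in a chunk whose heading is `None`. A heading that wraps over several lines would produce several chunks in this sketch, so a real implementation would probably also use the bounding boxes or font names to merge them.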

GautamR-Samagra commented 10 months ago

Colab notebook for extracting the structure using approach 2: link

GautamR-Samagra commented 10 months ago

The PyMuPDF approach works well. Moved to PDF Parser for now.