allenai / papermage

library supporting NLP and CV research on scientific papers
https://papermage.org
Apache License 2.0
692 stars, 54 forks

How to improve detection of sections? #65

Open ldt opened 9 months ago

ldt commented 9 months ago

Hi,

Congrats on your great work and beautiful API!

I'm especially interested in using it to create a hierarchical document based on the original PDF. My issue is that some sections are not correctly identified.

For example, in your papermage.pdf file, section 2 is merged with section 2.1:

image

And the title of section 3.3 is only partially identified:

image

I have similar issues on some of my documents.

I would like to know how this could be improved. Could the model be trained further if there were a training set of documents whose sections had been correctly identified in advance?

Let me know how I could help; the topic is really interesting!
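In the meantime, a possible stopgap for the merged-heading case is to post-process the extracted heading text. The sketch below is only an illustration under stated assumptions: it works on plain strings rather than on any papermage objects, it assumes headings carry dotted numbers such as "2" or "2.1" followed by a capitalised word, and the function names and sample strings are made up for this example.

```python
import re

# Assumption: a heading starts with a dotted section number ("2", "2.1", "3.3.1")
# that is preceded by whitespace (or the string start) and followed by a capitalised word.
SECTION_NUMBER = re.compile(r"(?<!\S)(\d+(?:\.\d+)*)\s+(?=[A-Z])")


def split_merged_headings(text):
    """Split one extracted heading string into (number, title) pairs."""
    matches = list(SECTION_NUMBER.finditer(text))
    headings = []
    for i, match in enumerate(matches):
        # The title runs until the next section number, or to the end of the string.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        headings.append((match.group(1), text[match.end():end].strip()))
    return headings


def build_outline(headings):
    """Nest (number, title) pairs by numbering depth, so "2.1" becomes a child of "2"."""
    root = {"number": "", "title": "", "children": []}
    nodes = {"": root}
    for number, title in headings:
        parent_key = number.rpartition(".")[0]  # "2.1" -> "2", "2" -> ""
        parent = nodes.get(parent_key, root)    # fall back to the root if the parent is missing
        node = {"number": number, "title": title, "children": []}
        parent["children"].append(node)
        nodes[number] = node
    return root


# Hypothetical merged heading, similar in shape to the first case described above.
pairs = split_merged_headings("2 Overview 2.1 Components")
outline = build_outline(pairs)
print(pairs)  # [('2', 'Overview'), ('2.1', 'Components')]
```

This heuristic would mis-split a title that itself contains a number followed by a capitalised word, and it does nothing for truncated titles, so it is a workaround rather than a fix for the underlying detection.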

MpLebron commented 6 months ago

I have the same problem! Hope the authors can solve this!