allenai / mmda

multimodal document analysis
Apache License 2.0

Grobid augmenter body sections, paragraphs, sentences #275

Closed geli-gel closed 11 months ago

geli-gel commented 11 months ago

This PR makes the Grobid augmenter return what Grobid provides in the <body> section of the XML. We call these "Sections", made up of "Headers" and "Body Text"; Grobid provides them as coordinates of <head> (headers), <p> (paragraphs), and <s> (sentences).
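For illustration only, here is a minimal sketch of turning a Grobid `coords` attribute into boxes. It assumes the attribute holds one or more `page,x,y,width,height` groups separated by `;`, which is how Grobid's coordinate output is documented. The `Box` type and `parse_coords` function are hypothetical names for this example, not the mmda API.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One bounding box from a Grobid coords attribute (hypothetical type)."""
    page: int
    x: float
    y: float
    width: float
    height: float


def parse_coords(coords: str) -> List[Box]:
    """Parse a coords string, assumed to be 'page,x,y,w,h' groups joined by ';'."""
    boxes = []
    for chunk in coords.split(";"):
        if not chunk.strip():
            continue  # tolerate a trailing ';' or empty group
        page, x, y, w, h = chunk.split(",")
        boxes.append(Box(int(page), float(x), float(y), float(w), float(h)))
    return boxes


# A sentence that wraps across two lines would carry two boxes.
boxes = parse_coords("1,53.80,317.78,239.07,8.07;1,53.80,328.24,100.00,8.07")
```

A <head>, <p>, or <s> element spanning several lines carries several groups, so each element maps to a list of boxes rather than a single one.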

While working on the test for this, I noticed that the number of sentences found in the body text (249) did not match the number of <s> tags in the actual XML (271). Of the 22 extras, 8 came from the paper abstract, which Grobid returns not as part of the body text but under <teiHeader> > <profileDesc> > <abstract> > <div>, and the remaining 14 came from Figure and Table <div>s.

I decided to leave all of those sentences out (lone sentences without any encompassing section), since for our current purposes we're only interested in the body text within "Sections".
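A rough sketch of how the <s> counts could be broken down by location, to reproduce this kind of body / abstract / figure split. It is not the code in this PR. It assumes the standard TEI namespace and that figure and table sentences sit under a <figure> element; `count_sentences_by_location` and `SAMPLE` are hypothetical names, and the sample XML is a toy document, not real Grobid output.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "{http://www.tei-c.org/ns/1.0}"  # standard TEI namespace

# Toy TEI document: 1 abstract sentence, 2 body sentences, 1 figure sentence.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><abstract>
    <div><p><s>A.</s></p></div>
  </abstract></profileDesc></teiHeader>
  <text><body>
    <div><head>Intro</head><p><s>B.</s><s>C.</s></p></div>
    <figure><figDesc><div><p><s>D.</s></p></div></figDesc></figure>
  </body></text>
</TEI>"""


def count_sentences_by_location(tei_xml: str) -> Counter:
    """Count <s> elements, grouped by the kind of ancestor they sit under."""
    root = ET.fromstring(tei_xml)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: p for p in root.iter() for child in p}
    counts = Counter()
    for s in root.iter(TEI + "s"):
        ancestors = set()
        node = s
        while node in parent:
            node = parent[node]
            ancestors.add(node.tag)
        # Check <figure> before <body>: figures are nested inside the body.
        if TEI + "abstract" in ancestors:
            counts["abstract"] += 1
        elif TEI + "figure" in ancestors:
            counts["figure_or_table"] += 1
        elif TEI + "body" in ancestors:
            counts["body"] += 1
        else:
            counts["other"] += 1
    return counts


counts = count_sentences_by_location(SAMPLE)
```

On a real document, `counts["body"]` would be the figure to compare against the number of sentences the augmenter returns, with the abstract and figure/table buckets accounting for the difference.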