This PR makes the grobid augmenter return what Grobid provides in the `<body>` section of the XML for what we call "Sections", which are made up of "Headers" and "Body Text". Grobid provides these as coordinates of `<head>` (headers), `<p>` (paragraphs), and `<s>` (sentences).
While working on the test for this, I noticed that the number of sentences found in the body text (249) did not match the number of `<s>` tags in the actual XML (271). Of the 22 extras, 8 came from the paper's abstract, which Grobid returns not as part of the body text but under `<teiHeader>` > `<profileDesc>` > `<abstract>` > `<div>`, and the remaining 14 came from Figure and Table `<div>`s.
I decided to leave all of those sentences out (lone sentences without any encompassing section), since for our current purposes we're only interested in the body text within "Sections".
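
For reviewers who want to reproduce this kind of breakdown, here is a minimal sketch of how the `<s>` counts could be split by location. It is not the code in this PR: the `count_sentences` helper and the toy `SAMPLE_TEI` document are hypothetical, and the figure markup in the sample is an assumption made for illustration, so the paths may need adjusting for real Grobid output.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by the toy document below.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily simplified TEI document (not real Grobid output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <profileDesc>
      <abstract>
        <div><p><s>Abstract sentence.</s></p></div>
      </abstract>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <head>Introduction</head>
        <p><s>First body sentence.</s><s>Second body sentence.</s></p>
      </div>
      <figure>
        <head>Figure 1</head>
        <figDesc><div><p><s>Caption sentence.</s></p></div></figDesc>
      </figure>
    </body>
  </text>
</TEI>"""


def count_sentences(tei_xml: str) -> dict:
    """Count <s> elements overall, in the abstract, and in body sections."""
    root = ET.fromstring(tei_xml)
    total = len(root.findall(".//tei:s", TEI_NS))
    abstract = len(root.findall(".//tei:abstract//tei:s", TEI_NS))
    # "Section" sentences: <s> in a <p> directly under a <div> that is a
    # direct child of <body> (an assumption about the section layout).
    sections = len(root.findall(".//tei:body/tei:div/tei:p/tei:s", TEI_NS))
    return {
        "total": total,
        "abstract": abstract,
        "sections": sections,
        # Everything else, e.g. sentences in figure or table descriptions.
        "other": total - abstract - sections,
    }


counts = count_sentences(SAMPLE_TEI)
print(counts)
```

On the paper used in the test, the same kind of split corresponds to 271 total, 8 abstract, 249 section, and 14 other sentences.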