MedKhem / grobid-dictionaries


First line of the body disappears (1st model) #20

Closed by bsagot 6 years ago

bsagot commented 6 years ago

Attachments: corpus.zip, puhvel-h-1-3.pdf

As discussed with Mohamed a few seconds ago: when I use the attached training data (corpus.zip) for the first model, I get a 100% F-measure when evaluating on the training data itself. However, when I run the model on the attached PDF (the first 3 pages of my PDF, which are included in the training data), the first line of the body of each page simply disappears.

bsagot commented 6 years ago

In the character stream, there is no boundary between the headnote and the body: no whitespace, nothing. Given what I see in the feature file (features are associated with the beginning of each "PDF-line"), this might be related to the cause of the issue.
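
To make the hypothesis above concrete, here is a toy sketch in Python. It is not GROBID code: the function names (`label_lines`, `extract_body`), the `HEAD` marker, and the sample strings are all hypothetical. It only assumes that each line receives a single label decided from the start of the line, and shows how a headnote fused with the first body line would then take that body line with it.

```python
def label_lines(lines, headnote_marker):
    """Assign one label per line, looking only at how the line begins.

    This mimics (as an assumption) features attached to the start of a line.
    """
    return [
        "headnote" if line.startswith(headnote_marker) else "body"
        for line in lines
    ]


def extract_body(lines, labels):
    """Keep only the lines that were labelled as body."""
    return [line for line, label in zip(lines, labels) if label == "body"]


# Case 1: the headnote sits on its own line, so a boundary exists.
page_with_boundary = ["HEAD 12", "first body line", "second body line"]

# Case 2: no whitespace or line break between headnote and body,
# so the headnote and the first body line form a single line.
page_fused = ["HEAD 12first body line", "second body line"]

body_ok = extract_body(page_with_boundary, label_lines(page_with_boundary, "HEAD"))
body_fused = extract_body(page_fused, label_lines(page_fused, "HEAD"))

print(body_ok)
print(body_fused)
```

In the fused case the whole first line is labelled as headnote, so the first body line is missing from the extracted body. Whether this is what actually happens inside the first model remains to be confirmed against the feature file.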