Open MedKhem opened 6 years ago
Hey there, just bumping this as I ran into this situation with an ill-formed document :) Is there any way to avoid it? The document and the training+evaluation data are available here: https://github.com/lascivaroma/lexical/tree/master/grobid-data/Dictionaire%20des%20synonymes%20latins
hey! what would you consider as ill-formed structures? the sequence of <fw>s?
In this specific case, a < from the OCR is not escaped, creating a tag <χήρ on line 814 that makes the document ill-formed :)
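To illustrate the problem described above, here is a minimal Python sketch of escaping raw OCR text before it is wrapped in an element. It is not how the tool itself handles this; `wrap_ocr_text`, the `p` tag, and the sample string are hypothetical and only stand in for the real content of line 814.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def wrap_ocr_text(text: str, tag: str = "p") -> str:
    """Escape &, < and > in raw OCR text, then wrap it in an element.

    Hypothetical helper for illustration only.
    """
    return f"<{tag}>{escape(text)}</{tag}>"


# Hypothetical OCR fragment containing a stray '<' before Greek text.
raw = "a <χήρ b"

# Without escaping, the parser reads '<χήρ' as the start of a tag and fails.
try:
    ET.fromstring(f"<p>{raw}</p>")
    unescaped_is_well_formed = True
except ET.ParseError:
    unescaped_is_well_formed = False

# With escaping, the fragment parses and the text round-trips unchanged.
safe = wrap_ocr_text(raw)
round_tripped = ET.fromstring(safe).text
```

With the stray `<` turned into `&lt;`, the fragment stays well-formed and the original characters are recovered on parsing.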
It should be fixed now. The new Docker image will be available to pull in about 40 minutes. Let me know if the issue persists :)
An internal validation scheme should probably be added.
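As a rough idea of what such a validation step could look like, here is a minimal Python sketch that checks well-formedness and reports where parsing failed. It is only an assumption about the shape of the check, not the project's actual implementation; `find_wellformedness_error` and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple


def find_wellformedness_error(xml_text: str) -> Optional[Tuple[int, int, str]]:
    """Return (line, column, message) for the first parse error, or None.

    Hypothetical helper: it checks well-formedness only, not schema validity.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # position reported by the parser
        return line, column, str(err)
    return None


# Hypothetical documents: the second has an unescaped '<' on its second line.
good_doc = "<doc>\n<p>a &lt;χήρ b</p>\n</doc>"
bad_doc = "<doc>\n<p>a <χήρ b</p>\n</doc>"

good_result = find_wellformedness_error(good_doc)
bad_result = find_wellformedness_error(bad_doc)
```

Running a check like this on each generated file before returning it would surface the offending line instead of silently producing an ill-formed document.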