MedKhem / grobid-dictionaries


Well formation error in the final TEI output #24

Open MedKhem opened 6 years ago

MedKhem commented 6 years ago

An internal validation scheme should probably be added.
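As a rough illustration of what such a check could look like, here is a minimal sketch in Python. It only tests well-formedness (not validity against a TEI schema), and `check_well_formed` is a hypothetical helper name, not part of grobid-dictionaries, which is a Java project.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses as XML, else (line, column, message).

    Hypothetical helper for illustration; it checks well-formedness only.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple pointing at the error.
        line, column = err.position
        return line, column, str(err)
    return None


good = check_well_formed("<entry><form>a</form></entry>")
bad = check_well_formed("<entry><form><χήρ</form></entry>")
```

Running something like this on the final TEI output before writing it to disk would at least report the line of the first offending token instead of silently producing a broken file.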

PonteIneptique commented 4 years ago

Hey there, just bumping this, as I ended up in the same situation with an ill-formed document :) Is there any way to avoid this? The document and the training and evaluation data are available here: https://github.com/lascivaroma/lexical/tree/master/grobid-data/Dictionaire%20des%20synonymes%20latins

MedKhem commented 4 years ago

Hey! What would you consider ill-formed structures? The sequence of <fw>s?

PonteIneptique commented 4 years ago

In this specific case, a < from the OCR is not escaped, creating a tag <χήρ at line 814 that makes the document ill-formed :)
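To make the problem concrete, here is a small Python sketch of escaping OCR text before wrapping it in an element. This is only an assumption about the general remedy, not the actual fix in grobid-dictionaries (which is written in Java); `wrap_token` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_token(tag, ocr_text):
    """Wrap raw OCR text in <tag>...</tag>, escaping &, < and > first.

    Hypothetical helper for illustration only.
    """
    return f"<{tag}>{escape(ocr_text)}</{tag}>"


# A stray "<" from the OCR becomes "&lt;" instead of opening a bogus tag.
fragment = wrap_token("form", "<χήρ")

# The escaped fragment parses, and the original text is recovered intact.
parsed = ET.fromstring(fragment)
```

Without the `escape` call the fragment would read `<form><χήρ</form>`, which no XML parser accepts.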

MedKhem commented 4 years ago

> In this specific case, a < from the OCR is not escaped, creating a tag <χήρ line 814 that makes the document ill-formed :)

It should be fixed now. The new Docker image should be available to pull in about 40 minutes. Let me know if the issue persists :)