clarinsi / jos2ud

Validation: metadata errors #10

Closed kajad closed 5 years ago

kajad commented 5 years ago

The official validation script (see http://quest.ms.mff.cuni.cz/cgi-bin/zeman/unidep/validation-report.pl) reports the following metadata errors, which need to be fixed in the conversion script:

[Tree number 4763 on line 97325 Sent ssj325.1933.6876]: Mismatch between the text attribute and the FORM field. Form[13] is '"' but text is ' "Vesele pletilje"....'
[Line 97342 Sent ssj325.1933.6876]: Extra characters at the end of the text attribute, not accounted for in the FORM fields: ' "Vesele pletilje".'
[Tree number 5855 on line 119869 Sent ssj434.2395.8407]: Mismatch between the text attribute and the FORM field. Form[13] is '"' but text is ' "Tzigane"....'
[Line 119885 Sent ssj434.2395.8407]: Extra characters at the end of the text attribute, not accounted for in the FORM fields: ' "Tzigane".'
[Tree number 6166 on line 125339 Sent ssj455.2503.8811]: Mismatch between the text attribute and the FORM field. Form[8] is '"' but text is ' "Sveti aliansi" vzho...'
[Line 125360 Sent ssj455.2503.8811]: Extra characters at the end of the text attribute, not accounted for in the FORM fields: ' "Sveti aliansi" vzhodnih sil, ki je zdaj za vselej mrtva.'

From what I can tell, SpaceAfter=No is being added to the tokens preceding the quotation marks, even though the # text line of the sentence shows a space after these tokens.
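
To illustrate the kind of mismatch the validator is complaining about, here is a minimal Python sketch. It is not the validator's actual code; it only assumes the CoNLL-U convention that a token is followed by a space unless its MISC field carries SpaceAfter=No. The helper names and the token `PREV` are hypothetical stand-ins (the real preceding word is not shown in the log above).

```python
def rebuild_text(tokens):
    """Rebuild sentence text from (FORM, MISC) pairs, CoNLL-U style."""
    parts = []
    for form, misc in tokens:
        parts.append(form)
        # A token is followed by a space unless MISC contains SpaceAfter=No.
        if "SpaceAfter=No" not in misc.split("|"):
            parts.append(" ")
    return "".join(parts).rstrip()


def matches_text(text, tokens):
    """True if the FORM fields and SpaceAfter flags reproduce the text attribute."""
    return rebuild_text(tokens) == text


# Hypothetical sentence fragment; PREV stands in for the unknown preceding token.
TEXT = 'PREV "Vesele pletilje".'

# Wrong: PREV is marked SpaceAfter=No although the text has a space after it.
bad = [("PREV", "SpaceAfter=No"), ('"', "SpaceAfter=No"), ("Vesele", "_"),
       ("pletilje", "SpaceAfter=No"), ('"', "SpaceAfter=No"), (".", "_")]

# Right: PREV has no SpaceAfter=No, so a space is emitted before the quote.
good = [("PREV", "_")] + bad[1:]

print(matches_text(TEXT, bad))
print(matches_text(TEXT, good))
```

With the wrong flag the rebuilt string has no space before the opening quotation mark, so it no longer lines up with the text attribute, which is consistent with the "Mismatch" and "Extra characters" messages above.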

TomazErjavec commented 5 years ago

The error is in fact in the official TEI ssj500k 2.1 release, where a named entity sometimes starts with a space, which it never should. So the fix should be made as part of producing a new version of ssj500k, which I will proceed to do. More to follow.
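
For anyone who wants to look for such cases in the source, a rough Python sketch follows. It rests on assumptions not stated in this thread: that named entities are encoded as `seg` elements with `type="name"`, that whitespace is encoded as `c` elements, and that the standard TEI namespace is used. The sample fragment and the function name are hypothetical and would need adjusting to the actual ssj500k encoding.

```python
import xml.etree.ElementTree as ET

# Assumed TEI namespace; adjust if the corpus files declare a different one.
TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical fragment: a name segment that wrongly begins with whitespace.
SAMPLE = '''<s xmlns="http://www.tei-c.org/ns/1.0"><w>PREV</w><seg type="name"><c> </c><pc>"</pc><w>Vesele</w><c> </c><w>pletilje</w><pc>"</pc></seg><pc>.</pc></s>'''


def names_starting_with_space(root):
    """Return the text of every name segment whose first child is whitespace."""
    hits = []
    for seg in root.iter(TEI + "seg"):
        if seg.get("type") != "name":
            continue
        children = list(seg)
        # Assumption: whitespace between tokens is a <c> element.
        if children and children[0].tag == TEI + "c":
            hits.append("".join(seg.itertext()))
    return hits


print(names_starting_with_space(ET.fromstring(SAMPLE)))
```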

TomazErjavec commented 5 years ago

ssj500k is now fixed (as draft 2.2), and with it these errors are gone.