clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

Validation error: character content of element "w" invalid; must be a string matching the regular expression "\S+" #566

Closed by gclux 1 year ago

gclux commented 1 year ago

In France, numbers are usually formatted according to this pattern: "1 234,56" (as opposed to the English conventions: "1,234.56"). Note: in Belgium, Switzerland or Canada, there are different conventions for French.

Our tokenizer correctly recognizes these numbers, which are annotated in the following way: <w xml:id="ParlaMint-FR_2017-09-25-E2001.s27.w66" msd="UPosTag=NUM|Number=Plur" lemma="3 301">3 301</w>

As a result, the validation script gives the following error: value of attribute "lemma" is invalid; must be a string matching the regular expression "\S+"

It makes sense that the lemma should not include blanks, so in this example we can replace it with "3301".

But we still get this error: character content of element "w" invalid; must be a string matching the regular expression "\S+"
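To illustrate why the validator rejects the token, here is a minimal Python sketch of the check implied by the error message. It assumes the schema pattern must match the whole value, which `re.fullmatch` mimics; the function name `is_valid_token` is hypothetical and not part of the ParlaMint validation scripts.

```python
import re

# Pattern quoted in the validation error message.
# Assumption: the schema requires the pattern to match the entire value,
# which re.fullmatch reproduces here.
TOKEN_PATTERN = re.compile(r"\S+")


def is_valid_token(text: str) -> bool:
    """Return True if the text contains no whitespace and is not empty."""
    return TOKEN_PATTERN.fullmatch(text) is not None


print(is_valid_token("3301"))   # True
print(is_valid_token("3 301"))  # False: the internal space breaks the match
```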

Why not accept "3 301" as the text content?

As a workaround, we are now correcting the text (which is then different from the non-annotated version).

gclux commented 1 year ago

canceled

TomazErjavec commented 1 year ago

> Why not accept "3 301" as the text content?

Yes, you are right, we should allow internal spaces in words. I will fix this in the documentation branch right away.

> As a workaround, we are now correcting the text (which is then different from the non-annotated version).

No need for that: if your pipeline correctly recognises full words, there is nothing to fix. But for lemmas, I think it makes sense not to allow spaces.
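The outcome of this thread (internal spaces allowed in word text, but not in lemmas) could be sketched as follows. This is only an illustration: the relaxed pattern `\S+( \S+)*` is an assumption, not the pattern actually committed to the schema, and `normalise_lemma` and `check_word` are hypothetical helper names.

```python
import re

# Assumed relaxed pattern for <w> text: runs of non-space characters
# separated by single spaces (the real schema fix may differ).
WORD_TEXT = re.compile(r"\S+( \S+)*")
# Lemmas keep the strict pattern quoted in the error message.
LEMMA = re.compile(r"\S+")


def normalise_lemma(lemma: str) -> str:
    """Remove all whitespace from a lemma, e.g. '3 301' becomes '3301'."""
    return re.sub(r"\s+", "", lemma)


def check_word(text: str, lemma: str) -> bool:
    """Check word text against the relaxed pattern and lemma against the strict one."""
    return (WORD_TEXT.fullmatch(text) is not None
            and LEMMA.fullmatch(lemma) is not None)


text = "3 301"
print(check_word(text, normalise_lemma(text)))  # True
print(check_word(text, text))                   # False: lemma still contains a space
```

With this split, the annotated text stays identical to the non-annotated version, and only the lemma is normalised.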