gclux closed 1 year ago
canceled
Why not accept "3 301" for the text content?
Yes, you are right, we should allow internal spaces in words. I will fix this in the documentation branch right away.
As a workaround, we are now correcting the text (which is then different from the non-annotated version).
No need for that: if your pipeline correctly recognises full words, there is nothing to fix. For lemmas, though, I think it makes sense not to allow spaces.
In France, numbers are usually formatted according to this pattern: "1 234,56" (as opposed to the English conventions: "1,234.56"). Note: in Belgium, Switzerland or Canada, there are different conventions for French.
Our tokenizer correctly recognizes these numbers, which are annotated in the following way: <w xml:id="ParlaMint-FR_2017-09-25-E2001.s27.w66" msd="UPosTag=NUM|Number=Plur" lemma="3 301">3 301</w>
As a result the validation script gives the following error: value of attribute "lemma" is invalid; must be a string matching the regular expression "\S+"
It makes sense that the lemma should not include blanks and we can replace it in this example with "3301".
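For reference, the lemma clean-up mentioned above could look roughly like the sketch below. The function name `normalize_number_lemma` is hypothetical and not part of our pipeline or of the validation scripts; it simply removes every whitespace character (including no-break spaces, which often appear as thousands separators in French text) so that the lemma satisfies "\S+".

```python
import re


def normalize_number_lemma(token: str) -> str:
    # Hypothetical helper: strip all whitespace from a numeric token so the
    # resulting lemma contains no blanks. In Python, "\s" on str patterns also
    # matches Unicode spaces such as U+00A0 and U+202F.
    return re.sub(r"\s+", "", token)


print(normalize_number_lemma("3 301"))
print(normalize_number_lemma("1 234,56"))
```

Only the lemma attribute would be rewritten this way; the text content of the element stays untouched.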
But we still get this error: character content of element "w" invalid; must be a string matching the regular expression "\S+"
Why not accept "3 301" for the text content?
As a workaround, we are now correcting the text (which is then different from the non-annotated version).
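To make the two errors concrete, here is a small Python sketch that mimics the check. `STRICT` is the pattern quoted in the error messages; `RELAXED` is only an assumption about what a pattern allowing internal spaces might look like, not the actual schema change. Python's `\S` is also only an approximation of the schema's definition, and `fullmatch` is used because schema patterns have to match the whole value.

```python
import re

# Pattern quoted in the validation errors: no whitespace anywhere.
STRICT = re.compile(r"\S+")

# Hypothetical relaxed pattern (an assumption, not the real schema):
# runs of non-space characters separated by single spaces,
# with no leading or trailing space.
RELAXED = re.compile(r"\S+( \S+)*")


def is_valid(pattern: re.Pattern, value: str) -> bool:
    # The whole value must match, as in schema validation.
    return pattern.fullmatch(value) is not None


for value in ["3301", "3 301", " 3301", "3  301"]:
    print(repr(value), is_valid(STRICT, value), is_valid(RELAXED, value))
```

With these definitions, "3 301" fails the strict pattern but passes the relaxed one, while values with leading or doubled spaces are rejected by both.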