Closed gclux closed 1 year ago
CZ has the same number formatting pattern, but UDPipe uses a different (wrong) tokenization, so it does not affect my data.
@TomazErjavec, is it possible to allow spaces only in numbers in a schema? Something like \S+|[0-9][,0-9 ]*(?<! )
Um, don't know, but I think not, i.e. I think RNG patterns don't support such REs. I wouldn't complicate too much anyway, esp. as I think somebody else (LV?) complained about needing spaces in regular words.
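For reference: as far as I know, `pattern` params in RELAX NG use the XML Schema regex dialect, which has no lookaround, so `(?<! )` would indeed not be accepted. The same constraint can be written without lookbehind by requiring that a multi-character number ends in a non-space character. A minimal sketch in Python comparing the two (the names `WITH_LOOKBEHIND`, `WITHOUT_LOOKBEHIND`, `SAMPLES` and `accepts` are made up for illustration; `re.fullmatch` stands in for the implicit anchoring of schema patterns):

```python
import re

# Pattern as proposed in the thread; relies on lookbehind.
WITH_LOOKBEHIND = re.compile(r"\S+|[0-9][,0-9 ]*(?<! )")

# Lookbehind-free rewrite: a digit, optionally followed by more
# digits/commas/spaces, where the final character must not be a space.
WITHOUT_LOOKBEHIND = re.compile(r"\S+|[0-9]([,0-9 ]*[,0-9])?")

# Illustrative inputs only, not taken from any corpus.
SAMPLES = ["1", "3301", "3 301", "1 234,56", "3 301 ", " 3301", "de facto", ""]


def accepts(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in SAMPLES:
        print(repr(sample),
              accepts(WITH_LOOKBEHIND, sample),
              accepts(WITHOUT_LOOKBEHIND, sample))
```

Both variants should agree on every sample; only the second one avoids the construct that the schema language lacks.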
ok, then normalized spaces string...
@TomazErjavec I thought you had resolved this. Now the LV sample (#544) needs it: https://github.com/clarin-eric/ParlaMint/actions/runs/4043599910/jobs/6952730808#step:4:119
I suggest adding something like this:
<define name="non-empty-or-number.string">
<a:documentation>A string that is non-empty and does not contain white-space or contain number (possibly with spaces)</a:documentation>
<data type="string">
<param name="pattern">(\S+)|(\d[\.,\d ]*\d)</param>
</data>
</define>
and allowing this pattern in https://github.com/clarin-eric/ParlaMint/blob/c866bf0fd898bccd6f6de0816b44a6643ce71bbd/Schema/ParlaMint-TEI.ana.rng#L242-L244 https://github.com/clarin-eric/ParlaMint/blob/c866bf0fd898bccd6f6de0816b44a6643ce71bbd/Schema/ParlaMint-TEI.ana.rng#L270-L272 https://github.com/clarin-eric/ParlaMint/blob/c866bf0fd898bccd6f6de0816b44a6643ce71bbd/Schema/ParlaMint-TEI.ana.rng#L142-L177
I thought I did too, hm, obviously not. I agree to allow it, of course, but:
- not sure about your pattern, as e.g. "1" would be illegal (the minimum length is 2)
minimum legal value length 1: (\S+)
- I would allow it only as text of w, not in lemmas; would this work?
Agree.
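The point about "1" can be checked quickly: the first alternative `(\S+)` already covers any single non-space token, so the two-character minimum only applies to the spaced-number branch. A small sketch in Python, assuming whole-string matching as in schema patterns (`PROPOSED` and `is_valid` are illustrative names, not part of the schema):

```python
import re

# The pattern proposed above, copied verbatim.
PROPOSED = re.compile(r"(\S+)|(\d[\.,\d ]*\d)")


def is_valid(text: str) -> bool:
    """Whole-string match, mimicking the implicit anchoring of schema patterns."""
    return PROPOSED.fullmatch(text) is not None


if __name__ == "__main__":
    # "1" is accepted by the first alternative, "6 000" by the second.
    for text in ["1", "6000", "6 000", "1 234,56", "6 000 ", "two words", ""]:
        print(repr(text), is_valid(text))
```

Strings with a trailing space or with spaces between non-numeric words are still rejected by this pattern.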
The LV (@Skriptotajs) corpus has spaces in numbers. But we can probably ask them to remove the spaces from lemmas
so
<w xml:id="ParlaMint-LV_2016-02-18-PT12-348-U167-P1.3.3"
pos="xn"
lemma="6 000"
msd="UPosTag=NUM|NumType=Card">6 000</w>
should be
<w xml:id="ParlaMint-LV_2016-02-18-PT12-348-U167-P1.3.3"
pos="xn"
lemma="6000"
msd="UPosTag=NUM|NumType=Card">6 000</w>
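The lemma change shown above could be automated along these lines. This is only a sketch: `normalize_number_lemma` and `SPACED_NUMBER` are hypothetical names, and it assumes that only ordinary spaces (not no-break spaces) occur inside numeric lemmas and that the token text itself is left untouched.

```python
import re

# A number written with internal ordinary spaces, e.g. "6 000" or "1 234,56".
# Assumption: periods and commas are the only other separators to expect.
SPACED_NUMBER = re.compile(r"\d[\d., ]*\d")


def normalize_number_lemma(lemma: str) -> str:
    """Remove spaces from a numeric lemma; return any other lemma unchanged."""
    if SPACED_NUMBER.fullmatch(lemma):
        return lemma.replace(" ", "")
    return lemma


if __name__ == "__main__":
    for lemma in ["6 000", "6000", "1 234,56", "De facto"]:
        print(repr(lemma), "->", repr(normalize_number_lemma(lemma)))
```

Non-numeric lemmas that contain a space pass through unchanged, so they would still need separate handling.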
minimum legal value length 1: (\S+)
Silly me, of course.
So, did it now, just for words, not lemmas.
I have overlooked it is not only about numbers but also about foreign MWEs:
<w xml:id="ParlaMint-LV_2020-06-04-PT13-2090-U61-P1.15.28"
pos="xf"
lemma="De facto"
msd="UPosTag=X|Foreign=Yes">de facto</w>
https://github.com/clarin-eric/ParlaMint/actions/runs/4054157839/jobs/6975651049#step:4:221
I now allow spaces in words and in lemmas, cf. 3376483. I reuse the existing definition of normalized-space.string; I think it should do the trick. The modification is currently in the documentation branch.
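To illustrate what a normalized-space constraint typically accepts, here is a sketch in Python. Note that the regex below is an assumption about what "normalized space" means (non-empty, no leading or trailing whitespace, single spaces between parts); it is not copied from the actual normalized-space.string definition in the schema, and `NORMALIZED_SPACE` and `is_normalized` are illustrative names.

```python
import re

# Assumed meaning of "normalized space": one or more runs of non-space
# characters, separated by exactly one ordinary space each.
NORMALIZED_SPACE = re.compile(r"\S+( \S+)*")


def is_normalized(text: str) -> bool:
    """Whole-string match against the assumed normalized-space pattern."""
    return NORMALIZED_SPACE.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ["de facto", "De facto", "6 000", "de  facto", " de facto", "de facto ", ""]:
        print(repr(text), is_normalized(text))
```

Under this assumption both spaced numbers and foreign multi-word expressions pass, while doubled, leading, or trailing spaces do not.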
Merged into the data branch. Closing.
In France, numbers are usually formatted according to this pattern: "1 234,56" (as opposed to the English conventions: "1,234.56"). Note: in Belgium, Switzerland or Canada, there are different conventions for French.
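The difference between the two conventions can be sketched in Python by swapping the separators of the English-style output. `format_fr` is a hypothetical helper, and it uses an ordinary space as the thousands separator, as in the example above (real French typography often uses a no-break space instead).

```python
def format_fr(value: float, decimals: int = 2) -> str:
    """Format a number as '1 234,56': space for thousands, comma for decimals."""
    english = f"{value:,.{decimals}f}"  # English convention, e.g. '1,234.56'
    # Swap both separators in one pass so they do not overwrite each other.
    return english.translate(str.maketrans({",": " ", ".": ","}))


if __name__ == "__main__":
    print(format_fr(1234.56))   # French-style rendering of 1,234.56
    print(format_fr(3301, 0))   # an integer with a thousands separator
```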
Our tokenizer correctly recognizes these numbers, which are annotated in the following way:
As a result the validation script gives the following error: value of attribute "lemma" is invalid; must be a string matching the regular expression "\S+"
It makes sense that the lemma should not include blanks and we can replace it in this example with "3301".
But we still get this error: character content of element "w" invalid; must be a string matching the regular expression "\S+"
Why not accept "3 301" for the text content?
As a workaround, we are now correcting the text (which is then different from the non-annotated version).