clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

Validation error: character content of element "w" invalid; must be a string matching the regular expression "\S+" #567

Closed gclux closed 1 year ago

gclux commented 1 year ago

In France, numbers are usually formatted according to this pattern: "1 234,56" (as opposed to the English conventions: "1,234.56"). Note: in Belgium, Switzerland or Canada, there are different conventions for French.

Our tokenizer correctly recognizes these numbers, which are annotated in the following way:

                     <w xml:id="ParlaMint-FR_2017-09-25-E2001.s27.w66"
                        msd="UPosTag=NUM|Number=Plur"
                        lemma="3 301">3 301</w>

As a result, the validation script gives the following error: value of attribute "lemma" is invalid; must be a string matching the regular expression "\S+"

It makes sense that the lemma should not include blanks and we can replace it in this example with "3301".

But we still get this error: character content of element "w" invalid; must be a string matching the regular expression "\S+"

Why not accept "3 301" as the text content?

As a workaround, we are now correcting the text (which is then different from the non-annotated version).
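
The two errors can be reproduced outside the schema. The following is a minimal sketch, assuming that Python's `re.fullmatch` is a close enough stand-in for a schema pattern, which has to match the whole value. The helper name `is_valid_token` is made up for illustration and is not part of the ParlaMint tooling.

```python
import re

# The pattern quoted in the validation error. A schema pattern has to match
# the whole value, which re.fullmatch approximates here.
TOKEN_PATTERN = re.compile(r"\S+")


def is_valid_token(value: str) -> bool:
    """Return True if the whole value consists of non-whitespace characters."""
    return TOKEN_PATTERN.fullmatch(value) is not None


for sample in ["3301", "3 301", "1 234,56"]:
    print(repr(sample), is_valid_token(sample))
```

Only "3301" passes, which is why both the lemma and the text content of the example above are rejected.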

matyaskopp commented 1 year ago

CZ has the same number formatting pattern, but UDPipe uses a different (wrong) tokenization, so it does not affect my data.

@TomazErjavec, is it possible to allow spaces only in numbers in a schema? Something like \S+|[0-9][,0-9 ]*(?<! )

TomazErjavec commented 1 year ago

is it possible to allow spaces only in numbers in a schema? Something like \S+|[0-9][,0-9 ]*(?<! )

Um, don't know, but I think not, i.e. I think RNG patterns don't support such REs. I wouldn't complicate too much anyway, esp. as I think somebody else (LV?) complained about needing spaces in regular words.
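
For what it is worth, the lookbehind `(?<! )` is the only part of the suggested expression that goes beyond XSD-style regular expressions, which have no lookaround assertions. The same set of strings can be described without it. The sketch below compares the two forms in Python; the constant names are illustrative, and Python's `re` only approximates schema regex semantics.

```python
import re

# Pattern suggested above; relies on a lookbehind to forbid a trailing space.
WITH_LOOKBEHIND = re.compile(r"\S+|[0-9][,0-9 ]*(?<! )")

# Lookbehind-free rewrite: after the first digit, any further characters
# must end in a digit or a comma rather than a space.
WITHOUT_LOOKBEHIND = re.compile(r"\S+|[0-9]([,0-9 ]*[,0-9])?")

samples = ["1", "3 301", "1 234,56", "3 301 ", " 3 301", "de facto"]
for s in samples:
    a = WITH_LOOKBEHIND.fullmatch(s) is not None
    b = WITHOUT_LOOKBEHIND.fullmatch(s) is not None
    print(repr(s), a, b)
```

Both patterns agree on every sample here: the spaced numbers are accepted, while values with leading or trailing spaces and ordinary multi-word strings are not.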

matyaskopp commented 1 year ago

Um, don't know, but I think not, i.e. I think RNG patterns don't support such REs. I wouldn't complicate too much anyway, esp. as I think somebody else (LV?) complained about needing spaces in regular words.

ok, then normalized spaces string...

matyaskopp commented 1 year ago

@TomazErjavec I thought you had resolved this. Now LV sample(#544) needs it: https://github.com/clarin-eric/ParlaMint/actions/runs/4043599910/jobs/6952730808#step:4:119

I suggest adding something like this:

  <define name="non-empty-or-number.string">
    <a:documentation>A string that is non-empty and does not contain white-space or contain number (possibly with spaces)</a:documentation>
    <data type="string">
      <param name="pattern">(\S+)|(\d[\.,\d ]*\d)</param>
    </data>
  </define>

and allowing this pattern in:

https://github.com/clarin-eric/ParlaMint/blob/c866bf0fd898bccd6f6de0816b44a6643ce71bbd/Schema/ParlaMint-TEI.ana.rng#L242-L244
https://github.com/clarin-eric/ParlaMint/blob/c866bf0fd898bccd6f6de0816b44a6643ce71bbd/Schema/ParlaMint-TEI.ana.rng#L270-L272
https://github.com/clarin-eric/ParlaMint/blob/c866bf0fd898bccd6f6de0816b44a6643ce71bbd/Schema/ParlaMint-TEI.ana.rng#L142-L177
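
A quick way to sanity-check the proposed pattern is to run it over a few values. This is only a sketch: it assumes that Python's `re.fullmatch` behaves like the schema pattern for these inputs, and the helper `accepts` is a hypothetical name.

```python
import re

# Pattern copied from the proposed non-empty-or-number.string definition.
PROPOSED = re.compile(r"(\S+)|(\d[\.,\d ]*\d)")


def accepts(value: str) -> bool:
    """Return True if the whole value matches the proposed pattern."""
    return PROPOSED.fullmatch(value) is not None


for sample in ["1", "6 000", "3 301", "1 234,56", "6 000 ", "de facto"]:
    print(repr(sample), accepts(sample))
```

Single-character values such as "1" are covered by the first alternative, spaced numbers by the second, and anything else containing a space is still rejected.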

TomazErjavec commented 1 year ago

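I thought I did too, hm, obviously not. I agree to allow it, of course, but:

  • not sure about your pattern, as e.g. "1" would be illegal (the minimum length is 2)
  • I would allow it only as text of w, not in lemmas; would this work?

matyaskopp commented 1 year ago

  • not sure about your pattern, as e.g. "1" would be illegal (the minimum length is 2)

minimum legal value length 1: (\S+)

  • I would allow it only as text of w, not in lemmas; would this work?

Agree.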

The LV (@Skriptotajs) corpus has spaces in numbers, but we can probably ask them to remove the spaces from lemmas.

so

                     <w xml:id="ParlaMint-LV_2016-02-18-PT12-348-U167-P1.3.3"
                        pos="xn"
                        lemma="6 000"
                        msd="UPosTag=NUM|NumType=Card">6 000</w>

should be

                     <w xml:id="ParlaMint-LV_2016-02-18-PT12-348-U167-P1.3.3"
                        pos="xn"
                        lemma="6000"
                        msd="UPosTag=NUM|NumType=Card">6 000</w>
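
A fix of this kind could be scripted. The sketch below is one possible approach, not the project's tooling: it uses Python's `xml.etree.ElementTree`, assumes the `w` elements are in the TEI namespace, and the function name `strip_spaces_in_numeric_lemmas` and the `SPACED_NUMBER` pattern are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# ASSUMPTION: a "spaced number" is a run of digits, dots and commas split
# into groups by single spaces, e.g. "6 000" or "1 234,56".
SPACED_NUMBER = re.compile(r"\d[\d.,]*( [\d.,]+)+")


def strip_spaces_in_numeric_lemmas(root: ET.Element) -> int:
    """Remove spaces from numeric lemmas of <w>; return how many were changed."""
    changed = 0
    for w in root.iter(f"{{{TEI_NS}}}w"):
        lemma = w.get("lemma", "")
        if SPACED_NUMBER.fullmatch(lemma):
            w.set("lemma", lemma.replace(" ", ""))
            changed += 1
    return changed


sample = (
    f'<s xmlns="{TEI_NS}">'
    '<w lemma="6 000" msd="UPosTag=NUM|NumType=Card">6 000</w>'
    '<w lemma="3301" msd="UPosTag=NUM|Number=Plur">3 301</w>'
    "</s>"
)
root = ET.fromstring(sample)
print(strip_spaces_in_numeric_lemmas(root))
print([(w.get("lemma"), w.text) for w in root.iter(f"{{{TEI_NS}}}w")])
```

Only the `lemma` attribute is touched; the text content of `w` keeps its original spacing.
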

TomazErjavec commented 1 year ago

minimum legal value length 1: (\S+)

Silly me, of course.

So, did it now, just for words, not lemmas.

matyaskopp commented 1 year ago

I overlooked that this is not only about numbers but also about foreign MWEs:

                     <w xml:id="ParlaMint-LV_2020-06-04-PT13-2090-U61-P1.15.28"
                        pos="xf"
                        lemma="De facto"
                        msd="UPosTag=X|Foreign=Yes">de facto</w>

https://github.com/clarin-eric/ParlaMint/actions/runs/4054157839/jobs/6975651049#step:4:221

TomazErjavec commented 1 year ago

I now allow spaces in words in lemmas, cf. 3376483. I reuse the existing definition of normalized-space.string, I think it should do the trick. Mod currently in documentation branch.
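
The definition of normalized-space.string itself is not quoted in this thread. As a rough sketch, assuming it means a non-empty string with no leading or trailing whitespace and only single spaces between parts, its effect can be approximated in Python as follows. The pattern `\S+( \S+)*` and the helper `is_normalized` are assumptions for illustration, not copies of the schema.

```python
import re

# ASSUMPTION: approximation of a "normalized space" string, i.e. runs of
# non-whitespace characters separated by exactly one space each.
NORMALIZED_SPACE = re.compile(r"\S+( \S+)*")


def is_normalized(value: str) -> bool:
    """Return True if the whole value is non-empty and space-normalized."""
    return NORMALIZED_SPACE.fullmatch(value) is not None


for sample in ["de facto", "De facto", "6 000", " de facto", "de  facto", "de facto "]:
    print(repr(sample), is_normalized(sample))
```

Under this assumption both "6 000" and "de facto" are accepted, while values with leading, trailing or doubled spaces are still rejected.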

matyaskopp commented 1 year ago

Merged into the data branch. Closing.