clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

UA: listPerson and listOrg elements wrongly marked for language #734

Closed TomazErjavec closed 1 year ago

TomazErjavec commented 1 year ago

In connection with #732, the program complained about the strings below. Their elements should be marked with xml:lang="en"; without xml:lang they are treated as Ukrainian. Where a string looks Cyrillic but the program still complains, at least one of its letters is Latin rather than Cyrillic, as it should be.

Could you please fix ParlaMint-UA-listPerson.xml and ParlaMint-UA-listOrg.xml to correct this?

ERROR: "Secretary General of NATO" in occupation marked as uk but script is (also) Latin, fixing language to uk-Latn
ERROR: "MP" in occupation marked as uk but script is (also) Latin, fixing language to uk-Latn
ERROR: "MP" in occupation marked as uk but script is (also) Latin, fixing language to uk-Latn
ERROR: "MP" in occupation marked as uk but script is (also) Latin, fixing language to uk-Latn
ERROR: "President of Poland" in occupation marked as uk but script is (also) Latin, fixing language to uk-Latn
ERROR: "MP" in occupation marked as uk but script is (also) Latin, fixing language to uk-Latn
ERROR: "Prime Minister of the United Kingdom and Leader of the Conservative Party" in occupation marked as uk but script is (also) Latin, fixing language to uk-Latn
ERROR: "MP" in occupation marked as uk but script is (also) Latin, fixing language to uk-Latn
ERROR: "cуддя Вінницького окружного адміністративного суду" in occupation marked as uk but script is (also) Latin, fixing language to uk-Latn
ERROR: "MP" in occupation marked as uk but script is (also) Latin, fixing language to uk-Latn
ERROR: "MP" in occupation marked as uk but script is (also) Latin, fixing language to uk-Latn
ERROR: "MP" in occupation marked as uk but script is (also) Latin, fixing language to uk-Latn
ERROR: "MP" in occupation marked as uk but script is (also) Latin, fixing language to uk-Latn
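
The kind of check that produces messages like those above can be sketched in a few lines of Python. This is only an illustration, not the actual ParlaMint validation program: the function names `scripts_used` and `diagnose` are hypothetical, and the script of each letter is approximated from its Unicode character name.

```python
import unicodedata


def scripts_used(text):
    """Return the scripts ('Latin', 'Cyrillic') found among the letters of text."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits and punctuation
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            found.add("Latin")
        elif name.startswith("CYRILLIC"):
            found.add("Cyrillic")
    return found


def diagnose(text):
    """Classify a string as 'latin', 'cyrillic', 'mixed' or 'none' (hypothetical labels)."""
    scripts = scripts_used(text)
    if scripts == {"Latin"}:
        return "latin"      # likely needs xml:lang="en"
    if scripts == {"Cyrillic"}:
        return "cyrillic"   # consistent with the default language, uk
    if scripts == {"Latin", "Cyrillic"}:
        return "mixed"      # likely a Latin look-alike inside a Cyrillic word
    return "none"


print(diagnose("President of Poland"))            # latin
print(diagnose("\u0063\u0443\u0434\u0434\u044f"))  # mixed: Latin c followed by Cyrillic letters
```

A "latin" result suggests the element needs xml:lang="en", while "mixed" points to a typing error in otherwise Ukrainian text.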

matyaskopp commented 1 year ago

This is now fixed in the pipeline: https://github.com/clarin-eric/ParlaMint/pull/736/commits/7f8d5a90ebafb87c2b8e3841f0ca1bc6703c4b9c#diff-6d49b0da9a308492483579593030e4c413a19d7f07bb6651a8c4f2fab2fd39fe

matyaskopp commented 1 year ago

I believe all except

ERROR: "cуддя Вінницького окружного адміністративного суду" in occupation marked as uk but script is (also) Latin, fixing language to uk-Latn

are fixed. The c at the beginning is Latin (U+0063); it should be Cyrillic с (U+0441).

Leaving this open: we need to improve the validations in the Google Sheet.
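
A look-alike letter of this kind is hard to spot by eye, so it can help to list the offending characters with their positions and code points. The following Python sketch is an assumption about how one might do that; `latin_letters`, `to_cyrillic`, and the small `HOMOGLYPHS` table are hypothetical and not part of the ParlaMint pipeline.

```python
import unicodedata

# Hypothetical, deliberately small map of Latin letters to their Cyrillic look-alikes.
HOMOGLYPHS = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "i": "\u0456",
    "o": "\u043e", "p": "\u0440", "x": "\u0445",
}


def latin_letters(text):
    """Return (index, character, code point) for every Latin letter in text."""
    return [
        (i, ch, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")
    ]


def to_cyrillic(text):
    """Replace known Latin look-alikes; only for strings known to be Ukrainian."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


# Looks like the Ukrainian word for "judge", but the first letter is Latin c.
sample = "\u0063\u0443\u0434\u0434\u044f"
print(latin_letters(sample))                # [(0, 'c', 'U+0063')]
print(latin_letters(to_cyrillic(sample)))   # []
```

The replacement step should be applied only to strings already known to be Ukrainian, since it would corrupt genuine English text such as "MP".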

TomazErjavec commented 1 year ago

I noticed this too and corrected my copies of listPerson, so there is no need to resend because of this; just close the issue when it is fixed at your end.

matyaskopp commented 1 year ago

Fixed in our spreadsheet and updated sample: https://github.com/clarin-eric/ParlaMint/pull/762/commits/5545d5f444f8655708c70b4c09d98809223d4e41