clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

BG: listPerson and listOrg elements wrongly marked for language #733

Closed TomazErjavec closed 1 year ago

TomazErjavec commented 1 year ago

In connection with #732, the program complained about the strings below. Most of them should have their element marked with xml:lang="en"; without xml:lang, the element is taken to be in Bulgarian. Where a string looks Cyrillic but the program still complains, the language is correct, but at least one of the letters is Latin rather than Cyrillic, as it should be.

Could you pls. fix ParlaMint-BG-listPerson.xml and ParlaMint-BG-listOrg.xml to correct this? Note that the list below gives just one error per string, even though there could be several occurrences of the string/element combination in the two files.

ERROR: "ABV" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "AP" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Ardino" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Asenovgrad" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Aytos" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "BDC-NU" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Blagoevgrad" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Botevgrad" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "BSPFB" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "BSPLB" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Burgas" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "CHirpan" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "DABG" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "DB" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Dimitrovgrad" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Dobrich" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Dobrinishte" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "DSB" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Dupnitsa" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Elhovo" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Elin Pelin" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Etropole" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Gabrovo" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Galabovo" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "GERB" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "GERB-UDF" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Gorna Oryahovitsa" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Gotse Delchev" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "GrMov" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Haskovo" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Ihtiman" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Isperih" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "ITN" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Kardzhali" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Karlovo" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Kazanlak" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Knezha" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Koynare" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Kubrat" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Kyustendil" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Lom" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Lovech" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Madan" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Maritsa" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "minister" in occupation marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Momchilgrad" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Monreal" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Montana" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "MRF" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Oryahovo" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Panagyurishte" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Parvomay" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Pazardzhik" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Pernik" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Petrich" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "PF" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Pirdop" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Pleven" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Plovdiv" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Polski Trambezh" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Popovo" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Razgrad" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Razlog" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "RB" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "RP" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "RUBGWC" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Ruse" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "RUTO" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Samokov" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Sandanski" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Sapienza University" in education marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "s. Blaskovo" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "s. Brezhani" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "s. Breznitsa" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Sevlievo" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "s. Glava" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "s. Golyamo Vranovo" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Shumen" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Silistra" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "s. Karamantsi" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "s. Kutovo" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Sliven" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Smolyan" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "s. Novachene" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "s. Nova Kamena" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "s. Novo selo" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Sofia" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "s. Okorsh" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "s. Raduil" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "s. Samovodene" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "s. Slavotin" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "s. Srem" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Stara Zagora" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "s. Tranak" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Svilengrad" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Svishtov" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "s. Vodach" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Targovishte" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "TISP" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Tryavna" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "UP" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Usogorsk" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Varna" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Veliko Tarnovo" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Velingrad" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Vidin" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "VOLYA" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Vratsa" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "WCC" in orgName abb marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Yambol" in placeName marked as bg but script is (also) Latin, fixing language to bg-Latn
ERROR: "Военна академия "Г.C. Pаковски", гр. София, магистър по военно дело" in education marked as bg but script is (also) Latin, fixing language to bg-Latn
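
The two kinds of complaint described above (a wholly Latin string in an element taken to be Bulgarian, and a Cyrillic string with stray Latin look-alike letters) can be told apart with a small script check. The following is a minimal sketch, not the validator that produced the log; the function names `latin_letters` and `script_profile` are hypothetical, and the sample string is spelled with explicit escapes so that the Latin `C` and `P` are unambiguous.

```python
import unicodedata


def latin_letters(text):
    """Return (index, char) pairs for every letter whose Unicode name is LATIN."""
    return [
        (i, ch)
        for i, ch in enumerate(text)
        if ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")
    ]


def script_profile(text):
    """Classify text as 'Latn', 'Cyrl', 'mixed', or 'none' (no such letters)."""
    names = [unicodedata.name(ch, "") for ch in text if ch.isalpha()]
    has_latin = any(n.startswith("LATIN") for n in names)
    has_cyrillic = any(n.startswith("CYRILLIC") for n in names)
    if has_latin and has_cyrillic:
        return "mixed"
    if has_latin:
        return "Latn"
    if has_cyrillic:
        return "Cyrl"
    return "none"


# Cyrillic name with a Latin "C" and a Latin "P" typed in place of Cyrillic letters.
mixed = "\u0413.C. P\u0430\u043a\u043e\u0432\u0441\u043a\u0438"

print(script_profile("Sofia"))  # wholly Latin: candidate for xml:lang="en"
print(script_profile(mixed))    # mixed: Latin look-alikes need retyping in Cyrillic
print(latin_letters(mixed))     # positions of the offending letters
```

A "Latn" result points to a missing xml:lang attribute, while "mixed" points to individual letters that need to be replaced, and `latin_letters` shows where they are.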

TomazErjavec commented 1 year ago

Thank you, this has been fixed except for:

ERROR: "Sapienza University" in education marked as bg but script is (also) Latin, fixing language to bg-Lat
ERROR: "minister" in occupation marked as bg but script is (also) Latin, fixing language to bg-Latn

The first one seems wrong anyway, as education (also in other places in the BG corpus) is what somebody has studied, and not where. The second one appears only once in listPerson, otherwise occupation is given in Bulgarian, not English.
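
To illustrate why a single English "minister" is reported while the Bulgarian occupations are not, here is a rough sketch of an xml:lang inheritance check. It assumes that an element without its own xml:lang inherits the nearest ancestor's value. The `SAMPLE` fragment and the names `has_latin` and `flag` are made up for illustration and are not taken from ParlaMint-BG-listPerson.xml or from the actual validator.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Clark-notation name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical fragment for illustration only; real files use the TEI namespace.
SAMPLE = """<listPerson xml:lang="bg">
  <person>
    <occupation>minister</occupation>
    <occupation xml:lang="en">minister</occupation>
    <occupation>министър</occupation>
  </person>
</listPerson>"""


def has_latin(text):
    """True if any character in text has a Unicode name starting with LATIN."""
    return any(unicodedata.name(ch, "").startswith("LATIN") for ch in text)


def flag(elem, inherited=None, out=None):
    """Collect (tag, text) for elements whose effective language is bg
    but whose own text contains Latin letters."""
    out = [] if out is None else out
    lang = elem.get(XML_LANG, inherited)  # own xml:lang, else the ancestor's
    text = (elem.text or "").strip()
    if lang == "bg" and text and has_latin(text):
        out.append((elem.tag, text))
    for child in elem:
        flag(child, lang, out)
    return out


flagged = flag(ET.fromstring(SAMPLE))
print(flagged)
```

Only the first `occupation` is flagged: the second carries its own xml:lang="en", and the third is written in Cyrillic. With real TEI files the tags would carry the TEI namespace, so the reported tag names would look different.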

So, @osenova, could you please fix the two errors and resend your listPerson (or open a pull request on GitHub and let me know)?

osenova commented 1 year ago

ParlaMint-BG.zip

Hi Tomaz, the errors have been fixed, and I am sending the file here. If there are more errors of this kind, however, please contact Kiril or both of us.

TomazErjavec commented 1 year ago

Thank you @osenova, it looks good now. If anything else crops up, I will let you both know, sorry!