clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

UA political orientation + patch tsv2tei script #770

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

UA corpus preserved IDs but changed some abbreviated names.

The idea is to add the prefix # and to match on IDs when the prefix is present in the data.
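A minimal sketch of this matching idea, written in Python rather than in the XSLT that the actual scripts use. It drops an optional leading # and compares on the bare ID. The function names and the sample lists are illustrative assumptions, not code from the repository.

```python
def normalize_id(ref: str) -> str:
    """Return the bare ID: trim whitespace, then drop one leading '#'."""
    ref = ref.strip()
    return ref[1:] if ref.startswith("#") else ref


def find_missing(org_ids, tsv_refs):
    """List the IDs in org_ids that have no match among tsv_refs.

    tsv_refs may mix prefixed ('#pp.nsnu') and bare ('pp.nsnu') forms.
    """
    known = {normalize_id(ref) for ref in tsv_refs}
    return [org_id for org_id in org_ids if normalize_id(org_id) not in known]


# Hypothetical data: IDs from the organisation list vs. the TSV ID column.
org_ids = ["pp.nsnu", "pp.urdp", "pp.ndp"]
tsv_refs = ["#pp.nsnu", "pp.urdp\r"]  # the stray '\r' mimics a DOS line ending
missing = find_missing(org_ids, tsv_refs)
print(missing)
```

Normalising both sides before comparing means the check behaves the same whether or not a given TSV already carries the prefix.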

TomazErjavec commented 1 year ago

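@matyaskopp, I've now:

  • deleted orientations-tsv2tei.xsl (this was an obsolete script)
  • fixed wiki-tsv2tei.xsl and enco-tsv2tei.xsl so that they take # into account (actually, they just throw it away and match on the ID)
  • modified the scripts so they work with DOS end-of-lines (but it would be better to convert the TSVs to Unix first, as we otherwise always use Unix EOLs, and other countries' TSVs are Unix too)

With this, the party names are recognised; however, some are missing from the TSVs, both the Wiki and the enco ones. Maybe @AnnaParla could add these?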

ERROR: For ParlaMint-UA-listOrg cant find party НСНУ (pp.nsnu) in Wiki TSV
ERROR: For ParlaMint-UA-listOrg cant find party УРДП (pp.urdp) in Wiki TSV
ERROR: For ParlaMint-UA-listOrg cant find party НДП (pp.ndp) in Wiki TSV
ERROR: For ParlaMint-UA-listOrg cant find party Позиція (pp.cp) in Wiki TSV
ERROR: For ParlaMint-UA-listOrg cant find party Справедливість (pp.justice) in Wiki TSV
ERROR: For ParlaMint-UA-listOrg cant find party фСДПУ(о) (fr.sdpuo) in Wiki TSV

and

ERROR: For ParlaMint-UA-listOrg cant find party НСНУ (pp.nsnu) in encoder TSV
ERROR: For ParlaMint-UA-listOrg cant find party УРДП (pp.urdp) in encoder TSV
ERROR: For ParlaMint-UA-listOrg cant find party НДП (pp.ndp) in encoder TSV
ERROR: For ParlaMint-UA-listOrg cant find party Позиція (pp.cp) in encoder TSV
ERROR: For ParlaMint-UA-listOrg cant find party Справедливість (pp.justice) in encoder TSV

Note that all parties / parl. groups should be in the TSVs even if e.g. you can't find their Wiki page; all the values that can't be determined should have the hyphen as their content.

matyaskopp commented 1 year ago

End-of-lines fixed: I have added a patch to the Makefile target that loads the updated Ukrainian data: https://github.com/clarin-eric/ParlaMint/blob/9d8ef3805162765fd20282275a65c1a3742a0fcb/Corpora/Orientations/Makefile#L104-L108
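For reference, converting a TSV from DOS to Unix end-of-lines can be sketched as below. This is not the patch in the linked Makefile, only an illustration in Python; the function names are hypothetical.

```python
def to_unix_eol(data: bytes) -> bytes:
    """Convert DOS (CRLF) and bare CR line endings to Unix LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(path: str) -> bool:
    """Rewrite the file at path with Unix EOLs; return True if it changed."""
    with open(path, "rb") as f:
        original = f.read()
    converted = to_unix_eol(original)
    if converted != original:
        with open(path, "wb") as f:
            f.write(converted)
    return converted != original
```

Working on bytes rather than decoded text leaves the Cyrillic party names untouched whatever the file encoding is, and running the conversion twice is harmless.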

AnnaParla commented 1 year ago

@matyaskopp, I've now:

  • deleted orientations-tsv2tei.xsl (this was an obsolete script)
  • fixed wiki-tsv2tei.xsl and enco-tsv2tei.xsl so that they take # into account (actually, they just throw it away and match on the ID)
  • modified the scripts so they work with DOS end-of-lines (but it would be better to convert the TSVs to Unix first, as we otherwise always use Unix EOLs, and other countries' TSVs are Unix too)

With this, the party names are recognised; however, some are missing from the TSVs, both the Wiki and the enco ones. Maybe @AnnaParla could add these?

ERROR: For ParlaMint-UA-listOrg cant find party НСНУ (pp.nsnu) in Wiki TSV
ERROR: For ParlaMint-UA-listOrg cant find party УРДП (pp.urdp) in Wiki TSV
ERROR: For ParlaMint-UA-listOrg cant find party НДП (pp.ndp) in Wiki TSV
ERROR: For ParlaMint-UA-listOrg cant find party Позиція (pp.cp) in Wiki TSV
ERROR: For ParlaMint-UA-listOrg cant find party Справедливість (pp.justice) in Wiki TSV
ERROR: For ParlaMint-UA-listOrg cant find party фСДПУ(о) (fr.sdpuo) in Wiki TSV

and

ERROR: For ParlaMint-UA-listOrg cant find party НСНУ (pp.nsnu) in encoder TSV
ERROR: For ParlaMint-UA-listOrg cant find party УРДП (pp.urdp) in encoder TSV
ERROR: For ParlaMint-UA-listOrg cant find party НДП (pp.ndp) in encoder TSV
ERROR: For ParlaMint-UA-listOrg cant find party Позиція (pp.cp) in encoder TSV
ERROR: For ParlaMint-UA-listOrg cant find party Справедливість (pp.justice) in encoder TSV

Note that all parties / parl. groups should be in the TSVs even if e.g. you can't find their Wiki page; all the values that can't be determined should have the hyphen as their content.

These parties and their Wiki URLs were added to the org page of our metadata Google spreadsheet as part of the Ukrainian parliamentary proceedings extension project covering 2002-2012 (Terms 4-6), but they are not relevant for the 2012-2023 (Terms 7-9) timespan of the ParlaMint-UA corpus.

Shall I add them to the Wiki and enco TSVs for the ParlaMint-UA corpus (2012-2023) anyway?

TomazErjavec commented 1 year ago

Shall I add them to the Wiki and enco TSVs for the ParlaMint-UA corpus (2012-2023) anyway?

Yes please. As I wrote:

Note that all parties / parl. groups should be in the TSVs ...; all the values that can't be determined should have the hyphen as their content.

In other words, you should put - in all cells except the country and the party ID. This is just to get rid of the error messages, and so that you have a complete list of parties in the TSV.
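The placeholder rule can be sketched as follows in Python, assuming a TSV whose header has one country column and one party-ID column. The column names used here (`Country`, `ID`, `Wiki`, `LR`) and the ID `pp.example` are hypothetical; the real TSV headers may differ.

```python
def placeholder_row(header, country, party_id, country_col="Country", id_col="ID"):
    """Build a row with '-' in every cell except the country and the party ID."""
    values = {country_col: country, id_col: party_id}
    return [values.get(col, "-") for col in header]


def add_missing_parties(lines, country, party_ids, id_col="ID"):
    """Return the TSV lines plus one placeholder row per party ID not yet listed."""
    header = lines[0].split("\t")
    id_index = header.index(id_col)
    present = {line.split("\t")[id_index] for line in lines[1:] if line}
    result = list(lines)
    for party_id in party_ids:
        if party_id not in present:
            row = placeholder_row(header, country, party_id, id_col=id_col)
            result.append("\t".join(row))
            present.add(party_id)
    return result


# Hypothetical TSV content (header plus one existing row).
tsv = ["Country\tID\tWiki\tLR", "UA\tpp.example\t-\t-"]
updated = add_missing_parties(tsv, "UA", ["pp.nsnu", "pp.example"])
print("\n".join(updated))
```

Existing rows are left as they are, so only parties that are absent from the TSV receive a hyphen-filled row.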

TomazErjavec commented 1 year ago

This all seems to work now, no errors. So, closing.