clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

Political orientation differences between wiki and enco tsv files #775

Closed AnnaParla closed 1 year ago

AnnaParla commented 1 year ago

When political orientations for a party in wiki and enco tsv files do not match, which orientation for that party will end up in the concordancers, wiki-based or enco-based, or both? Also, will the wiki tsv files have retrieval dates? The latter question stems from the observation that positions of some parties were revised in Wikipedia between the time wiki tsv files were harvested and now.

TomazErjavec commented 1 year ago

Interesting questions, adding other hopefully interested parties to this issue.

When political orientations for a party in wiki and enco tsv files do not match, which orientation for that party will end up in the concordancers, wiki-based or enco-based, or both?

So far, neither, as I haven't added this to the concordancers yet but I should. My idea was to take the Wiki, if it is missing, then take enco. We could have both but there are so many speech attributes already that it seems a bit of overload. Also, only 3 corpora have them, of these only two more or less populated, yours and PT, where I just noticed a bug, at least as far as GitHub is concerned...

Also, will the wiki tsv files have retrieval dates? The latter question stems from the observation that positions of some parties were revised in Wikipedia between the time wiki tsv files were harvested and now.

Argh! No, we don't have the dates; it never occurred to me that, of course, they are not set in stone. It is ironic that we have dates on almost everything in the teiHeader, except Wiki (and enco) pol. orientations, my bad. But maybe it's a bit late in the day to check them all again (over 1,000!), at least for 3.1. (N.B. it should really be v4!)

matyaskopp commented 1 year ago

When political orientations for a party in wiki and enco tsv files do not match, which orientation for that party will end up in the concordancers, wiki-based or enco-based, or both?

So far, neither, as I haven't added this to the concordancers yet but I should. My idea was to take the Wiki, if it is missing, then take enco. We could have both but there are so many speech attributes already that it seems a bit of overload. Also, only 3 corpora have them, of these only two more or less populated, yours and PT, where I just noticed a bug, at least as far as GitHub is concerned...

I expected it to be done the other way round, because the enco file is encoded by persons who know the specifics of the parliament and the language better.

Also, will the wiki tsv files have retrieval dates? The latter question stems from the observation that positions of some parties were revised in Wikipedia between the time wiki tsv files were harvested and now.

This is a really good point. Another feature for fanatic volunteers is to get the exact wiki user responsible for assigning the category :-)

But I agree with @TomazErjavec that it is probably too late to introduce it.

katjameden commented 1 year ago

When political orientations for a party in wiki and enco tsv files do not match, which orientation for that party will end up in the concordancers, wiki-based or enco-based, or both?

So far, neither, as I haven't added this to the concordancers yet but I should. My idea was to take the Wiki, if it is missing, then take enco. We could have both but there are so many speech attributes already that it seems a bit of overload. Also, only 3 corpora have them, of these only two more or less populated, yours and PT, where I just noticed a bug, at least as far as GitHub is concerned...

I expected it to be done the other way round, because the enco file is encoded by persons who know the specifics of the parliament and the language better.

Yes, I agree to some extent, but I would also assume that a more solid resource with better coverage would be better - as @TomazErjavec noted, only three corpora have enco values, so Wikipedia would probably be better in terms of better coverage. Personally, I would also suggest using the CHES values (specifically the LRGEN value only) as a primary source, since the values come from an expert dataset with a clear methodology that also tracks changes in the orientation of some parties over the years. However, the values are numerical on a scale of 0-10, while Wikipedia provides labels, so I am not sure how that would translate to the concordancer.

AnnaParla commented 1 year ago

When political orientations for a party in wiki and enco tsv files do not match, which orientation for that party will end up in the concordancers, wiki-based or enco-based, or both?

So far, neither, as I haven't added this to the concordancers yet but I should. My idea was to take the Wiki, if it is missing, then take enco. We could have both but there are so many speech attributes already that it seems a bit of overload. Also, only 3 corpora have them, of these only two more or less populated, yours and PT, where I just noticed a bug, at least as far as GitHub is concerned...

I expected it to be done the other way round, because the enco file is encoded by persons who know the specifics of the parliament and the language better.

Yes, I agree to some extent, but I would also assume that a more solid resource with better coverage would be better - as @TomazErjavec noted, only three corpora have enco values, so Wikipedia would probably be better in terms of better coverage.

Which Wikipedia entries shall we consider better, those in English or in Ukrainian? :) The thing is that there are noticeable differences between en.wiki and uk.wiki in terms of the presence / absence of an entry for a specific political entity, i.e. party/group/faction (overall, more entities in focus are covered in uk.wiki, but some are not covered in either language), references to different sources used (overall, the quality of the references varies a lot in both languages but may be a bit more solid in uk.wiki due to citing local experts / media / party programs) and the application of labels. As for the latter, en.wiki is more consistent in including political positions in the info box, while in uk.wiki they are often mentioned in the body of the text. Also, en.wiki labels have a wider range (e.g., CCL, BT, SY), whereas uk.wiki entries prefer the L-CL-C-CR-R scale. As a result, there are some mismatches in labels between en.wiki and uk.wiki (these cases are marked and commented on in our metadata spreadsheet).

On the other hand, my enco labels are "golden-mean" approximations based on both en.wiki and uk.wiki entries as well as many other sources, including analytical reports and publications by Ukrainian think tanks and research institutes, interviews with party leaders / members in the Ukrainian media, and party webpages. However, the enco labels also need to be taken with a grain of salt, because since the 2000s Ukrainian parties have been undergoing de-ideologization. Now many of them have no clear-cut ideologies and centre around civilizational and geostrategic orientations, individual politicians or business interests.

Personally, I would also suggest using the CHES values (specifically the LRGEN value only) as a primary source, since the values come from an expert dataset with a clear methodology that also tracks changes in the orientation of some parties over the years. However, the values are numerical on a scale of 0-10, while Wikipedia provides labels, so I am not sure how that would translate to the concordancer.

Unfortunately, Ukraine is not included in the CHES surveys. Some local surveys differentiate between “lrecon” and something resembling “galtan”, but their "lrgen" results vary drastically due to different applications of the _salience value. Also, local surveys cover only a few of the most popular parties ahead of elections, whereas there were 349 political parties on record at the country's Single Registry as of 1 January 2020. Given the methodological inconsistency and small coverage of the local surveys, I don't see an alternative to imperfectly approximated labels in these circumstances.

Hopefully, we will have it all nailed down in v4, as @TomazErjavec suggests :)

jureskubic commented 1 year ago

When political orientations for a party in wiki and enco tsv files do not match, which orientation for that party will end up in the concordancers, wiki-based or enco-based, or both?

I agree with Tomaž: Wiki first and, if not available, enco. CHES would be a good idea if it provided information for all the parties we worked on, but as Anna pointed out, UA is not included, and neither are some others. As for the language, we did it this way: we first considered the EN Wiki page and, if it did not provide the information, we turned to the Wiki page in the country's language. I am aware that this means potential problems, especially for UA, but this is how the methodology was set, and in my opinion it would be inconsistent if we now opted for the country-language Wiki page first instead of the EN one. EN pages also produce a wider range of orientation labels, which is important for our work. There will always be a few mismatches; we took that into account, and I think this is academically and methodologically perfectly justifiable, since we would otherwise have a lot of problems.

Also, will the wiki tsv files have retrieval dates? The latter question stems from the observation that positions of some parties were revised in Wikipedia between the time wiki tsv files were harvested and now.

It's a shame we didn't think of that sooner but I agree, perhaps a bit late now. As for the future, I think it would be a good idea to have it, even if it means a large amount of initial work.

TomazErjavec commented 1 year ago

Thanks, all, for the extensive comments! Given the sparsity of data in CHES and its numeric values, I think that is out, so the only change to my suggestion (first Wiki, then enco) could be that we first take enco, and then Wiki. Then @AnnaParla could have her orientations in the corpus, and most others Wiki. Would you agree?

AnnaParla commented 1 year ago

@TomazErjavec After multiple changes in the English language Wikipedia re Ukrainian parties between June 2023 and now (and not all those changes were introduced by me:-), the available en.wiki orientations retrieved / checked on 19 September 2023 (https://github.com/clarin-eric/ParlaMint/commit/f703ac03989dd597e119322f73ce4393173cf059) are quite similar to enco. Out of 115 labeled orientations only 7 differ in shades between the wiki and enco files, with 4 of them being party / faction pairs. In short, if you need to go with wiki files first for the whole project for the sake of consistency, I do not mind!
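
A comparison like the one described above (how many labeled parties the two sources share, and which of them differ) could be sketched roughly as follows. This is only an illustration: the party identifiers and the idea of holding each file as a simple party-to-label mapping are assumptions, not the actual layout of the ParlaMint wiki and enco tsv files.

```python
# Hypothetical party -> orientation mappings; the IDs and labels are made up
# for illustration and are not taken from the real wiki / enco tsv files.
wiki = {"party.A": "CL", "party.B": "CR", "party.C": "R"}
enco = {"party.A": "C", "party.B": "CR", "party.D": "L"}


def compare(wiki, enco):
    """Return (number of parties labeled in both sources, list of differences).

    Each difference is a (party, wiki_label, enco_label) tuple.
    """
    shared = sorted(wiki.keys() & enco.keys())
    differing = [(p, wiki[p], enco[p]) for p in shared if wiki[p] != enco[p]]
    return len(shared), differing


shared_count, differing = compare(wiki, enco)
print(f"{len(differing)} of {shared_count} shared labels differ")
for party, wiki_label, enco_label in differing:
    print(f"{party}: wiki={wiki_label} enco={enco_label}")
```

Parties labeled in only one of the two sources are left out of the count here, since there is nothing to compare them against.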

TomazErjavec commented 1 year ago

OK, @AnnaParla, thanks. Then let's leave it as it is, i.e. first Wiki, then enco. So, closing, thanks for all you comments!
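
The rule agreed on here (first Wiki, then enco) can be sketched as below. This is a minimal illustration under stated assumptions, not the project's actual conversion script: the column names `party` and `orientation`, the party identifiers and the sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical tsv contents; the column names and party IDs are assumptions.
wiki_tsv = "party\torientation\nparty.A\tCL\nparty.B\t\nparty.C\tR\n"
enco_tsv = "party\torientation\nparty.A\tC\nparty.B\tCR\nparty.D\tL\n"


def load_orientations(tsv_text):
    """Map party -> orientation label, skipping rows with an empty label."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        label = (row.get("orientation") or "").strip()
        if label:
            result[row["party"]] = label
    return result


def resolve(wiki, enco):
    """Prefer the wiki label; fall back to enco only when wiki has none."""
    resolved = {}
    for party in sorted(wiki.keys() | enco.keys()):
        if party in wiki:
            resolved[party] = (wiki[party], "wiki")
        else:
            resolved[party] = (enco[party], "enco")
    return resolved


resolved = resolve(load_orientations(wiki_tsv), load_orientations(enco_tsv))
for party, (label, source) in resolved.items():
    print(party, label, source)
```

Keeping the source name next to each label is a design choice of this sketch: it records which file a value came from without adding a second orientation attribute to every speech.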