clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

HR: missing person gender #815

Closed TomazErjavec closed 6 months ago

TomazErjavec commented 11 months ago

The current HR corpus has a lot of persons without the <sex> element, even though in Croatian it is easy to determine based on the person's forename. Examples: DrpićAnte, KačanMate, ŠimonovićEinwalterTena, KuharMaja etc.

5roop commented 11 months ago

How should we go about solving this? Is it enough if I update ParlaMint-HR-listPerson.xml on the data branch?

Note, however, that this can be error prone, especially for persons that do not have a page on wiki or official parliamentary webpages.

TomazErjavec commented 11 months ago

> How should we go about solving this? Is it enough if I update ParlaMint-HR-listPerson.xml on the data branch?

Well, you should make a pull request on the data branch, as usual. But note that we are just finishing the V4.0 release, for which we are taking the current files. That is why I have assigned the "future" milestone to this issue, so this will become important for a future release. Still, you can of course fix it now and make the pull request.

> Note, however, that this can be error prone, especially for persons that do not have a page on wiki or official parliamentary webpages.

Well, 98% of names clearly distinguish gender in HR, SR, BS. The rest, yes, but "U" is always an option.

nljubesi commented 11 months ago

Peter, please construct a list of people without gender with the number of speeches they gave.

We can then assign gender by going through the list of people ordered with decreasing number of speeches given.

I would say that for people giving fewer than 3 speeches it is not important what their gender is.

If the list of people with unknown gender has fewer than 50 entries, just prepare a list for me and I will assign gender where obvious.

5roop commented 11 months ago

First, the stats of utterances of ungendered speakers vs all utterances:

I'm attaching lists of all ungendered persons, sorted by descending count of utterances. RS.csv HR.csv BA.csv
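For readers who want to reproduce such a list, here is a minimal sketch, not the script actually used in this thread. It assumes the usual ParlaMint TEI layout: `<person xml:id="...">` entries that may contain `<sex value="..."/>`, and utterances marked as `<u who="#id">`. The function name `ungendered_by_utterances` is hypothetical, and counting `value="U"` as ungendered is an assumption.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def ungendered_by_utterances(list_person_xml, corpus_xml):
    """Return (person_id, utterance_count) pairs for persons without a
    known sex, sorted by descending utterance count."""
    counts = {}
    for person in ET.fromstring(list_person_xml).iter(TEI + "person"):
        sex = person.find(TEI + "sex")
        # Assumption: a missing <sex> element and value="U" both count.
        if sex is None or sex.get("value") == "U":
            counts[person.get(XML_ID)] = 0

    for u in ET.fromstring(corpus_xml).iter(TEI + "u"):
        who = (u.get("who") or "").lstrip("#")
        if who in counts:
            counts[who] += 1

    # Most frequent speakers first; ties broken by id for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Both arguments are XML strings; for a real corpus the second step would be repeated over every component file and the counts summed.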

nljubesi commented 11 months ago

This is great, @5roop, thanks! Some extra work for you, but now we can make an informed decision on how to approach this. I will be taking over from here.

TomazErjavec commented 6 months ago

@nljubesi, I know you have better things to do now, and also that you hate this kind of fiddling, so I tried to solve this problem. Of course, it is more difficult than it seems, as the source data (not so much HR, but BA and SR; I will give the details there) already have mistakes in the original sex assignment. Also, some names are mangled, with the surname being marked up as the forename and vice versa.

I wrote scripts to:

  • make a TSV with relevant person information, incl. sex
  • take this TSV and correct the sex of those which were seen to be sex-ambiguous in the source data, but we know are not, e.g. "Andrija" is a M name, and should not be marked as F in the corpus, but it is.
  • give the sex to those without, if their forename was found in the corpus with marked sex, else
  • give sex based on the list of exceptions (e.g. Jovica is M), else
  • give sex based on the -a ending

With this we get TSVs with person information + correct sex for those that should have their sex corrected or added.
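The cascade described above might look roughly like the following sketch. It is not the actual script from Corpora/Sex. The record layout (dicts with `forename` and `sex` keys), the function names, and the contents of the `corrections` and `exceptions` tables are hypothetical. The last fallback assumes that a forename ending in -a is F and any other forename is M.

```python
def forename_lexicon(persons):
    """Map each forename to a sex if the corpus marks it unambiguously."""
    seen = {}
    for p in persons:
        if p["sex"] in ("M", "F"):
            seen.setdefault(p["forename"], set()).add(p["sex"])
    return {name: next(iter(s)) for name, s in seen.items() if len(s) == 1}


def assign_sex(persons, corrections, exceptions):
    """Return copies of the person records with sex corrected or added."""
    fixed = []
    for p in persons:
        p = dict(p)
        # Override forenames known to be wrongly marked in the source
        # (e.g. "Andrija" marked as F).
        if p["forename"] in corrections:
            p["sex"] = corrections[p["forename"]]
        fixed.append(p)

    lexicon = forename_lexicon(fixed)
    for p in fixed:
        if p["sex"] in ("M", "F"):
            continue
        name = p["forename"]
        if name in lexicon:        # forename seen elsewhere with marked sex
            p["sex"] = lexicon[name]
        elif name in exceptions:   # hand-made list, e.g. Jovica is M
            p["sex"] = exceptions[name]
        else:                      # assumed rule: -a ending means F
            p["sex"] = "F" if name.endswith("a") else "M"
    return fixed
```

For example, with `corrections={"Andrija": "M"}` and `exceptions={"Jovica": "M"}`, an unmarked "Maja" gets F from the lexicon if another Maja is already marked F, while an unmarked "Ante" falls through to the ending rule and gets M.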

These are the stats for HR:

In:
    166 HR      F
    494 HR      M
    376 HR      U
Out:
    103 HR      F
    274 HR      M

I still need to implement merging the TSVs into the corpora, but the current state is in Corpora/Sex.

TomazErjavec commented 6 months ago

HR sex has now been fixed, so, closing.

nljubesi commented 6 months ago

This is closed, but just wanted to say - thanks.