Closed: TomazErjavec closed this issue 8 months ago.
How should we go about solving this? Is it enough if I update ParlaMint-HR-listPerson.xml on the data branch?
Note, however, that this can be error prone, especially for persons that do not have a page on wiki or official parliamentary webpages.
> How should we go about solving this? Is it enough if I update ParlaMint-HR-listPerson.xml on the data branch?
Well, you should make a pull request on the data branch, as usual. But note that we are just finishing the V4.0 release, for which we are taking the current files. That is why I have assigned the "future" milestone to this issue, so this will become important for a future release. Still, you can of course fix it now and make the pull request.
> Note, however, that this can be error prone, especially for persons that do not have a page on wiki or official parliamentary webpages.
Well, 98% of names clearly distinguish gender in HR, SR, BS. The rest, yes, but "U" is always an option.
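To illustrate the "U" fallback mentioned above, here is a minimal sketch of a forename lookup. The table `FORENAME_SEX` and the function `guess_sex` are hypothetical names, not part of any ParlaMint script; the four entries come from the example persons in this issue, and a real table would have to be compiled and checked by hand.

```python
# Hypothetical lookup table: forename -> sex code ("M" or "F").
# Only the forenames from the examples in this issue are listed here.
FORENAME_SEX = {
    "Ante": "M",
    "Mate": "M",
    "Tena": "F",
    "Maja": "F",
}


def guess_sex(forename):
    """Return "M" or "F" for a known forename, otherwise "U" (unknown)."""
    return FORENAME_SEX.get(forename.strip(), "U")


if __name__ == "__main__":
    for name in ["Ante", "Maja", "Unlisted"]:
        print(name, guess_sex(name))
```

Anything not in the table stays "U", so ambiguous or mangled names are left for manual review rather than guessed.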
Peter, please construct a list of people without gender with the number of speeches they gave.
We can then assign gender by going through the list of people ordered with decreasing number of speeches given.
I would say that for people giving fewer than 3 speeches it is not important what their gender is.
If the list of people with unknown gender has fewer than 50 entries, just prepare a list for me and I will assign where obvious.
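The list requested above could be produced roughly as follows. This is only a sketch, not the script that was actually used: it assumes that persons are `<person xml:id="...">` elements in the TEI namespace, that a known sex is encoded as `<sex value="M"/>` or `<sex value="F"/>`, and that utterances are `<u who="#id">` elements. The function names, the `min_speeches` parameter and the sample data (including the id `ExampleKnown`) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def ungendered_ids(list_person_xml):
    """Return ids of persons with no <sex> element or with value "U"."""
    root = ET.fromstring(list_person_xml)
    ids = []
    for person in root.iter(TEI + "person"):
        sex = person.find(TEI + "sex")
        if sex is None or sex.get("value", "U") == "U":
            ids.append(person.get(XML_ID))
    return ids


def count_utterances(transcript_xml):
    """Count <u> elements per speaker id (the @who value without '#')."""
    root = ET.fromstring(transcript_xml)
    return Counter(u.get("who", "").lstrip("#") for u in root.iter(TEI + "u"))


def ranked_ungendered(list_person_xml, transcripts, min_speeches=0):
    """List (id, count) for ungendered persons, most utterances first."""
    counts = Counter()
    for transcript in transcripts:
        counts.update(count_utterances(transcript))
    rows = [(pid, counts[pid]) for pid in ungendered_ids(list_person_xml)
            if counts[pid] >= min_speeches]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Made-up sample data for illustration only.
LIST_PERSON = """<listPerson xmlns="http://www.tei-c.org/ns/1.0">
  <person xml:id="DrpićAnte"/>
  <person xml:id="KuharMaja"/>
  <person xml:id="ExampleKnown"><sex value="M"/></person>
</listPerson>"""

TRANSCRIPT = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div>
  <u who="#DrpićAnte"/><u who="#DrpićAnte"/>
  <u who="#KuharMaja"/><u who="#ExampleKnown"/>
</div></body></text></TEI>"""

if __name__ == "__main__":
    for pid, n in ranked_ungendered(LIST_PERSON, [TRANSCRIPT]):
        print(pid, n, sep="\t")
```

Passing `min_speeches=3` would drop the persons for whom, as suggested above, the gender is not important.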
First, the stats of utterances of ungendered speakers vs all utterances:
I'm attaching lists of all ungendered persons, sorted by descending count of utterances. RS.csv HR.csv BA.csv
This is great, @5roop, thanks! Some extra work for you, but now we can make an informed decision how to approach this. I will be taking over from here.
@nljubesi, I know you have better things to do now, and that you hate this kind of fiddling, so I tried to solve this problem. Of course, it is more difficult than it seems, as the source data (not so much HR, but BA and SR; I will give the details there) already have mistakes in the original sex assignment. Also, some names are mangled, with the surname marked up as the forename and vice versa.
I wrote scripts to:
With this we get TSVs with person information + correct sex for those that should have their sex corrected or added.
These are the stats for HR:
In:
166 HR F
494 HR M
376 HR U
Out:
103 HR F
274 HR M
I still need to implement merging the TSVs to the corpora, but the current state is in Corpora/Sex.
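For illustration, the merge step could look something like the sketch below. It is not the implementation in Corpora/Sex. It assumes a TSV with the hypothetical columns `id` and `sex`, persons encoded as `<person xml:id="...">` in the TEI namespace, and sex encoded as `<sex value="..."/>`. The function name and the sample data are made up.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TEI = "{" + TEI_NS + "}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Serialise TEI elements without an "ns0:" prefix.
ET.register_namespace("", TEI_NS)


def apply_sex_corrections(list_person_xml, tsv_text):
    """Set <sex value="..."> from a TSV; return (new_xml, number_changed)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    corrections = {row["id"]: row["sex"] for row in reader}
    root = ET.fromstring(list_person_xml)
    changed = 0
    for person in root.iter(TEI + "person"):
        new_value = corrections.get(person.get(XML_ID))
        if new_value is None:
            continue  # no correction listed for this person
        sex = person.find(TEI + "sex")
        if sex is None:
            # Appended as the last child; a real merge may need to respect
            # the element order required by the schema.
            sex = ET.SubElement(person, TEI + "sex")
        if sex.get("value") != new_value:
            sex.set("value", new_value)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Made-up sample data for illustration only.
SAMPLE_LIST = """<listPerson xmlns="http://www.tei-c.org/ns/1.0">
  <person xml:id="DrpićAnte"/>
  <person xml:id="KuharMaja"/>
  <person xml:id="ExampleKnown"><sex value="M"/></person>
</listPerson>"""

SAMPLE_TSV = "id\tsex\nKuharMaja\tF\nExampleKnown\tM\n"

if __name__ == "__main__":
    new_xml, n_changed = apply_sex_corrections(SAMPLE_LIST, SAMPLE_TSV)
    print(n_changed)
    print(new_xml)
```

Persons not listed in the TSV are left untouched, so running the merge a second time with the same TSV changes nothing.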
HR sex has now been fixed, so, closing.
This is closed, but just wanted to say - thanks.
The current HR corpus has a lot of persons without the <sex> element, even though in Croatian it is easy to determine based on the person's forename. Examples: DrpićAnte, KačanMate, ŠimonovićEinwalterTena, KuharMaja etc.