clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

BA: missing person gender #817

Closed TomazErjavec closed 6 months ago

TomazErjavec commented 11 months ago

The current BA corpus has a number of persons without the <sex> element, even though it is easy to determine based on the person's forename.
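A minimal sketch of how the affected persons could be listed, assuming the person list uses the TEI namespace and that each <person> carries an xml:id. The sample data and the IDs `PersonA` and `PersonB` are made up for illustration and are not taken from the BA corpus.

```python
import xml.etree.ElementTree as ET

# Assumption: the person list lives in the TEI namespace.
TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def persons_without_sex(xml_text):
    """Return the xml:id of every <person> that has no <sex> child."""
    root = ET.fromstring(xml_text)
    missing = []
    for person in root.iter(TEI + "person"):
        if person.find(TEI + "sex") is None:
            missing.append(person.get(XML_ID))
    return missing


# Hypothetical sample; IDs and structure are illustrative only.
SAMPLE = """<listPerson xmlns="http://www.tei-c.org/ns/1.0">
  <person xml:id="PersonA">
    <persName><surname>Ilić</surname><forename>Dragutin</forename></persName>
    <sex value="M"/>
  </person>
  <person xml:id="PersonB">
    <persName><surname>Tomić</surname><forename>Jadranko</forename></persName>
  </person>
</listPerson>"""

print(persons_without_sex(SAMPLE))
```

This only reports which entries lack the element; deciding the value from the forename is a separate step.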

TomazErjavec commented 6 months ago

As explained in #815, there is now a procedure in place to add missing gender. But before doing that, I noticed that some forenames and surnames are mixed up in BA. I only tested for forenames ending in '-ić' and corrected those; I'm sure there are more that I haven't noticed. As a side effect of this fix, we now have duplicated persons - one where the order was already correct, and one where it was corrected - and even though both now have the correct names, they still have different person IDs (e.g. BoškoŠiljeković vs. ŠiljekovićBoško). Hopefully this will be fixed in the future. This is the list of changes:

284,285c284,285
<          <surname>Ristan</surname>
<          <forename>Ristić</forename>
---
>          <surname>Ristić</surname>
>          <forename>Ristan</forename>
7692,7693c7692,7693
<          <surname>Boško</surname>
<          <forename>Šiljeković</forename>
---
>          <surname>Šiljeković</surname>
>          <forename>Boško</forename>
8052,8053c8052,8053
<          <surname>Dragutin</surname>
<          <forename>Ilić</forename>
---
>          <surname>Ilić</surname>
>          <forename>Dragutin</forename>
8250,8251c8250,8251
<          <surname>Anto</surname>
<          <forename>Spajić</forename>
---
>          <surname>Spajić</surname>
>          <forename>Anto</forename>
8856,8857c8856,8857
<          <surname>Jadranko</surname>
<          <forename>Tomić</forename>
---
>          <surname>Tomić</surname>
>          <forename>Jadranko</forename>
9048,9049c9048,9049
<          <surname>Muharem</surname>
<          <forename>Imamović</forename>
---
>          <surname>Imamović</surname>
>          <forename>Muharem</forename>
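The '-ić' test described above could be sketched roughly as follows. This is not the script that produced the diff, only an illustration under the assumption that names are stored as <surname> and <forename> children of <persName> in the TEI namespace. The function name, the sample data, and the IDs `p1` and `p2` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: names are encoded in the TEI namespace.
TEI = "{http://www.tei-c.org/ns/1.0}"


def swap_suspect_names(xml_text, suffix="ić"):
    """Swap surname and forename where only the forename ends in `suffix`.

    Returns the modified tree root and a list of (forename, surname)
    pairs as they read after the swap.
    """
    root = ET.fromstring(xml_text)
    swapped = []
    for pers_name in root.iter(TEI + "persName"):
        surname = pers_name.find(TEI + "surname")
        forename = pers_name.find(TEI + "forename")
        if surname is None or forename is None:
            continue
        fore_text = forename.text or ""
        sur_text = surname.text or ""
        # Heuristic: a forename ending in "-ić" next to a surname that
        # does not end in "-ić" suggests the two fields are exchanged.
        if fore_text.endswith(suffix) and not sur_text.endswith(suffix):
            surname.text, forename.text = fore_text, sur_text
            swapped.append((forename.text, surname.text))
    return root, swapped


# Hypothetical sample: the first person is swapped, the second is correct.
SAMPLE = """<listPerson xmlns="http://www.tei-c.org/ns/1.0">
  <person xml:id="p1">
    <persName><surname>Boško</surname><forename>Šiljeković</forename></persName>
  </person>
  <person xml:id="p2">
    <persName><surname>Ilić</surname><forename>Dragutin</forename></persName>
  </person>
</listPerson>"""

fixed_root, changes = swap_suspect_names(SAMPLE)
print(changes)
```

The heuristic only catches surnames with this one ending, so any swapped names with other endings would go unnoticed, and each reported change still needs a manual check.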

nljubesi commented 6 months ago

Thanks for this as well. The different IDs are due to you exchanging the surname and forename, but not changing the ID? Would that not be simple to resolve once we know what is the correct name and what the correct surname?

We did not do any work on the Bosnian data ourselves, but obtained them from our upstream source, so we know very little about what issues the data might have.

TomazErjavec commented 6 months ago

> The different IDs are due to you exchanging the surname and forename, but not changing the ID?

Yes, exactly.

> Would that not be simple to resolve once we know what is the correct name and what the correct surname?

Well, it is simple in that it is clear what needs to be done - but you need to go through all the files and replace the IDs, so together with testing that nothing is messed up, it might take a while. More than I would gladly invest, esp. as these mistakes cropped up by chance; who knows how many others are lurking in there...

However, I will re-open this issue, maybe somebody finds the time. In case that turns out to be @nljubesi, let me know beforehand, as the source data has now been fixed and you would need to get that copy.
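For whoever picks this up, the replacement step could look roughly like the sketch below. It assumes speeches point to speakers with a who="#ID" attribute and only redirects those pointers; merging or deleting the duplicate <person> entries is not covered. The function name is hypothetical, and which of the two IDs should be kept is an open choice, so the direction of the mapping here is purely illustrative.

```python
import re

# Assumption: speaker references look like who="#SomeId".
POINTER = re.compile(r'(\bwho=")#([^"]+)(")')


def redirect_speakers(text, id_map):
    """Rewrite who="#old" pointers to who="#new" for every old ID in id_map."""
    def repl(match):
        old_id = match.group(2)
        new_id = id_map.get(old_id, old_id)  # unknown IDs stay unchanged
        return f"{match.group(1)}#{new_id}{match.group(3)}"
    return POINTER.sub(repl, text)


# Illustrative mapping only: the ID to keep has not been decided in this thread.
ID_MAP = {"ŠiljekovićBoško": "BoškoŠiljeković"}

SAMPLE = (
    '<u who="#ŠiljekovićBoško" xml:id="u1">...</u>'
    '<u who="#IlićDragutin" xml:id="u2">...</u>'
)

print(redirect_speakers(SAMPLE, ID_MAP))
```

Running this over every file and then re-validating the corpus is the part that takes time; the sketch says nothing about that validation.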

TomazErjavec commented 6 months ago

Sex has been added to BA. As regards the wrong forename/surname distinction, this is now discussed in #852.