davidam / damegender

Toolkit for detecting gender from names, written in Python and Bash
https://damegender.davidam.com
GNU General Public License v3.0

Italian names aggregation #30

Closed lorelupo closed 1 year ago

lorelupo commented 1 year ago

Hi David, thank you for this helpful resource.

I need to detect the gender of Italian names with the best precision possible, even at the expense of some recall. I noticed that you use the list of Italian names released by Istat, but that you also use a list of names from a personal GitHub repository: https://raw.githubusercontent.com/mrblasco/genderNamesITA/master/gender_firstnames_ITA.csv

How did you aggregate these data?

davidam commented 1 year ago

Hello Lorenzo,

Thank you for your interest in DameGender. The trusted source of data is https://www.istat.it/en/analysis-and-products/interactive-contents/baby-names. I don't know how to download these data without scraping. From my point of view, https://raw.githubusercontent.com/mrblasco/genderNamesITA/master/gender_firstnames_ITA.csv is probably the result of scraping, but I need external open data files to check that it is a good Italian dataset. For example, the births in Roma would be a good test dataset for checking this.

If you have a budget and a specific problem, I could help you; we can discuss your task in more detail by email.

davidam commented 1 year ago

The files itfemales.csv and itmales.csv were downloaded from https://demo.istat.it, as you can see in download.sh, but they contain very few names.

davidam commented 1 year ago

I have been calculating the accuracy of the dataset gender_firstnames_ITA.csv. My result is 0.8166, using the INTER dataset as ground truth and inferring from first names only.

For comparison, Spain is 0.89 and Uruguay is 0.93. Although the Italian figure is lower, I think the dataset could still be good: few countries use the Italian language, so the deviation from INTER could be higher.

For example, names such as Rosario, Andrea, ... have the opposite gender in Italian from what is commonly assumed elsewhere.
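For a precision-first use case, one possible way to handle such names is to keep only the names on which two sources agree. This is a minimal sketch with invented toy data, not necessarily how DameGender aggregates its sources; the function name `agreed_genders` and both dictionaries are hypothetical.

```python
def agreed_genders(source_a, source_b):
    """Keep only names that both sources label with the same gender."""
    return {
        name: gender
        for name, gender in source_a.items()
        if source_b.get(name) == gender
    }


# Toy data, invented for illustration only (not taken from any real dataset).
italian_list = {"andrea": "male", "rosario": "male", "giulia": "female"}
international_list = {"andrea": "female", "rosario": "female", "giulia": "female"}

# Names with conflicting labels are dropped, trading recall for precision.
print(agreed_genders(italian_list, international_list))
```

With this toy input only "giulia" survives, because the two lists disagree on "andrea" and "rosario".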

Another data point: using the 60 Istat names (the most used names) as ground truth, the accuracy is 100%.
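To make clear what "accuracy" means in these figures, here is a minimal sketch of the calculation, assuming each dataset is reduced to a name-to-gender mapping. The function name `accuracy` and the toy data are hypothetical; the actual DameGender scripts may compute it differently.

```python
def accuracy(predicted, truth):
    """Share of ground-truth names whose predicted gender matches."""
    # Only names present in both mappings are scored (an assumption of this sketch).
    scored = [name for name in truth if name in predicted]
    if not scored:
        return 0.0
    hits = sum(1 for name in scored if predicted[name] == truth[name])
    return hits / len(scored)


# Toy data, invented for illustration only.
predicted = {"andrea": "male", "rosario": "male", "giulia": "female", "luca": "male"}
truth = {"andrea": "female", "rosario": "male", "giulia": "female", "luca": "male"}

print(accuracy(predicted, truth))
```

Here three of the four names match, so a single name labelled differently in the reference set is enough to lower the score noticeably on a small sample.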

davidam commented 1 year ago

Closing this due to inactivity. Feel free to reopen it if you run into the issue again.