Closed lorelupo closed 1 year ago
Hello Lorenzo,
Thank you for your interest in DameGender. The trusted source of data is https://www.istat.it/en/analysis-and-products/interactive-contents/baby-names. I don't know how we can download these data without scraping. From my point of view, https://raw.githubusercontent.com/mrblasco/genderNamesITA/master/gender_firstnames_ITA.csv is probably the result of scraping, but I need external open data files to check that it is a good Italian dataset. For example, the births in Rome would be a good test dataset to check this idea.
If you have a budget and a specific problem, I could help you further; we can discuss your task by email.
The files itfemales.csv and itmales.csv were downloaded from https://demo.istat.it, as you can see in download.sh, but they contain very few names.
I have calculated the accuracy of the dataset gender_firstnames_ITA.csv: the result is 0.8166, using the INTER dataset as ground truth and inferring gender from first names only.
For comparison, Spain is 0.89 and Uruguay is 0.93. Although the Italian value is lower, I think the dataset could still be good: few countries use the Italian language, so the deviation from the INTER dataset could be higher.
So, names such as Rosario, Andrea, ... have the opposite gender in Italian to the one commonly assumed elsewhere.
Another data point: the accuracy using the 60 names from Istat (the most used names) as ground truth is 100%.
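To make the accuracy figures above concrete, here is a minimal sketch of how such a score can be computed: the share of names in a reference list whose gender in the dataset under test matches the reference. The function name `accuracy` and the toy names and labels are hypothetical, chosen only for illustration; this is not the DameGender implementation and the values are not taken from gender_firstnames_ITA.csv or the INTER dataset.

```python
def accuracy(predictions, truth):
    """Return the fraction of names in `truth` whose predicted gender matches.

    Names missing from `predictions` count as errors.
    """
    if not truth:
        raise ValueError("truth must not be empty")
    correct = sum(
        1 for name, gender in truth.items() if predictions.get(name) == gender
    )
    return correct / len(truth)


# Hypothetical toy data: an Italian-style list used as the predictor...
predicted = {
    "andrea": "male",
    "rosario": "male",
    "maria": "female",
    "luca": "male",
    "giulia": "female",
}
# ...and a made-up international-style reference used as ground truth.
ground_truth = {
    "andrea": "female",
    "rosario": "female",
    "maria": "female",
    "luca": "male",
    "giulia": "female",
}

score = accuracy(predicted, ground_truth)
print(f"accuracy = {score:.4f}")
```

In this toy case the two disagreements come from Andrea and Rosario, which is the kind of deviation a cross-country reference can introduce.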
Closing it due to inactivity. Feel free to reopen it if you find the issue again.
Hi David, thank you for this helpful resource.
I need to detect the gender of Italian names with the best precision possible, possibly at the expense of some recall. I noticed that you use the list of Italian names released by Istat, but that you also use a list of names from a private GitHub repository: https://raw.githubusercontent.com/mrblasco/genderNamesITA/master/gender_firstnames_ITA.csv
How did you aggregate these data?
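For the precision-over-recall goal described in this question, one common approach is to return no answer for names that are rare or ambiguous. The sketch below assumes per-name male and female counts are available, which may not be the case for the files discussed here; `guess_gender`, the thresholds, and the counts are all hypothetical and not part of DameGender.

```python
def guess_gender(name, counts, min_share=0.95, min_total=10):
    """Return "male", "female", or None when the evidence is too weak.

    `counts` maps a lowercase name to a (males, females) tuple.
    Abstaining (None) lowers recall but avoids low-confidence guesses.
    """
    males, females = counts.get(name.lower(), (0, 0))
    total = males + females
    if total < min_total:
        return None  # too few observations to trust
    if males / total >= min_share:
        return "male"
    if females / total >= min_share:
        return "female"
    return None  # ambiguous name


# Hypothetical counts, for illustration only.
toy_counts = {
    "luca": (990, 10),
    "andrea": (900, 100),
    "maria": (5, 995),
    "zeno": (3, 0),
}

for name in ("Luca", "Andrea", "Maria", "Zeno", "Unknown"):
    print(name, guess_gender(name, toy_counts))
```

Raising `min_share` or `min_total` makes the function abstain more often, which is the trade-off between precision and recall mentioned above.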