Open PierreMesure opened 10 months ago
Yes, you can contribute in https://github.com/davidam/damegender/tree/master/src/damegender/files/names/names_se. You could work on a good download.sh and suggest ways to measure accuracy with Swedish names; you could try the Wikidata Swedish names or other good sources. The damegender command accuracy.py is your friend.
Good Luck!
Hi David,
Amazing project! I actually found out about it after I made one built on the exact same principle, using Swedish data, for my own needs. I just published the code here.
I'm both frustrated and happy I found your project (as well as name-dataset) because I couldn't find anything when I first looked and felt like I had to write my own code. But now that I've done it, I'm bummed someone implemented it better and with more data. Oh well... 😊
Anyway, I'm reaching out because I saw that you are using newborn data for Sweden. I've been using a different dataset which I think works better. SCB has a list of all the names borne by at least two people living in Sweden (first, middle and last names). They can be found on this page (the files called Namnsök 2021 and 2022).
I did the math and this amounts to 98% of the population (i.e. 2% of the population have a unique name and are hence not in this list). So it's far more exhaustive than the lists of newborns, even if you go back a few decades. In total, there are 97386 unique first names, compared with the 2076 in your newborn dataset.
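For reference, the coverage calculation can be sketched roughly as below. This is only an illustration: the `coverage` function, the toy counts and the toy population are made up, and it assumes each person is counted under exactly one first name, which may not match how the SCB files are laid out.

```python
def coverage(name_counts, population):
    """Share of the population whose name appears in the list.

    name_counts: hypothetical mapping of name -> number of bearers,
                 containing only names borne by at least two people.
    population:  total number of people, including those with unique names.
    """
    covered = sum(name_counts.values())
    return covered / population


# Toy, made-up numbers (not SCB data): 10 people share a name,
# 2 people have a unique name and are therefore missing from the list.
toy_counts = {"Anna": 5, "Erik": 3, "Maja": 2}
toy_population = 12

share = coverage(toy_counts, toy_population)
print(f"{share:.0%} of the toy population is covered")
```

With the real files, the same ratio would be the sum of the bearer counts divided by the population of Sweden for that year.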
Would you be interested in a PR to use this dataset instead?