JuliaText / NameToGender.jl

Guess gender based on first name

Improve the dataset #2

Open aviks opened 6 years ago

aviks commented 6 years ago

This is more feedback than an issue; I'm not sure it's actionable.

I tried this on a real-world list of almost 18K names, and got a hit rate of around 34%.
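For anyone wanting to reproduce this kind of measurement, a minimal sketch of a hit-rate calculation follows. It does not use the package's API: `KNOWN` is a hypothetical stand-in for whatever lookup table is in use, and `hit_rate` is an illustrative helper name. A hit here just means the name has an entry at all.

```julia
# Hypothetical stand-in for a lookup table: lowercase name => gender label.
# (Assumption for illustration; not the package's real data structure.)
const KNOWN = Dict("alice" => :female, "bob" => :male)

# Fraction of the input names that have any entry in the table.
function hit_rate(names, table)
    isempty(names) && return 0.0
    hits = count(n -> haskey(table, lowercase(strip(n))), names)
    return hits / length(names)
end

sample_names = ["Alice", "bob ", "Xiaoming", "Oluwaseun"]
rate = hit_rate(sample_names, KNOWN)   # 2 of the 4 names are found
```

At a rate of around 34% on almost 18K names, roughly 6,000 names would have matched.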

oxinabox commented 6 years ago

With some effort a new database could be constructed. Governments tend to release statistics on how popular each name is by year and sex.

This dataset does the USA: https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data

This dataset does Australia: https://data.gov.au/dataset/popular-baby-names

England and Wales: https://data.gov.uk/dataset/afe1871f-dede-41bf-a6ba-0a1d32217cdb/baby-names-england-and-wales

Northern Ireland: https://data.gov.uk/dataset/9ebaf276-f4d5-41e9-bf22-b7ccab8cf85e/full-list-of-first-forenames-given-to-babies-registered-in-northern-ireland
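As a rough sketch of how such per-year statistics could be folded into a name database: the code below assumes each record is a line of the form `Name,Sex,Count` (the layout I'd expect from the USA files; the other sources would need their own parsers). `tally!` and `female_share` are illustrative names, and the counts are made up.

```julia
# Accumulate per-name counts by sex from lines shaped like "Name,Sex,Count".
# The line format is an assumption; adjust the parsing per data source.
function tally!(counts::Dict{String,Dict{Char,Int}}, lines)
    for line in lines
        parts = split(strip(line), ',')
        length(parts) == 3 || continue           # skip malformed lines
        isempty(parts[2]) && continue
        name = lowercase(parts[1])
        sex = first(parts[2])                    # expected 'F' or 'M'
        n = parse(Int, parts[3])
        bysex = get!(counts, name) do
            Dict{Char,Int}()
        end
        bysex[sex] = get(bysex, sex, 0) + n
    end
    return counts
end

# Share of recorded uses of a name that were female, or `nothing` if unseen.
function female_share(counts, name)
    bysex = get(counts, lowercase(name), nothing)
    bysex === nothing && return nothing
    f = get(bysex, 'F', 0)
    m = get(bysex, 'M', 0)
    return f / (f + m)
end

# Made-up counts standing in for two yearly files; not real statistics.
year_a = ["Alex,M,90", "Alex,F,10", "Mary,F,50"]
year_b = ["Alex,M,60", "Alex,F,40"]
counts = Dict{String,Dict{Char,Int}}()
tally!(counts, year_a)
tally!(counts, year_b)
```

Keeping the raw counts, rather than only a label, leaves room to choose a confidence threshold later.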

I wouldn't be surprised if name usage were Zipfian, so there would be truly vast numbers of very rare names.
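For reference, the Zipfian guess can be written as follows, where $f(r)$ is the frequency of the $r$-th most common name and $C$, $s$ are constants that would have to be fitted (the symbols are mine, and nothing here has been checked against real name data):

```latex
f(r) = \frac{C}{r^{s}}, \qquad s > 0
```

If that holds, frequency falls off as a power of rank, which is what would produce a long tail of rare names that a small dataset misses.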

oxinabox commented 6 years ago

Julio Raffo, 2016. "Worldwide Gender-Name Dictionary," WIPO Economics & Statistics Related Resources 10, World Intellectual Property Organization - Economics and Statistics Division.

Raffo created this dataset from several sources, including various government statistics, Facebook and Wikipedia. https://ideas.repec.org/c/wip/eccode/10.html

It has 6.2 million names for 182 different countries. It only works to a resolution of Male, Female or Androgynous, gives no count information, and is case-insensitive only. But that is all fine.

Making that work would mean adding DataDeps.jl as a dependency, because the dataset is nontrivial in size, and writing an alternate loading function using CSVFiles.jl. It would also mean adding the definition of which country codes are accepted to the Detector type, since they vary between the datasets.
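To make the country-code point concrete, here is a minimal stdlib-only sketch of a detector that records which country codes its data covers. Everything in it is an assumption for illustration: `CountryAwareDetector`, `load_detector` and `classify` are hypothetical names (not the package's existing Detector), the `name,code,gender` row layout and the `M`/`F`/`?` labels are guesses at the file format, and the DataDeps.jl download and CSVFiles.jl parsing steps are replaced by reading lines from an `IO`.

```julia
# Hypothetical detector that carries its own set of accepted country codes.
struct CountryAwareDetector
    table::Dict{Tuple{String,String},Symbol}   # (lowercase name, country code) => label
    countries::Set{String}                     # country codes present in the data
end

# Assumed gender labels in the file; "?" is taken to mean androgynous.
const LABELS = Dict("M" => :male, "F" => :female, "?" => :androgynous)

# Build a detector from rows assumed to look like "name,code,gender".
function load_detector(io::IO)
    table = Dict{Tuple{String,String},Symbol}()
    countries = Set{String}()
    for line in eachline(io)
        parts = split(strip(line), ',')
        length(parts) == 3 || continue             # skip malformed rows
        label = get(LABELS, String(parts[3]), nothing)
        label === nothing && continue              # also skips a header row
        name, code = lowercase(parts[1]), uppercase(parts[2])
        table[(name, code)] = label
        push!(countries, code)
    end
    return CountryAwareDetector(table, countries)
end

# Case-insensitive lookup; rejects country codes the dataset does not cover.
function classify(d::CountryAwareDetector, name::AbstractString, country::AbstractString)
    code = uppercase(country)
    code in d.countries || throw(ArgumentError("unsupported country code: $code"))
    return get(d.table, (lowercase(name), code), :unknown)
end

# Made-up rows for illustration only.
rows = IOBuffer("name,code,gender\nandrea,IT,M\nandrea,US,F\nkim,US,?\n")
det = load_detector(rows)
```

Deriving the accepted codes from the loaded data, rather than hard-coding them, is one way to cope with the codes varying between datasets.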