EricPostMaster / Are-You-Irish-Classifier

Streamlit application that uses Naive Bayes to assign users an "Irish-ness score" (the Murphy Index). ☘ Development process has application to spam detection on short-length documents.
MIT License

Add more names to the dataset #5

Open EricPostMaster opened 2 years ago

EricPostMaster commented 2 years ago

Initial dataset only uses names from Ireland, US, UK, and India. We need representation from many more countries.

skflwright commented 2 years ago

I’m going to look into this… back ASAP.

EricPostMaster commented 2 years ago

@skflwright - Thanks so much for your help! I saw your comment on LinkedIn, and I love the idea of finding Irish immigrant names! The way the Naive Bayes model works is that it needs actual names and actual counts of those names. If there are 10 people with the last name "Murphy" and 2 people with the last name "MacDougal", then the letters in "Murphy" will have a higher probability of appearing in an Irish name. That doesn't necessarily mean "MacDougal" isn't a very Irish surname, but the counts potentially give us a better idea of the real distribution of letter combinations in real Irish names. Does that help clarify what I was thinking?
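To make the role of the counts concrete, here is a minimal sketch of a frequency-weighted character-bigram Naive Bayes. It is not the repository's actual code: the bigram features, the Laplace smoothing, the function names, and the "other" class with its two names are all assumptions for illustration. Only the Murphy/MacDougal counts come from the comment above.

```python
import math
from collections import Counter

def char_ngrams(name, n=2):
    """Split a name into overlapping character n-grams, with ^ and $ as boundary markers."""
    padded = f"^{name.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(name_counts_by_class, n=2):
    """Count n-grams per class, weighting each name by how many people carry it."""
    ngram_counts, class_totals = {}, {}
    for label, name_counts in name_counts_by_class.items():
        counts = Counter()
        for name, people in name_counts.items():
            for gram in char_ngrams(name, n):
                counts[gram] += people
        ngram_counts[label] = counts
        class_totals[label] = sum(name_counts.values())
    vocab = set().union(*ngram_counts.values())
    return ngram_counts, class_totals, vocab

def log_scores(name, model, n=2, alpha=1.0):
    """Return the unnormalized log-probability of each class for a name."""
    ngram_counts, class_totals, vocab = model
    total_people = sum(class_totals.values())
    scores = {}
    for label, counts in ngram_counts.items():
        score = math.log(class_totals[label] / total_people)  # class prior
        denom = sum(counts.values()) + alpha * len(vocab)      # Laplace smoothing
        for gram in char_ngrams(name, n):
            score += math.log((counts[gram] + alpha) / denom)
        scores[label] = score
    return scores

# Toy data: the "irish" counts mirror the example above; "other" is made up.
toy = {
    "irish": {"Murphy": 10, "MacDougal": 2},
    "other": {"Smith": 8, "Patel": 4},
}
model = train(toy)
scores = log_scores("Murphy", model)
```

Because "Murphy" contributes each of its bigrams ten times and "MacDougal" only twice, a bigram like "mu" ends up with a higher smoothed probability in the Irish class than "ac" does. A list where every surname appears exactly once would flatten that difference.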

EricPostMaster commented 2 years ago

I have been pulling Irish names from lists of politicians on Wikipedia (like this page: https://en.wikipedia.org/wiki/Members_of_the_32nd_D%C3%A1il). If we jump back several sessions of the Irish Dáil (maybe to the 20th-24th), we can definitely expand the name list, probably by several hundred.

If we can find more of those for different countries, then we can expand the breadth of non-Irish names.

skflwright commented 2 years ago

Oh! That makes sense. Okay, so the best resource so far is the National Archives of Ireland (over 50,000 records!), and they list all names by county. In terms of international names, I've learned that the Mormons, who tracked that kind of data, are now part of familysearch.org out of Utah. They have a billion+ records. Ten years ago, they started to digitize them, then had to cut a deal with Ancestry.com to move the process along because they could only get 200K volunteers (this story screams data science!). At any rate, we'd have to put in a formal request, and so far familysearch has only digitized 1/3 of their records (still over 16k!). The Irish Archives might be able to help me. The Griffith resource I mentioned only has Ireland in 1864, but that would cover most Irish names ... trouble is, many left for the US between 1820 and 1860. There is another site that tracks genealogy internationally, but again it has limited search capability. Can you use the GitHub link above? I'll get back when I learn more.

EricPostMaster commented 2 years ago

@skflwright Sounds great. A small number of Irish names would probably be fine, like 1k-2k, max. Any more than that and we'll have so many Irish names that it might make the data imbalanced. Of course, we need to add names from other countries, and as we add those we'll have more room for Irish names as well. Also, once we get the names, then we need to clean them, so hundreds of names might be easier to work with than thousands 😅

p.s. What GitHub link are you referring to? I don't see anything in your previous comment.

skflwright commented 2 years ago

This one: https://github.com/gaois/IrishSurnameIndex/blob/master/surnames.xml

EricPostMaster commented 2 years ago

Ah, yes, I can see it. It's not quite what we need for this exact type of model because each surname occurs once and only once, but I think it would be worth it to train a model on it and give it a try! Now we just have to get it imported (pd.read_xml) and cleaned. The cleaning is definitely a challenge.
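As a rough illustration of the cleaning step, here is a minimal sketch that normalizes one raw surname string. Loading is left to pd.read_xml as mentioned above. The tag layout of surnames.xml is not described in this thread, so the sketch covers only the cleaning. The function name and every rule in it (stripping accents, lowercasing, keeping only letters, apostrophes, hyphens, and spaces) are assumptions rather than the project's actual cleaning rules.

```python
import re
import unicodedata

def clean_surname(raw):
    """Normalize one raw surname string; return None if nothing usable remains."""
    if raw is None:
        return None
    # Treat the typographic apostrophe as a plain one (O’Brien -> O'Brien).
    text = str(raw).replace("\u2019", "'")
    # Decompose accented letters (e.g. "Ó") and drop the combining marks.
    # Assumption: the model should see unaccented letters.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, then keep only ASCII letters, apostrophes, hyphens, and spaces.
    text = re.sub(r"[^a-z' -]", "", text.lower())
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text or None

examples = ["  Ó Murchú ", "O\u2019Brien", "Mac  Dougal", "123"]
cleaned = [clean_surname(name) for name in examples]
```

Whether to strip accents at all is a design choice. Keeping them would preserve letter combinations that are distinctive in Irish-language surnames, at the cost of mismatches against anglicized input.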

skflwright commented 2 years ago

Okay, I've been pulling from the National Archives' 1911 Census. There are over 500K records, each designated by county. I started to parse out the number of records by county so we could at least get a good representation of names by area. Let's say we want to work with 5000 records. What should we use to determine how many to pull from each county? I know from my own family history that you could tell where a person came from by their name (usually tied to whether they were from the south, north, east, or west). So, there were 32 counties in 1911 (this is before the island was partitioned in 1920-21), which is good because we'd get all Irish. Sadly, I already have 3k names from Dublin! If we were to give fair representation of names from each place with only 5K, that'd be about 156 each, which means we would have to omit repeats of names, which kills the model weights. Let's say we took 1k names from each county... that would better depict the variety of names and still give us repeats for weighting. Here's the link: http://www.census.nationalarchives.ie/search/

I have to crash now as I have to get up at 6. Let me know what you'd like to do. I can download 1k from each, which would give us 32K records, but I don't know if I can do it by Thursday. The key is to just enter the county in the search and then adjust from there. Here are the Dublin names and frequencies. The records are unique (as can be gleaned from the ages and the year, 1911).

Irish Names.xlsx
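
The per-county sampling question above comes down to allocation arithmetic. The sketch below compares an equal split with a split proportional to each county's record count. The function names and the three county sizes are hypothetical placeholders, since the thread gives no real per-county totals. The proportional split uses the largest-remainder method, swapped in here as one reasonable choice, not something proposed in the thread.

```python
def equal_allocation(total, n_groups):
    """Records per group when the budget is split evenly (rounded down)."""
    return total // n_groups

def proportional_allocation(total, group_sizes):
    """Split the budget in proportion to group size (largest-remainder method)."""
    grand_total = sum(group_sizes.values())
    quotas = {g: total * size / grand_total for g, size in group_sizes.items()}
    alloc = {g: int(q) for g, q in quotas.items()}
    # Hand out the records lost to rounding, largest fractional part first.
    leftover = total - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda g: quotas[g] - alloc[g], reverse=True)
    for g in by_remainder[:leftover]:
        alloc[g] += 1
    return alloc

per_county_equal = equal_allocation(5000, 32)  # 5000 records over 32 counties
total_at_1k_each = 1000 * 32                   # 1k records from every county

# Hypothetical county sizes, for illustration only.
sizes = {"Dublin": 3000, "Cork": 2000, "Leitrim": 500}
alloc = proportional_allocation(550, sizes)
```

An equal split keeps small counties visible but gives each only about 156 records from a 5000-record budget. A proportional split mirrors the census but lets Dublin dominate. Sampling whole records rather than distinct names keeps the repeats the model needs for weighting in either case.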