Vandivier / data-science-practice

practicing basic data skills

Not Urgent.. Genderize minor edits #16

Open Mbjoerkh opened 6 years ago

Mbjoerkh commented 6 years ago

Big picture: Genderize has worked very well! Below are 2 suggested edits.
1) A couple hundred entries (both fellows and sponsors) have no predicted gender because the name we've genderized is a middle initial, or the initial of a two-part first name, instead of a full first name.

Suggested solution: after the comma that follows the last name, split the text by spaces into separate names and genderize each one. Example: for entry #318_4, "BRADY, J. Mark", we should genderize both "J." and "Mark". This has the added benefit of solving problem 2).

2) Occasionally there are multiple first names that predict differently. See sponsor "Lee Robert Johnston": Lee yields a 75% chance of male, but Robert yields 100%. Jointly we can accept the name as male, though depending on the threshold we may not currently do so. More importantly, I believe there are cases where the wrong gender is assigned because of this.
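To make the two suggestions concrete, here is a minimal Python sketch. It is illustration only, not the repo's actual code: the function names are hypothetical, the plain average in `joint_male_probability` is just one possible pooling rule, and no Genderize call is made (the 0.75 and 1.0 figures are the ones quoted above).

```python
import re


def split_given_names(entry: str) -> list[str]:
    """Return the given-name parts that follow the 'LASTNAME,' prefix."""
    _, _, given = entry.partition(",")
    return given.split()


def is_initial(part: str) -> bool:
    """True for a bare initial such as 'J' or 'J.'."""
    return re.fullmatch(r"[A-Za-z]\.?", part) is not None


def joint_male_probability(predictions: dict[str, float]) -> float:
    """Hypothetical pooling rule: unweighted mean of the per-name P(male)."""
    return sum(predictions.values()) / len(predictions)


parts = split_given_names("BRADY, J. Mark")
# Flag bare initials so they can be skipped or tracked separately.
names_to_genderize = [p for p in parts if not is_initial(p)]
print(parts)               # ['J.', 'Mark']
print(names_to_genderize)  # ['Mark']

# Per-name probabilities as quoted for "Lee Robert Johnston".
print(joint_male_probability({"Lee": 0.75, "Robert": 1.0}))  # 0.875
```

Whether to average, take the maximum, or keep each name's prediction as its own column is a design choice; the sketch only shows that splitting first makes any of those options possible.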

P.S. I completely understand if you wish we'd done this in Stata so I could contribute more to the legwork... (I plan on "starting" the Stata and analysis bit later today.)

P.P.S. Is the difference between the output and ordered-output files that the latter includes the non-adjacent entries?

Vandivier commented 6 years ago

1 - Great suggestion about the, let's call it, subname genderization. I would want to keep these factors separate, by the way, so that we are clear when we are using a middle name vs. a first name. I understand there is likely no effect, but it is worth testing rather than assuming. (E.g., my brother's name is Michael Lauren Vandivier... the gender assumption would fail on M. Lauren in this case.)

2 - I'm happy to use Stata as little as possible: it's great for the statistical operations, but I don't like it for wrangling.

3 - ordered-output is guaranteed to be in input order :) output is not. Previously, output happened to be in order (that is, input record one was associated with output CSV row one), but this stopped being the case after integrating Genderize, because those HTTP requests are async and may return out of order.
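As a rough illustration of point 3, the Python sketch below shows why async lookups can come back out of order and how tagging each request with its input index lets the order be restored. It is an assumption-laden sketch, not the code in this repo: `fake_genderize` is a stand-in that sleeps for a random time instead of calling any real API.

```python
import asyncio
import random


async def fake_genderize(name: str) -> str:
    """Stand-in for an HTTP lookup; sleeps a random time, hits no real API."""
    await asyncio.sleep(random.uniform(0, 0.05))
    return f"result-for-{name}"


async def lookup(index: int, name: str) -> tuple[int, str, str]:
    """Carry the input index along with the response."""
    return index, name, await fake_genderize(name)


async def main(names: list[str]):
    tasks = [asyncio.create_task(lookup(i, n)) for i, n in enumerate(names)]
    # Completion order depends on response time, so it may not match input order.
    unordered = [await t for t in asyncio.as_completed(tasks)]
    # Sorting on the carried input index restores the input order.
    ordered = sorted(unordered)
    return unordered, ordered


unordered, ordered = asyncio.run(main(["Lee", "Robert", "Mark"]))
print([row[1] for row in ordered])  # ['Lee', 'Robert', 'Mark']
```

The `unordered` list will vary from run to run; only `ordered` is stable, which is the property the ordered-output file is described as guaranteeing.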

Mbjoerkh commented 6 years ago

1) Completely agree. 2) Works for me! 3) OK. So it's ordered-output that should be used for analysis? And this explains why output has 220 more rows than ordered-output? (It's probably not important for me to understand this, just making sure it's not an error.)

(Sent by email in reply to John Vandivier's comment above; notification dated Wed, Jan 31, 2018, 10:42 AM.)

Mbjoerkh commented 6 years ago

Don't worry about the genderizing for now. Garett Jones (well, a podcast with him) convinced me it's worth throwing $100 at a fancier classifier. TTYL

Vandivier commented 6 years ago

Another concern is replication: any cost may have a large marginal negative effect on replication.

Actually, I'm facing the same issue with my Udacity study, and I've basically come to the same conclusion: a paid product is needed to make it feasible.

Even so, I'd like to leave this issue open so that a free replication variant can exist. It could serve as a double validation. As you mention, Name Prism is one free tool: http://www.name-prism.com/

Mbjoerkh commented 6 years ago

Completely agree with your points. I think Name Prism will be excellent for getting both "leaf nationality" and ethnicity. It is free and, in my opinion, the most appropriate tool out there for our purposes.

  1. Let me know if you want me to do the legwork with Name Prism. If I understand it correctly, I should be able to use the API without all that much computer wizardry.

  2. Let me know if you want me to submit a request on our behalf to avoid the rate limit of 1000 names a day. It's not necessary for Earhart, but it could be convenient: you have your Udacity study, and we could even quickly generate the "migration-adjusted Olympic medal standings", throw them in a blog post, and see if there's any interest. My hunch is that it's worth a shot! It shouldn't be more than 4-5 hours of total work, it's a novel application of a hot topic in various fields, and the Olympics is such a rare event that it may get some attention.
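If the limit is not lifted, one workaround is to split the name list into daily batches. The Python sketch below assumes the 1000-names-a-day figure mentioned above (not verified here); `daily_batches` is a hypothetical helper, the names are made-up placeholders, and no Name Prism request is actually made.

```python
from itertools import islice
from typing import Iterable, Iterator

DAILY_LIMIT = 1000  # figure quoted in the comment above; not verified here


def daily_batches(names: Iterable[str], limit: int = DAILY_LIMIT) -> Iterator[list[str]]:
    """Yield consecutive chunks of at most `limit` names, one chunk per day."""
    it = iter(names)
    while batch := list(islice(it, limit)):
        yield batch


names = [f"name{i}" for i in range(2500)]  # made-up placeholder names
batches = list(daily_batches(names))
print([len(b) for b in batches])  # [1000, 1000, 500]
```

At that rate a list of 2,500 names would take three days, which gives a quick way to judge whether requesting a higher limit is worth it.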