Gender Prediction Methods Based on Name

Hamedloghmani commented 1 year ago

Hello @gabrielrueda, You have been added to this repo and you can log and discuss your process here from now on as well as Trello.

gabrielrueda commented 1 year ago

Update

Here is an update: I researched how to detect gender based on name. First, I found a few APIs that detect gender, but they have request limits that would not be practical for us. Then I looked into three articles, labeled Methods 1-3 in the PDF. In the PDF I provide some rough notes from each article. Methods 1 and 3 take a similar approach, while Method 2 is different.

For preprocessing, Methods 1 and 3 split each name into characters and assign a number to each possible character. Method 2 uses a count vectorizer: it extracts substrings of a specified size (e.g. 2-4 characters), collects all the substrings that occur in the dataset, and then, for each name, counts the occurrences of each substring.

For the model, Methods 1 and 3 use a bidirectional LSTM. Method 2 uses logistic regression. From what I read, logistic regression is more lightweight but less accurate. However, I am not too familiar with these and will probably do more research on this.

Also, there are possible differences in patterns for names from different countries. We could possibly train different models for a few different countries, but I am not sure how practical this would be.

We can also account for names that are used interchangeably between two genders (e.g. Alex) and ignore those.

Let me know if you have any suggestions.

Notes

Method 1:

https://towardsdatascience.com/boy-or-girl-a-machine-learning-web-app-to-detect-gender-from-name-16dc0331716c
Input Modifications:
lowercase
split each character
pad empty spaces to make all names same length
encode characters to numbers
- space = 0, a = 1, b = 2, ...
Encode gender (F to 0 and M to 1)
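
The input modifications above can be sketched in a few lines of Python. This is only an illustration: the maximum length of 15, the helper name `encode_name`, and the choice to map any character outside a-z to 0 are assumptions, not details taken from the article.

```python
# Sketch of the Method 1 preprocessing; MAX_LEN and the handling of
# characters outside a-z are assumptions made for illustration.
MAX_LEN = 15

# space = 0, a = 1, b = 2, ...
CHAR_TO_ID = {" ": 0, **{chr(ord("a") + i): i + 1 for i in range(26)}}
GENDER_TO_ID = {"F": 0, "M": 1}


def encode_name(name: str, max_len: int = MAX_LEN) -> list:
    """Lowercase, split into characters, pad with spaces, map to integers."""
    chars = list(name.lower())[:max_len]          # truncate long names
    chars += [" "] * (max_len - len(chars))       # pad short names
    return [CHAR_TO_ID.get(c, 0) for c in chars]  # unknown characters -> 0


print(encode_name("Alex", max_len=6))  # [1, 12, 5, 24, 0, 0]
```

Every encoded name has the same length, which is what the embedding layer in the next step expects.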

NLP Model:

Embedding Layer

to embed each input character's encoded number into a dense 256 dimension vector.
embedding is a method used to represent discrete variables as continuous vectors

Bidirectional LSTM Layer

read the seq of character embeddings from the previous step and output a single vector representing that sequence
The values for units and dropouts are hyperparameters as well

Dense Layer

outputs single value close to 0 as 'F'
- close to 1 for 'M'
Not sure if we should also have threshold
- options for kind of male or kind of female
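
On the threshold question, one option is to leave an "uncertain" band around 0.5 instead of forcing every score to F or M. A minimal sketch, where the band limits 0.4 and 0.6 are arbitrary placeholders that would need tuning on validation data:

```python
def score_to_label(score: float, low: float = 0.4, high: float = 0.6) -> str:
    """Map a model output in [0, 1] to 'F', 'M', or 'uncertain'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if score <= low:
        return "F"          # close to 0 -> female
    if score >= high:
        return "M"          # close to 1 -> male
    return "uncertain"      # too close to 0.5 to call


for s in (0.05, 0.5, 0.93):
    print(s, score_to_label(s))
```

Names that land in the uncertain band could then be skipped, which also fits the idea of ignoring names used for both genders.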

Method 2:

https://pub.towardsai.net/predicting-name-gender-from-notebook-to-production-99e51d2aabd7

Generate Features

Count Vectorizer: a way to build vocabulary and features from a corpus automatically
- frequency of substrings in a certain string
Example (2-4) char count vectorizer for "Chris"
- ch
- hr
- ri
- is
- chr
- hri
- ris
- chri
- hris
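
To make the substring counting concrete, here is a small pure-Python sketch that produces the 2-4 character substrings of a single name. The function name `char_ngrams` is made up for this note, and lowercasing the name first is an assumption. As far as I know, scikit-learn's `CountVectorizer(analyzer="char", ngram_range=(2, 4))` builds the same kind of features over a whole dataset, but I have not checked whether the article uses exactly those settings.

```python
from collections import Counter


def char_ngrams(name: str, n_min: int = 2, n_max: int = 4) -> Counter:
    """Count every substring of length n_min..n_max in a lowercased name."""
    text = name.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


print(sorted(char_ngrams("Chris"), key=lambda g: (len(g), g)))
# ['ch', 'hr', 'is', 'ri', 'chr', 'hri', 'ris', 'chri', 'hris']
```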

For These few names: 2-4 grams

Logistic Regression

lightweight model
- if runtime is more important than accuracy -> this is a good option
other options: Decision Trees, Neural Networks, SVMs
input: the frequency of the repetition of the sub-strings defined above
- Example:

Improvements Suggested:

87% accuracy -> ways to improve this could be
- training one model per country -> feminine/masculine names may differ between countries
- vocabulary size could be too large for count vectorizer

Method 3:

https://maelfabien.github.io/machinelearning/NLP_7/#
working with dataset from France

Data Preprocess

removes accents from letters

LSTM

also uses LSTM algorithm -> character-level LSTM
used bidirectional LSTM

Hamedloghmani commented 1 year ago

Hi Gabriel,

Thank you so much for the update.

Great explanation. I'll take a deeper look at the methods that you kindly suggested and send you an update then we'll choose between these methods.

Thanks a lot.

Hamedloghmani commented 1 year ago

Hello @hosseinfani , @gabrielrueda I read and thought about the methods. Based on the pros and cons and their evaluation results, I think the third method is more accurate and practical. However, there are some significant concerns about using machine learning to predict gender based on first names:

• We need a massive dataset of names from around the world with very little or no bias in the distribution of names. This is because our target dataset (for example, DBLP) covers authors from all around the world, and a model fitted on a specific region or country will perform considerably worse in the prediction phase.

• The best method offered around 89% accuracy with potential signs of overfitting. Even if we round that up to 90%, it is still not sufficient in our case. Even on a small scale, we would have almost 10,000 misses out of 100,000. This is critical because we will conduct research on “fairness and bias” based on the results of this model and make assumptions from them. I think this will be a loose end if we continue based on noisy or faulty predictions.

To put it in a nutshell, I think paid query-based APIs will be our best option at the moment. As Gabriel kindly mentioned, there are a few, and I looked into 2 popular options:

The pricing seems reasonable, especially the second one.

I would be happy to hear your thoughts about my opinion.

hosseinfani commented 1 year ago

The second one worked pretty well on some samples. It also accepts a country.

Honestly, we can try both and make a vote.

How much will they cost us for dblp, imdb, gith?

for uspt, we have it already I believe.

Hamedloghmani commented 1 year ago

@hosseinfani

DBLP v12 has 4,894,081 records
IMDB title.basics.tsv.gz has 6,321,302 records
They sum up to 11,215,383 records. I think we have to get the $99/month plan.

I don't think this method will be effective on GitHub, though, because I examined some data and too many people use nicknames or similar handles instead of their names. But if we count that, we might have around 1,000,000 there.

hosseinfani commented 1 year ago

@Hamedloghmani How many months do we need? Is there any educational discount?

Hamedloghmani commented 1 year ago

@hosseinfani I'll check for an educational discount asap. We'll be done in about 1 month on the plan with 10,000,000 requests/month, which is $99/month.

gabrielrueda commented 1 year ago

@hosseinfani @Hamedloghmani It seems the second option (https://genderize.io/) is good based on it being more affordable than the other website, while having data from various countries.

I also had an idea to reduce number of requests we would need to make:

What if we kept our own record of name and gender every time we make a request? Then, if there were duplicate names in our datasets, we wouldn't have to make the request again. This could possibly reduce the 11,215,383 requests to a number below 10,000,000.
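
A rough sketch of the record-keeping idea, just to show where requests would be saved. The `lookup` argument stands in for whatever function ends up calling the API, and `fake_lookup` is a made-up placeholder so the sketch runs offline. Normalizing names with `strip().lower()` is also an assumption.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# (gender, probability); gender is None when the API cannot decide.
Result = Tuple[Optional[str], Optional[float]]


def label_with_cache(
    first_names: Iterable[str],
    lookup: Callable[[str], Result],
) -> Tuple[Dict[str, Result], int]:
    """Query `lookup` once per distinct name and count the requests made."""
    cache: Dict[str, Result] = {}
    requests_made = 0
    for name in first_names:
        key = name.strip().lower()      # "Alex" and "alex " share one entry
        if key not in cache:
            cache[key] = lookup(key)    # the only place a request is spent
            requests_made += 1
    return cache, requests_made


# Stand-in for the real API call, used only to make the sketch runnable.
def fake_lookup(name: str) -> Result:
    return ("male", 0.9) if name == "gabriel" else (None, None)


cache, used = label_with_cache(["Gabriel", "gabriel", "Hamed", "Gabriel "], fake_lookup)
print(used)  # 2 requests for 4 records
```

The number of requests then depends on the number of distinct first names rather than on the number of records.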

Hamedloghmani commented 1 year ago

@gabrielrueda I think it's a brilliant idea. It will add a bit of time complexity for us, but I think the number of requests will decrease significantly.

gabrielrueda commented 1 year ago

@Hamedloghmani Sounds good, I can begin to write a program that keeps track of duplicate names and records the gender for each name. Should I add the gender parameter to each person's record or make a new record of just person and gender? Also, should I start with the toy dataset from DBLP?

Hamedloghmani commented 1 year ago

Hi Gabriel, hope you are doing well. Last time we spoke in the issue page, I mentioned 2 APIs for retrieving gender from names. Before entering the final phase and buying one, we would like to run an experiment. I broke down the steps as follows:

1) Create an Excel or csv file with 100 random full names (first name and last name). Please try to include some diversity in these samples regarding gender and country of origin.
2) Using the free version of each API, get the results for each of these samples. gender-api takes the last name too, but genderize.io does not.
3) Compare the results and decide between the APIs based on the output of this experiment. The final output will have 4 columns: the first two are first name and last name, and the third and fourth are the gender results from each API.

Please note that genderize.io also has a link to a Python library as well as its API details, which might be helpful. You can log your process in the issue page and define tasks in Trello as well. Let me know what you think about it.

hosseinfani commented 1 year ago

@gabrielrueda @Hamedloghmani I assume the dataset is dblp, right? Also, keep track of which expert has which (firstname-lastname), because later we want to double check with the actual persons in the dataset. Something like this:

[40], Hossein, Fani, 0, 0
[42], Ali, Fani, 0, 1
...

0 being male, 1 being female, e.g., Hossein Fani is the 40th author in dblp

make sense?

Hamedloghmani commented 1 year ago

@hosseinfani Exactly, we are starting with dblp.

gabrielrueda commented 1 year ago

Hello @Hamedloghmani, I have made some observations. I took 100 names from the DBLP trying to include diversity (although I ended up with a majority of male names (77 vs 23)).

Observations:

96/100 names had the same results for both APIs
Gender-API had no NULL results
Genderize had 3 entries with a NULL result
Thus, only 1 of the names had a conflicting result (male vs female)
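
For reference, the kind of tally behind these observations can be sketched as below. This is not the script that produced output.csv; the function name and the toy lists are made up for illustration, and `None` stands for a NULL result.

```python
from collections import Counter
from typing import List, Optional


def compare_predictions(a: List[Optional[str]], b: List[Optional[str]]) -> Counter:
    """Classify each pair of predictions as agree, missing, or conflict."""
    if len(a) != len(b):
        raise ValueError("both result lists must cover the same names")
    summary = Counter()
    for left, right in zip(a, b):
        if left is None or right is None:
            summary["missing"] += 1     # at least one API returned NULL
        elif left == right:
            summary["agree"] += 1
        else:
            summary["conflict"] += 1    # e.g. male vs female
    return summary


# Toy data, not the real experiment output.
gender_api = ["male", "female", "male", "male"]
genderize = ["male", "female", None, "female"]
summary = compare_predictions(gender_api, genderize)
print(summary["agree"], summary["missing"], summary["conflict"])  # 2 1 1
```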

Probabilities: Since both APIs report an accuracy/probability for each result, I graphed these values in order to compare them. For the most part, both APIs report around the same value. However, at times one API would report a higher percentage than the other, especially for the names that were harder to predict.

Here is the graph for the first 10 entries:

[Image: accuracies_0-10]

I have the output.csv, the remaining bar graphs for the accuracies, and the python scripts to formulate this information. Where in the repository should I share this information, or should I share it privately?

Hamedloghmani commented 1 year ago

Hello @gabrielrueda Thank you so much for your update and informative representation. Based on the plot, it seems like Genderize usually outperforms Gender API.

Please push your code as a .py file in the fair_team_formation/src/util directory. I'll examine the full results and we will shortly start labeling the whole DBLP dataset.

Thanks

gabrielrueda commented 1 year ago

@Hamedloghmani I just pushed my code.

Hamedloghmani commented 1 year ago

@gabrielrueda Thanks a lot. I'll review the code and the results asap.

gabrielrueda commented 1 year ago

@Hamedloghmani

Here are the observations that I mentioned earlier. They are some of the inconsistencies in how names are represented in the dataset.

Case 1) The dataset shows "DuarteCesar" but on dblp the actual name is "Cesar Duarte"
Case 2) The dataset shows "A AbramovSergei" but on dblp the actual name is "Sergei A. Abramov"
Case 3) The dataset shows "A A Aoude" but there are multiple results on dblp (no first name given, it seems)
Case 4) The dataset shows "M. Turunen" but M is just the first letter of the first name

I looked into a few names like case 1 and case 2 and there seems to be a pattern, that when there is no space between the names, the order for first name, last name is reversed. I was wondering if I could implement something in the code to detect that and assume case 1 and case 2. As for case 3 and 4, I was thinking I could just discard those.

My idea to deal with these cases:

Pass Through 1:

Create new json file(e.g. dblp_correctNames.json)
If the name is successful in finding first name using regular method (FIRSTNAME SPACE LASTNAME), that json row is copied to the dblp_correctNames.json file.
If the name follows case (1) or case (2), then the name will be modified and also copied to the dblp_correctNames.json file.
Otherwise, if the name follows case (3) or case (4), those will be copied to another json file (dblp_failed_to_parse_name.json), where they could possibly be used for future if needed.
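
As a rough illustration of how the four cases could be told apart, here is a sketch. It is not the code that was later pushed; the helper names are made up, and the rule "an uppercase letter directly after a lowercase letter marks the join" is a heuristic assumption drawn from cases 1 and 2.

```python
from typing import Optional, Tuple


def split_joined(token: str) -> Optional[Tuple[str, str]]:
    """Split 'DuarteCesar' into ('Duarte', 'Cesar') at the inner capital."""
    for i in range(1, len(token)):
        if token[i].isupper() and token[i - 1].islower():
            return token[:i], token[i:]
    return None


def is_initial(token: str) -> bool:
    """True for single letters such as 'A' or 'M.'."""
    return len(token.rstrip(".")) == 1


def extract_first_name(raw: str) -> Optional[str]:
    """Return the first name, or None when the name should be set aside."""
    tokens = raw.split()
    if not tokens:
        return None
    joined = split_joined(tokens[-1])
    if joined is not None:
        return joined[1]              # cases 1 and 2: LASTNAME + FIRSTNAME
    if len(tokens) >= 2 and not is_initial(tokens[0]):
        return tokens[0]              # regular FIRSTNAME SPACE LASTNAME
    return None                       # cases 3 and 4: only initials


for raw in ["Cesar Duarte", "DuarteCesar", "A AbramovSergei", "A A Aoude", "M. Turunen"]:
    print(raw, "->", extract_first_name(raw))
```

One known weakness of this heuristic: a regular last name with an inner capital, such as "McDonald", would be mistaken for a joined name, so a real implementation would need an extra check for that.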

Pass Through 2:

Will go through dblp_correctNames.json and record all unique names as originally intended. They will be recorded in an indexed pandas dataframe with empty columns of "Gender" and "Probability"

I will let you know when I implement this idea in code and I'll share the results.

Hamedloghmani commented 1 year ago

@gabrielrueda Thank you so much for your progress report.

As we discussed, your idea was great based on the observations that you had. Thank you, looking forward to seeing the implementation and results 😄

gabrielrueda commented 1 year ago

Hello @Hamedloghmani,

After running the filter on the initial dataset and then finding all the unique first names, the program found 275,859 unique first names in the filtered dataset (removing cases 3 & 4 as well as modifying cases 1 & 2).

I have made a pull request (#59) to include the code. I have also included the results as uniquenames_filtered.pkl and uniquenames_filtered.csv files (uniquenames_filtered.pkl is there to preserve the index I set up on the pandas dataframe).

I also have the json files for dblp_correctNames.json and dblp_failed_to_parse.json. I can send these privately since the files are too large to be pushed to git.

Let me know if you have any questions.

Hamedloghmani commented 1 year ago

Thank you so much for the update @gabrielrueda I'll go over your code and results today. We'll start labeling after I merge this request and refactor it. I'll keep you posted. You can upload the large results in Teams -> Adila -> DBLP Labeling Files Thanks.

gabrielrueda commented 1 year ago

Hello @Hamedloghmani, I filled in the table with the gender of each unique name. The table is stored as both a .pkl and a .csv, but use the .pkl to import the data into the program, as the uncommented lines in the main function do.

I basically completed two steps in order to obtain data:

Made HTTP requests to get the data from genderize and wrote the full output to text files. (I have the text files on my computer if you're interested) -> makeParallelAPIReqs()
Read from those text files and updated the values in the dataframe -> addGenderResultsFromFile()
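
For anyone reading along, the general shape of the first step (batching names and sending the batches in parallel) could look like the sketch below. This is not the code from the pull request: `fetch_all`, `chunked`, the batch size of 10, and the worker count are assumptions, and `fake_fetch` replaces the real HTTP call so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def chunked(items: Sequence, size: int) -> List[Sequence]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_all(
    names: Sequence[str],
    fetch_batch: Callable[[Sequence[str]], Dict[str, dict]],
    batch_size: int = 10,
    workers: int = 4,
) -> Dict[str, dict]:
    """Run `fetch_batch` over all batches in a thread pool and merge results."""
    results: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(fetch_batch, chunked(names, batch_size)):
            results.update(partial)
    return results


# Stand-in for the HTTP call so the sketch runs offline.
def fake_fetch(batch: Sequence[str]) -> Dict[str, dict]:
    return {name: {"gender": None, "probability": None} for name in batch}


print(len(fetch_all([f"name{i}" for i in range(25)], fake_fetch)))  # 25
```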

I left some things commented out at the bottom, since I obtained a chunk of the data (between records 90k and 160k) in a different way. For future use, however, the functions in the class should be used.

Changes are in pull request #61

Hamedloghmani commented 1 year ago

Hi @gabrielrueda

Thank you so much for the implementation. I merged your pull request. I believe we can proceed to the final phase and label the whole dataset since we have the gender for all the unique names. What do you think ?

gabrielrueda commented 1 year ago

@Hamedloghmani I think that we are ready to label the whole dataset. Also, some of the entries will result in NULL for gender/probability. Should we just filter out those entries when creating our new dataset?

Hamedloghmani commented 1 year ago

@gabrielrueda Great ! Yes, I believe we can filter them out.

gabrielrueda commented 1 year ago

@Hamedloghmani Also, would a structure like this: "gender": {"value": true, "probability": 0.97} for each author be good to represent the values in the dataset?

Example: {"id":1,"authors":[{"name":"Hinton","gender": {"value": true, "probability": 0.97},"org":"Shinshu University","id":1},{"name":"LeCun","gender": {"value": false, "probability": 0.87},"org":"Shinshu University","id":3}],"fos":[{"name":"Machine Learning","w":0.45139},{"name":"Image Captioning", "w":0.3241}],"title":"Preliminary Design of a Network Protocol Learning Tool Based on the Comprehension of High School Students: Design by an Empirical Study Using a Simple Mind Map","year":2000,"n_citation":1,"page_start":"89","page_end":"93","doc_type":"Conference","publisher":"Springer, Berlin, Heidelberg","volume":"","issue":"","doi":"10.1007/978-3-642-39476-8_19","references":[2005687710,2018037215],"indexed_abstract":{"IndexLength":58,"InvertedIndex":{"tool.":[42],"study":[4],"aim":[37],"purpose":[1],"scientific":[17],"for":[11],"aspects":[18],"students":[14,46],"focus":[27],"hands-on":[47],"learning":[9,41],"experience":[48],"our":[40],"we":[26],"network":[33,56],"The":[0],"More":[24],"high":[12],"protocols.":[57],"school":[13],"and":[21],"of":[2,19,32,55],"communication":[22],"protocols":[34],"gives":[45],"on":[28],"a":[8],"studying":[15],"specifically,":[25],"this":[3],"understand":[51],"is":[5],"develop":[7,39],"Our":[43],"tool":[10,44],"the":[16,29,36,52],"help":[50],"as":[35],"principles":[31,54],"information":[20],"networks.":[23],"to":[6,38,49],"basic":[30,53]}},"venue":{"raw":"International Conference on Human-Computer Interaction","id":1127419992,"type":"C"}}
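
A small sketch of how this structure could be attached to a record. It assumes a lookup table keyed by the author's name field and assumes that authors with a NULL result are dropped from the record; both are assumptions for illustration, as are the function name and the values in the toy table.

```python
import copy
from typing import Dict, Optional, Tuple

Label = Tuple[Optional[bool], Optional[float]]   # (value, probability)


def label_record(record: dict, labels: Dict[str, Label]) -> dict:
    """Return a copy of `record` keeping only authors with a known gender."""
    labelled = copy.deepcopy(record)             # leave the input untouched
    kept = []
    for author in labelled.get("authors", []):
        value, probability = labels.get(author["name"], (None, None))
        if value is None:
            continue                             # NULL result: skip the author
        author["gender"] = {"value": value, "probability": probability}
        kept.append(author)
    labelled["authors"] = kept
    return labelled


# Toy record and made-up labels, only to show the resulting shape.
record = {"id": 1, "authors": [{"name": "Hinton", "id": 1}, {"name": "Unknown", "id": 2}]}
labels = {"Hinton": (True, 0.97)}
print(label_record(record, labels))
```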

Hamedloghmani commented 1 year ago

@gabrielrueda Yes. I think that's great since we might need to use inferred genders with specific levels of confidence in some cases.

Thanks.

gabrielrueda commented 1 year ago

Hi @Hamedloghmani, I just wanted to let you know that I labelled the dataset and kept all the successful entries (entries whose names could successfully be assigned a gender). I will create a pull request for the code. Where should I upload the labelled dataset (since it's too big to upload to GitHub)?

Hamedloghmani commented 1 year ago

Hello @gabrielrueda Thank you so much for the update. I'll go over your pull request tonight. Please upload them in MS Teams -> Adila -> DBLP Labeling Files You can create another folder inside this directory if you want. I can reformat it later, no worries.

Let me know if you run into any issues. Thanks

gabrielrueda commented 1 year ago

@Hamedloghmani The dataset finished uploading. It should be in MS Teams -> Adila -> DBLP Labeling Files -> FinalGenderLabelledDataset. Let me know if you are unable to see/access it.

Thanks

Hamedloghmani commented 1 year ago

@gabrielrueda Thank you Gabriel. I just checked and I got access to it.

gabrielrueda commented 1 year ago

Hi @Hamedloghmani , I finished labelling the dataset, but I filtered some of the members from the name.basics.tsv file. Since I filtered out the names, should I update the title.basics.tsv and title.principals.tsv files to remove the entries that include the names that I filtered out in the labelling process?

Hamedloghmani commented 1 year ago

Hello @gabrielrueda Thank you so much for the update. You can consider doing that if it is not too time consuming. In the next steps we would discuss mapping the names from the OpeNTF outputs to the raw dataset and also a policy for missing names. I will schedule a meeting to talk about that soon if you want.

gabrielrueda commented 1 year ago

Hello @Hamedloghmani, It shouldn't be time consuming to filter it out in the other files, so I'll do that and let you know when I complete it. As for the next task, yes it would be great to schedule a meeting for that.

gabrielrueda commented 1 year ago

Hi @Hamedloghmani, I finished filtering the entries in title.basics.tsv and title.principals.tsv in order to match what was filtered in the name.basics.tsv file. These 3 files are uploaded to the "IMDB Labelling Files" folder. I'll make a pull request tomorrow for the code; I just have to add comments and rename some of the functions.

Hamedloghmani commented 1 year ago

Hello @gabrielrueda Thanks a lot for the update. Please make the pull request on the main branch this time, including your previous implementations for dblp.

Thank you

gabrielrueda commented 1 year ago

Hi @Hamedloghmani, I just created the pull request for the code.

Thanks, Gabriel

Hamedloghmani commented 1 year ago

Hello @gabrielrueda I merged your pull request. Thanks a lot.

gabrielrueda commented 1 year ago

Hello @Hamedloghmani,

Here is the approach I will take to map the indexes from OpeNTF to the raw dataset:

Part 1:

In the indexes.pkl there is the 'i2c' and 'c2i' dictionaries:

Example:

'i2c': {
    9 : '54977.0_Reginald_Barker'
}

'c2i': {
    '54977.0_Reginald_Barker' : 9
}

The string '54977.0_Reginald_Barker' includes the member id from the raw dataset before the '.'. In this case it would be 54977.

Part 2:

However, in the name.basics.tsv file, the member id would be listed as 'nm0054977'.

The member id has the layout of "nm" + 7 digits.

e.g. 54977 would be 'nm0054977'

(add 0's in unassigned digits)
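
The conversion from Part 1 to Part 2 can be sketched as a small helper. The function name is made up for illustration; the format handling follows the example above.

```python
def to_imdb_member_id(candidate: str) -> str:
    """Convert '54977.0_Reginald_Barker' to 'nm0054977'."""
    raw_id = candidate.split(".", 1)[0]      # digits before the first '.'
    if not raw_id.isdigit():
        raise ValueError(f"unexpected candidate format: {candidate!r}")
    return "nm" + raw_id.zfill(7)            # left-pad with zeros to 7 digits


print(to_imdb_member_id("54977.0_Reginald_Barker"))  # nm0054977
```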

Part 3: The mapping:

1) Loop through the 'c2i' dictionary and create a new dictionary with the member id layout described in Part 2: {memberID: opeNTF_output_index}
2) Create a pandas dataframe (make sure to set opeNTF_output_index as the index):

opeNTF_output_index	gender	probability
(integer)	(true/false/null)	(double from 0.0 to 1.0 or null)

true = male, false = female

3) Loop through name.basics_labelled.tsv. For each member id, find the index using the dictionary created in step 1. Add the opeNTF_output_index and the gender/probability from the file to the new dataframe.
4) Then, loop through the list of keys from 'i2c'. If an opeNTF_output_index is not in the pandas dataframe, then add it with gender and probability both set to null.
5) Export the dataframe as .csv and .pkl for future use.
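
To show the intended result of the mapping, here is a compact sketch that uses plain dictionaries instead of a pandas dataframe and collapses the loops into a single pass over 'i2c'. It is an illustration under those assumptions, not the planned implementation; the `labelled` dictionary stands in for the contents of name.basics_labelled.tsv, and its values are made up.

```python
from typing import Dict, Optional, Tuple

Label = Tuple[Optional[bool], Optional[float]]   # (gender, probability)


def build_gender_mapping(
    i2c: Dict[int, str],
    labelled: Dict[str, Label],
) -> Dict[int, Label]:
    """Map every OpeNTF index to (gender, probability), or (None, None)."""
    mapping: Dict[int, Label] = {}
    for index, candidate in i2c.items():
        # '54977.0_Reginald_Barker' -> 'nm0054977'
        member_id = "nm" + candidate.split(".", 1)[0].zfill(7)
        # Members missing from the labelled file get null gender/probability.
        mapping[index] = labelled.get(member_id, (None, None))
    return mapping


i2c = {9: "54977.0_Reginald_Barker", 10: "123.0_Unknown_Person"}
labelled = {"nm0054977": (True, 0.99)}   # made-up values; true = male
print(build_gender_mapping(i2c, labelled))
# {9: (True, 0.99), 10: (None, None)}
```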

Hamedloghmani commented 1 year ago

Hi @gabrielrueda Thank you so much for your report. The approach makes sense as we discussed. You can proceed with the implementation.

Thanks

gabrielrueda commented 1 year ago

Hi @Hamedloghmani, I finished the implementation. Should I make a pull request for my implementation to the dev or main branch?

Thanks, Gabriel

Hamedloghmani commented 1 year ago

Hi @gabrielrueda Thank you so much for the update. Make it to the main branch please.

Thanks

gabrielrueda commented 1 year ago

Hello @Hamedloghmani,

I just wanted to let you know that I changed the true/false in the dataset to M/F, as well as updated the results in mapping table for IMDB. I also created a mapping table for DBLP.

In the pull request I updated mappingGender.py and labelDataset.py with the changes needed. I have also included a file called changeDataset.py, which I only used as a temporary way to change the values from true/false to M/F.

Finally, the updated datasets and mapping tables will be uploaded to the Teams Adila channel in folders of DBLP Labelling Files/Gender Mappings and IMDB Labelling Files/Gender Mappings (I'll let you know when they finish uploading).

fani-lab / Adila