fangzhou-xie / rethnicity

predict ethnicity from names
https://fangzhou-xie.github.io/rethnicity/index.html

Issue and Accuracy on European Names #2

Open MrMatsson opened 2 years ago

MrMatsson commented 2 years ago

Hi. It's my first time downloading R code from GitHub. I installed the package the usual way with install.packages("rethnicity"), added the code in the picture below, ran it all, and then tried to run predict_ethnicity(firstnames = "Alan", lastnames = "Turing"). As you can see in the console, I've clearly just run "predict_fullname_ccp" and, earlier, "predict_fullname", yet it claims it can't find the object '_rethnicity_predict_fullname'. What have I missed?

[three screenshots]

fangzhou-xie commented 2 years ago

The package is not intended to be used this way. If you want to know how to use the package, please check out the documentation (https://fangzhou-xie.github.io/rethnicity/index.html). Besides, you don't need to "download" code from GitHub and modify the source code. All you need to do is call the rethnicity::predict_ethnicity function after installation.
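A minimal sketch of that workflow, using only the argument names that appear in this thread (firstnames, lastnames, method). The requireNamespace guard is an addition for illustration, so the snippet does nothing when the package is not installed:

```r
# install.packages("rethnicity")  # run once, as in the original post

res <- NULL
if (requireNamespace("rethnicity", quietly = TRUE)) {
  # Call the exported function directly; no source files need to be
  # downloaded or edited
  res <- rethnicity::predict_ethnicity(
    firstnames = "Alan",
    lastnames  = "Turing",
    method     = "fullname"
  )
  print(res)
}
```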

It seems to me that you might want to learn how to use R properly before using this package. If this is indeed the case, you can try reading the book R for Data Science as a starter.

MrMatsson commented 2 years ago

Thank you. It seems that I didn't have admin rights on my new work computer, so the package couldn't load properly during "library(rethnicity)". Fixed now.

I currently face a problem when threading. (New to R overall so would be more than thankful if you can spot my error).

When I run the code on one individual at a time it works without a problem, but when I try it on my whole "Test dataset" it gives the exact same probability to all individuals.

Thank you for your time.

[four screenshots]

fangzhou-xie commented 2 years ago

Please try to create a minimal reproducible example next time (with data and code). It could be difficult to help just by looking at pictures. Anyway, I can reproduce what you have in the following example.

d <- data.frame(fn = c("Adam", "Alan"), ln = c("Johnson", "Turing"))

# wrong way
rethnicity::predict_ethnicity(d["fn"], d["ln"], "fullname")

# correct way
rethnicity::predict_ethnicity(d$fn, d$ln, "fullname")

Notice that the bracket notation subsets the data frame into a smaller data frame, whereas the dollar notation returns a vector. The predict_ethnicity function requires vectors as inputs, not data frames. This has nothing to do with threading.
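The difference can be checked in base R without the package. This is a small sketch: is_name_vector is a hypothetical helper written for this illustration, not part of rethnicity.

```r
# Same toy data as above; stringsAsFactors = FALSE keeps the columns as
# character vectors on R versions older than 4.0 as well
d <- data.frame(fn = c("Adam", "Alan"), ln = c("Johnson", "Turing"),
                stringsAsFactors = FALSE)

# Single brackets keep the data frame wrapper
class(d["fn"])      # "data.frame"

# Dollar sign and double brackets extract the column itself
class(d$fn)         # "character"
class(d[["fn"]])    # "character"

# Hypothetical check to run before calling a function that expects vectors
is_name_vector <- function(x) is.character(x) && !is.data.frame(x)
is_name_vector(d["fn"])  # FALSE
is_name_vector(d$fn)     # TRUE
```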

One side remark: it seems that you are using the package for European names. As mentioned in my papers and in issue #1, the predicted results may not be reliable, since the model was trained on a US dataset (and is hence only applicable to US names).

MrMatsson commented 2 years ago

Thank you so much! That was indeed the problem. Sorry for posting a new issue as well. I'm new to GitHub and couldn't find the old post until you linked to it. I will refrain from doing so in future posts.

Good news! Your classifier managed to correctly identify 92 % of individuals in my dataset and 79 % of Europeans in my dataset. Worth taking into account that the sample I tried only consists of 1000 individuals of which only 200 were European.

Well done. Out of the 7 different "ethnicity from name classifiers" I've tried, yours is the only one to score anything above 45 % in Northern EU. 79 % is amazing. Good job and thanks for the help.

fangzhou-xie commented 2 years ago

Glad to help. Thank you for sharing your result with me as well! It is good to know that this package could be generalized to European names as well.

fangzhou-xie commented 2 years ago

I will reopen this issue (similar to #1), hoping others will benefit from reading this (especially if one wants to analyze European names).

MataEE commented 2 years ago

Hi, I posted in #1 regarding the same issue and have only now come to realize that it must have been a problem with the code I wrote. I wanted to kindly ask whether you could point me to the problem, as I cannot find it myself (even with the help above).

My initial code was:

# predict ethnicity from multiple first and last names / all names of dataset and save as d1 dataframe
firstnames<- read.csv(file="Datensatz_Namen_Regie_Onolytics_Firstnames.txt",  header=TRUE)
firstnames$FIRSTNAME

lastnames<- read.csv(file="Datensatz_Namen_Regie_Onolytics_Lastnames.txt",  header=TRUE)
lastnames$LASTNAME

df1 <- predict_ethnicity(firstnames = firstnames, lastnames = lastnames, method = "fullname")

The result is similar to that of MrMatsson, with all names returned as "asian".

If I instead add the suggestion of fangzhou-xie, I receive the following error:

# predict ethnicity from multiple first and last names / all names of dataset and save as d1 dataframe
firstnames<- read.csv(file="Datensatz_Namen_Regie_Onolytics_Firstnames.txt",  header=TRUE)
firstnames$FIRSTNAME

lastnames<- read.csv(file="Datensatz_Namen_Regie_Onolytics_Lastnames.txt",  header=TRUE)
lastnames$LASTNAME

rethnicity::predict_ethnicity(d$fn, d$ln, "fullname")

Error in rethnicity::predict_ethnicity(d$fn, d$ln, "fullname") : You must provide both 'firstnames' and 'lastnames' arguments!

I am sure this is a simple problem for people knowing how to use R, but I am fairly new to R and totally lost. Help would be much appreciated! Cheers!

fangzhou-xie commented 2 years ago

Yes, you have the same problem in the code. You have to understand the difference between "dataframe" and "vector" in R.
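As a rough illustration of why: read.csv returns a data frame even when the file has a single column, and writing firstnames$FIRSTNAME on a line of its own only prints the column without changing firstnames. The temporary file and its contents below are made up so that the sketch is self-contained.

```r
# Write a tiny one-column file (contents are invented for illustration)
tmp <- tempfile(fileext = ".txt")
writeLines(c("FIRSTNAME", "Adam", "Alan"), tmp)

firstnames <- read.csv(file = tmp, header = TRUE, stringsAsFactors = FALSE)

class(firstnames)            # "data.frame", even with a single column
class(firstnames$FIRSTNAME)  # "character"

# Printing a column does not change the object it came from
firstnames$FIRSTNAME
class(firstnames)            # still "data.frame"

unlink(tmp)
```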

Do this:

# predict ethnicity from multiple first and last names / all names of dataset and save as d1 dataframe
firstnames<- read.csv(file="Datensatz_Namen_Regie_Onolytics_Firstnames.txt",  header=TRUE)

lastnames<- read.csv(file="Datensatz_Namen_Regie_Onolytics_Lastnames.txt",  header=TRUE)

df1 <- predict_ethnicity(firstnames = firstnames$FIRSTNAME, lastnames = lastnames$LASTNAME, method = "fullname")

MataEE commented 2 years ago

Thank you so much! That worked out fine and I now get what is meant by vectors and frames, starting at the beginning of the code already.

Unfortunately, I cannot provide any further information (like MrMatsson did) about the accuracy of the tool, because we have no data available to cross-check the assigned categories asian, black, white and hispanic. We might run an additional, manual coding analysis, and I will make sure to report on the accuracy as soon as possible.

fangzhou-xie commented 2 years ago

No problem. Glad to help.

There is no need to report accuracy, but if you do, I would be very grateful. My paper shows an accuracy of around 70% on US names, and @MrMatsson showed 79% on the European dataset. It would be nice to know how the model performs, and this will eventually help other people as well.
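For anyone who does have manually coded labels, a minimal sketch of how accuracy could be computed in base R follows. The data frame, its column names (predicted, truth, region) and all values are made up for illustration and are not results from any real dataset.

```r
# Invented labels for illustration only
eval_df <- data.frame(
  predicted = c("white", "asian", "white", "black", "white", "hispanic"),
  truth     = c("white", "asian", "black", "black", "white", "white"),
  region    = c("EU", "US", "US", "US", "EU", "EU"),
  stringsAsFactors = FALSE
)

# TRUE where the predicted category matches the manually coded one
correct <- eval_df$predicted == eval_df$truth

# Overall share of correct predictions
overall_accuracy <- mean(correct)

# Share of correct predictions within each region
accuracy_by_region <- tapply(correct, eval_df$region, mean)

overall_accuracy
accuracy_by_region
```

Reporting the subgroup sizes alongside the percentages, as MrMatsson did above, makes the numbers easier to interpret.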