fangzhou-xie / rethnicity

predict ethnicity from names
https://fangzhou-xie.github.io/rethnicity/index.html
interpretation of results on individuals #1

Open MataEE opened 2 years ago

MataEE commented 2 years ago

Hello,

I am excited to be trying out the rethnicity tool on my dataset, which contains around 340 names of directors whose films were released in Germany in the past ten years. Since I am pretty new to R, I am not sure whether I simply made a mistake in my code or whether the tool is not suitable for the names in the dataset. With method="fullname", every name gets its highest prediction for “asian”, whereas with method="lastname" every name gets its highest prediction for “white”. When I manually predict only two names from the dataset, both are returned with the highest prediction for “black”. Has anyone encountered this problem with datasets that are not US-based?

Cheers, Mata

fangzhou-xie commented 2 years ago

Thanks for your interest in this package.

In my view, no method for predicting ethnicity or race from names can be made 100% accurate (not even close), and a nontrivial proportion of names will be misclassified. The ideal usage of this package (and of other packages in this field) is to take a relatively large dataset and use the predicted ethnicity as a regression covariate, rather than interpreting each individual prediction.
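
A minimal sketch of what aggregate use could look like in R. It assumes the prediction data frame has the columns prob_asian, prob_black, prob_hispanic and prob_white; the helper sample_shares and the toy probabilities are hypothetical and only illustrate the idea. They are not part of the package and not real model output.

```r
# Sketch: summarise predictions over the whole sample instead of reading
# them name by name.
# Assumption: predictions come as a data frame with these probability columns.
prob_cols <- c("prob_asian", "prob_black", "prob_hispanic", "prob_white")

# Hypothetical helper (not part of rethnicity): average predicted
# probability per class across all rows.
sample_shares <- function(pred) {
  colMeans(pred[, prob_cols, drop = FALSE])
}

# Made-up probabilities for three names, for illustration only.
toy_pred <- data.frame(
  prob_asian    = c(0.10, 0.70, 0.05),
  prob_black    = c(0.20, 0.10, 0.15),
  prob_hispanic = c(0.10, 0.10, 0.60),
  prob_white    = c(0.60, 0.10, 0.20)
)
shares <- sample_shares(toy_pred)
print(round(shares, 3))

# If the package is installed, real predictions can be summarised the same way.
if (requireNamespace("rethnicity", quietly = TRUE)) {
  pred <- rethnicity::predict_ethnicity(
    firstnames = c("Alan", "Maria"),
    lastnames  = c("Turing", "Garcia"),
    method     = "fullname"
  )
  print(sample_shares(pred))
}
```

The same probability columns could then be joined to the outcome data and passed to a regression as covariates, which is less sensitive to any single misclassified name than a per-person label.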

In your case, I think you would be better off collecting the ethnicity of the German film directors manually, given that it is a small sample. Moreover, the models are trained on US data, and I suppose predictions may behave differently for names from other countries.

I will leave this issue open so that others can find it in the future.

MrMatsson commented 2 years ago

Thanks for the wonderful package. I too have a dataset of European names, in my case Nordic ones. Have you perhaps heard of any script that works for names from the EU? I have been searching far and wide without success. I am going to try out your script this week and see whether its accuracy rate is decent.