CenterForAssessment / randomNames

Function to generate random gender and ethnicity correct first and/or last names. Names are chosen proportionally based upon their probability of appearing in a large scale data base of real names.
https://centerforassessment.github.io/randomNames

`sample.with.replacement = FALSE` across ethnicities/ genders #55

Closed by erleholgersen 6 years ago

erleholgersen commented 6 years ago

I have a possibly annoying feature request: would it be possible to make sample.with.replacement = FALSE work across ethnicities/genders?

I wanted a list of randomly generated, unique names, but had to use a workaround with unique() to get it to work.

library(randomNames)
set.seed(7)

# expected unique names, but some are duplicated
random_names <- randomNames(100, which.names = 'first',
                            sample.with.replacement = FALSE)
any(duplicated(random_names))
#> [1] TRUE

# by contrast, it works for a single ethnicity/gender
unique_random_names <- randomNames(100, which.names = 'first',
                                   sample.with.replacement = FALSE,
                                   ethnicity = 1, gender = 1)
any(duplicated(unique_random_names))
#> [1] FALSE

dbetebenner commented 6 years ago

Hello. This is a tough one. As your example shows, the sampling is done at the ethnicity-by-gender level, so names are unique only within each such group. The problem is that several first names appear in the data sets for both genders and for multiple ethnicities.
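To make the effect concrete, here is a minimal base R sketch. The two name pools are invented for illustration and are not taken from the randomNames data; the point is only that sampling without replacement inside each group can still repeat a name once the groups are combined.

```r
# Toy name pools (invented for illustration, not the randomNames data sets).
pools <- list(
  group_a = c("Alex", "Jordan", "Maria"),
  group_b = c("Alex", "Jordan", "Liam")
)

# Sample without replacement inside each group separately.
draws <- lapply(pools, function(pool) sample(pool, length(pool), replace = FALSE))

# Each group is duplicate-free on its own ...
within_group_dupes <- vapply(draws, function(x) any(duplicated(x)), logical(1))

# ... but the combined vector repeats the names that the groups share.
combined <- unlist(draws, use.names = FALSE)
across_group_dupes <- any(duplicated(combined))
```

Here `within_group_dupes` is `FALSE` for both groups while `across_group_dupes` is `TRUE`, because "Alex" and "Jordan" sit in both pools.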

I could add an additional check for uniqueness and, if the names aren't unique, try to add in unique names. The only problem with this (which would require another check) is that if the number of names requested exceeds the number of names I have in the data sets, it's impossible to return unique names.
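One way such a check could look is sketched below. This is not the package's implementation: `top_up_unique`, `generator`, `pool_size`, and `max_rounds` are hypothetical names, and the sketch assumes `generator(k)` returns exactly `k` names and that the number of distinct names available is known or left at `Inf`.

```r
# Hypothetical helper, not part of randomNames.
# `generator(k)` is assumed to return k names; `pool_size` is the number of
# distinct names the generator can produce, if known.
top_up_unique <- function(n, generator, pool_size = Inf, max_rounds = 50) {
  # Guard: uniqueness is impossible when more names are requested than exist.
  if (n > pool_size) {
    stop("Requested ", n, " unique names but only ", pool_size, " are available.")
  }
  result <- unique(generator(n))
  rounds <- 0
  # Top up: draw only as many names as are still missing, then de-duplicate.
  while (length(result) < n && rounds < max_rounds) {
    missing_count <- n - length(result)
    result <- unique(c(result, generator(missing_count)))
    rounds <- rounds + 1
  }
  if (length(result) < n) {
    stop("Gave up after ", max_rounds, " top-up rounds.")
  }
  result[seq_len(n)]
}

# Possible usage, assuming the randomNames package is installed:
# top_up_unique(100, function(k) randomNames::randomNames(k, which.names = "first"))
```

The `max_rounds` cap is a design choice for the sketch: it keeps the loop from running indefinitely when the true pool size is unknown and smaller than `n`.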

I'd probably add a different argument to the function to ensure that unique names are returned.

Would that help?

erleholgersen commented 6 years ago

Yeah, I figured it might be annoying! A different argument for unique names would work perfectly for my use-case at least. (And in fairness, so does generating more names than needed on my end and subsetting out the unique names – so don't feel like this is an essential addition!)
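The oversample-and-subset workaround mentioned here might be written roughly as follows. `oversample_unique` and its `oversample` argument are hypothetical names for this sketch, not part of randomNames, and the factor of 2 is an arbitrary starting point.

```r
# Hypothetical helper, not part of randomNames: draw extra names once,
# drop duplicates, and keep the first n.
oversample_unique <- function(n, generator, oversample = 2) {
  candidates <- unique(generator(ceiling(n * oversample)))
  if (length(candidates) < n) {
    stop("Only ", length(candidates), " unique names drawn; try a larger `oversample`.")
  }
  candidates[seq_len(n)]
}

# Usage with randomNames, run only if the package is installed.
if (requireNamespace("randomNames", quietly = TRUE)) {
  first_names <- oversample_unique(
    100,
    function(k) randomNames::randomNames(k, which.names = "first")
  )
}
```

Unlike a top-up loop, this draws only once, so it fails with an error rather than retrying when too few unique names come back.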