CenterForAssessment / randomNames

Function to generate random first and/or last names that match a given gender and ethnicity. Names are chosen in proportion to their probability of appearing in a large-scale database of real names.
https://centerforassessment.github.io/randomNames

Random Error with large samples without replacement #68

Closed: ratnanil closed this issue 3 years ago

ratnanil commented 4 years ago

I want to generate many (9000) unique names to replace human-unfriendly UUIDs. My plan was to extend the random name generator so that it can produce an unlimited number of unique names, simply by appending an integer once the maximum number of unique random names is reached.
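The suffixing idea can be sketched with base R alone. This is only a minimal illustration, assuming that appending an integer via `make.unique()` is acceptable; the helper name `make_names_unique` and the hard-coded `pool` vector are hypothetical stand-ins and not part of randomNames.

```r
# Hypothetical helper: make a vector of names unique by appending an
# integer suffix to repeated values. Uses only base R's make.unique().
make_names_unique <- function(names, sep = "_") {
  make.unique(as.character(names), sep = sep)
}

# Small hard-coded vector standing in for the output of randomNames().
pool <- c("Smith, Anna", "Lee, Jun", "Smith, Anna", "Lee, Jun", "Smith, Anna")
ids <- make_names_unique(pool)
print(ids)
```

The first occurrence of each name is left unchanged and later repeats receive `_1`, `_2`, and so on, so the same approach could be applied to names drawn with replacement.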

While writing the function, I realised that I can't find a hard upper limit on the number of names I can generate without replacement: the error occurs at different values of n.

In the example below, the function returns an error on the first run but succeeds on the second, third, and fourth tries.

Can you elaborate on this?

library(randomNames)
#> Warning: package 'randomNames' was built under R version 3.6.3

set.seed(1)
one <- randomNames(5000, sample.with.replacement = FALSE)
#> Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population
#>  when 'replace = FALSE'
two <- randomNames(5000, sample.with.replacement = FALSE)
thr <- randomNames(5000, sample.with.replacement = FALSE)
fou <- randomNames(5000, sample.with.replacement = FALSE)

Created on 2020-03-18 by the reprex package (v0.3.0)

ratnanil commented 4 years ago

I'm realising that randomNames might be the wrong package for generating unique names. The babynames package has 97,310 unique first names that are easily accessible.

dbetebenner commented 4 years ago

Hi,

Setting sample.with.replacement=FALSE can cause this issue, especially with some of the ethnic subgroups for which the database doesn't have a large set of names (e.g., Arabic).

When you request 5000 random names, the function randomly selects them from each ethnic group and each sex. In the failing run, the function couldn't find enough distinct Arabic first names, which led to the error.
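The underlying base R behaviour can be reproduced on a small scale. The following is an illustrative sketch only: `small_pool` is a made-up three-element vector standing in for a sparsely populated subgroup, not data from the randomNames database.

```r
# A pool of three placeholder names stands in for a small subgroup.
small_pool <- c("name_a", "name_b", "name_c")

# Asking for more draws than the pool holds, without replacement,
# raises an error in base R; tryCatch() captures it instead of stopping.
result <- tryCatch(
  sample(small_pool, size = 5, replace = FALSE),
  error = function(e) e
)
is_error <- inherits(result, "error")
print(is_error)

# Asking for no more than the pool size succeeds.
ok <- sample(small_pool, size = 3, replace = FALSE)
print(ok)
```

The text of the error message depends on the R session's locale, which is why the sketch checks the condition class rather than the message.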

If you just want a bunch of random names, one thing you can do is use unique() with randomNames(). Say you want 100,000 random names.

my.names <- unique(randomNames(150000))[1:100000]

I asked for more than 100000, knowing there might be a few duplicates, and then took just the first 100000.
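One caveat with a fixed oversample: if fewer unique names come back than requested, indexing with `[1:100000]` pads the result with `NA`. A possible workaround, sketched here under assumptions, is to keep drawing batches until enough unique values have been collected. The helper `draw_unique`, its arguments `oversample` and `max_tries`, and the stand-in generator `toy_draw` are all hypothetical and not part of randomNames; in real use `draw` could be something like `function(k) randomNames::randomNames(k)`.

```r
# Hypothetical helper: draw batches until `n` unique values are collected
# or `max_tries` batches have been drawn.
# `draw` is any function taking a batch size and returning a character vector.
draw_unique <- function(n, draw, oversample = 1.5, max_tries = 20) {
  found <- character(0)
  for (i in seq_len(max_tries)) {
    needed <- n - length(found)
    if (needed <= 0) break
    # Request somewhat more than still needed to allow for duplicates.
    batch <- draw(ceiling(needed * oversample))
    found <- unique(c(found, batch))
  }
  if (length(found) < n) {
    stop("Only ", length(found), " unique values found; requested ", n)
  }
  found[seq_len(n)]
}

# Stand-in generator so the sketch runs without the randomNames package:
# it samples with replacement from 500 made-up labels.
set.seed(1)
toy_draw <- function(k) sample(paste0("name_", 1:500), k, replace = TRUE)
my.names <- draw_unique(300, toy_draw)
print(length(my.names))
```

Failing with an explicit error when the pool is exhausted seems safer than silently returning `NA` entries, though the right limit for `max_tries` is a judgment call.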

Hope that helps. Sorry for the slow reply.