CenterForAssessment / randomNames

Function to generate random gender and ethnicity correct first and/or last names. Names are chosen proportionally based upon their probability of appearing in a large scale data base of real names.
https://centerforassessment.github.io/randomNames

Uninformative error message when exhausting names #83

Open joshwlambert opened 7 months ago

joshwlambert commented 7 months ago

When randomNames() runs out of names (with sample.with.replacement = FALSE), it fails with an uninformative error message about sampling. It would be great if the {randomNames} package could give the user a custom, informative error message when the requested number of names is too large. This message could also suggest setting sample.with.replacement to TRUE.

Here is a reprex that demonstrates the problem:

library(randomNames)
set.seed(1)
gender <- rep(c("M", "F"), 2525)
names <- randomNames::randomNames(
    which.names = "both",
    name.sep = " ",
    name.order = "first.last",
    gender = gender,
    sample.with.replacement = FALSE
)
str(names)
#>  chr [1:5050] "Sebastian Clayton" "Melisa White" "Eli Jackson" "Malisse Ha" ...

gender <- rep(c("M", "F"), 3000)
names <- randomNames::randomNames(
    which.names = "both",
    name.sep = " ",
    name.order = "first.last",
    gender = gender,
    sample.with.replacement = FALSE
)
#> Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'

Created on 2024-01-18 with reprex v2.0.2
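
The kind of check requested above could be sketched roughly as follows. This is only an illustration, not code from {randomNames}: the function name `check_sample_size()` and its `n_available` argument are hypothetical, and the number 2600 is made up, since the thread does not say how many names the package's internal data holds.

```r
# Hypothetical guard (not part of {randomNames}); `n_available` stands in for
# however many distinct names the internal data holds for the requested group.
check_sample_size <- function(n_requested, n_available) {
  if (n_requested > n_available) {
    stop(
      "Requested ", n_requested, " names but only ", n_available,
      " are available without replacement. ",
      "Request fewer names or set `sample.with.replacement = TRUE`.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Capture the message instead of aborting, to show what the user would see.
# The value 2600 is an invented pool size used only for illustration.
msg <- tryCatch(
  check_sample_size(n_requested = 3000, n_available = 2600),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Running such a check before any internal call to `sample()` would replace the low-level `sample.int()` error with a message that names the relevant argument.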

dbetebenner commented 7 months ago

Thank you for the comment.

I will think about how to add better messaging for the circumstance you provide.

If you are interested in getting a longer list of unique first.last name combinations, you can change sample.with.replacement = TRUE and then select out the unique combinations that occur.

The error you report probably occurs because the internal data doesn't contain enough female or male first names to sample that many without replacement. Because the package builds combinations of first and last names, there are probably millions of possible unique combinations.
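
A toy illustration of that explanation, using base R only. The pools `first` and `last` are invented stand-ins for the package's internal data, and their sizes are arbitrary. Drawing more first names than the pool holds fails without replacement, even though far more first/last combinations exist.

```r
# Invented pools; sizes are arbitrary and not taken from {randomNames}.
first <- paste0("First", 1:100)
last  <- paste0("Last", 1:100)

# Asking for 150 of 100 first names without replacement reproduces the
# low-level error from the reprex.
err <- tryCatch(
  sample(first, 150, replace = FALSE),
  error = function(e) conditionMessage(e)
)
print(err)

# The number of possible first/last combinations is much larger than 150.
n_combinations <- length(first) * length(last)

# Sampling each part with replacement and keeping the unique pairs works.
set.seed(1)
pairs <- unique(paste(
  sample(first, 150, replace = TRUE),
  sample(last, 150, replace = TRUE)
))
```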

To get 25,000 first/last name combinations you could do the following:

gender <- rep(c("M", "F"), 15000)
names <- randomNames::randomNames(
    which.names = "both",
    name.sep = " ",
    name.order = "first.last",
    gender = gender,
    sample.with.replacement = TRUE
)

unique_names <- head(unique(names), 25000)

I asked for 30,000 names to begin with to make sure I had 25,000 uniques.

I've considered how to add this little trick for creating LONG lists of names, but haven't quite figured out how to put this into the package well.
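
One possible shape for such a helper, offered only as a sketch: it accumulates unique draws across attempts and gives up with a clear error after a fixed number of tries. The names `sample_unique()`, `draw`, `buffer_factor`, and `max_tries` are all hypothetical and not part of {randomNames}. `toy_draw()` is a stand-in generator so the example runs without the package.

```r
# Hypothetical helper: collect `n` unique values from repeated draws.
# `draw` is any function that takes a sample size and returns a character
# vector, e.g. a wrapper around randomNames::randomNames() called with
# sample.with.replacement = TRUE.
sample_unique <- function(n, draw, buffer_factor = 1.5, max_tries = 20L) {
  stopifnot(n >= 0, buffer_factor >= 1, max_tries >= 1)
  out <- character(0)
  tries <- 0L
  while (length(out) < n) {
    if (tries >= max_tries) {
      stop(
        "Only ", length(out), " unique values found after ", max_tries,
        " tries; fewer than the ", n, " requested may exist.",
        call. = FALSE
      )
    }
    n_missing <- n - length(out)
    # Oversample the shortfall, then keep only values not seen before.
    out <- unique(c(out, draw(ceiling(n_missing * buffer_factor))))
    tries <- tries + 1L
  }
  out[seq_len(n)]
}

# Stand-in generator with an invented pool of 500 values.
toy_draw <- function(size) {
  sample(paste("name", 1:500), size, replace = TRUE)
}

set.seed(1)
res <- sample_unique(200, toy_draw)
```

Keeping earlier draws instead of discarding them means each retry only has to cover the shortfall, and the `max_tries` cap prevents an endless loop when more unique values are requested than exist.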

joshwlambert commented 7 months ago

Thanks for the response. I hadn't realised that sample.with.replacement = TRUE had a higher capacity for unique names. The suggestion of oversampling and then subsetting out the unique names worked well for my case. Here is a function I put together for the {simulist} package, which uses {randomNames}. Feel free to use some of this code if it would be useful for {randomNames}.

#' Sample names using [randomNames::randomNames()]
#'
#' @description
#' Sample names for specified genders by sampling with replacement, to avoid
#' exhausting the number of names available when
#' `sample.with.replacement = FALSE`. Duplicated names are removed after
#' sampling to ensure each individual has a unique name. In order to have
#' enough unique names, more names than required are sampled from
#' [randomNames()], and the level of oversampling is determined by the
#' `buffer_factor` argument. If `buffer_factor` is too high, more names are
#' sampled than needed, which takes longer; if it is too low, not enough
#' unique names are sampled and `.sample_names()` has to loop until it has
#' enough unique names.
#'
#' @inheritParams .add_date
#' @param buffer_factor A single `numeric` determining the level of
#' oversampling (or buffer) when creating a vector of unique names from
#' [randomNames()].
#'
#' @return A `character` vector.
#' @keywords internal
.sample_names <- function(.data,
                          buffer_factor = 1.5) {
  m_idx <- .data$gender == "m"
  f_idx <- .data$gender == "f"
  num_m <- sum(m_idx)
  num_f <- sum(f_idx)
  num_sample_m <- ceiling(num_m * buffer_factor)
  num_sample_f <- ceiling(num_f * buffer_factor)

  # create sample of names so there are no duplicates
  names_m <- character(0)
  while(length(names_m) < num_m) {
    names_m <- unique(
      randomNames::randomNames(
        which.names = "both",
        name.sep = " ",
        name.order = "first.last",
        gender = rep("M", num_sample_m),
        sample.with.replacement = TRUE
      )
    )
  }

  names_f <- character(0)
  while(length(names_f) < num_f) {
    names_f <- unique(
      randomNames::randomNames(
        which.names = "both",
        name.sep = " ",
        name.order = "first.last",
        gender = rep("F", num_sample_f),
        sample.with.replacement = TRUE
      )
    )
  }

  # subset to use required number of names
  names_m <- names_m[seq_len(num_m)]
  names_f <- names_f[seq_len(num_f)]

  # order names with gender codes from .data
  names_mf <- vector(mode = "character", length = nrow(.data))
  names_mf[m_idx] <- names_m
  names_mf[f_idx] <- names_f

  # return vector of names
  names_mf
}