guess possible primary keys

moodymudskipper commented 1 year ago

I had this use case and wondered if it'd be relevant as a {dm} utility

Given an unknown data frame, what columns could form a primary key ?

I have an implementation here for local data frames but this could be adapted. We test combinations of columns starting with columns that have the most distinct values, and dismiss right away irrelevant combinations (e.g. 2 columns with 2 distinct values each cannot be a PK for a dataset of 10 rows because 2*2 < 10). We try first with 1 col, then 2, then 3, and stop there by default. We can choose to return early, as soon as we find enough candidates, or to give all possible candidates for the minimum required n of columns. There's a progress bar, which might be improved a bit.

guess_pk <- function(data, max_n = 3, max_res = Inf) { 
  n_col <- ncol(data)
  n_row <- nrow(data)
  max_n <- min(max_n, n_col)
  distincts <- purrr::map_int(data, dplyr::n_distinct)
  ordered_col_ids <- order(distincts, decreasing = TRUE)
  res <- list()
  for (n in seq_len(max_n)) {
    combs <- combn(ordered_col_ids, n)
    writeLines(sprintf("Testing %s combinations of %s", ncol(combs), n))
    pb <- progress::progress_bar$new(total = ncol(combs))
    for (i in seq_len(ncol(combs))) {
      pb$tick()
      comb <- combs[,i]
      if (prod(distincts[comb]) < n_row) next
      valid <- nrow(unique(data[,comb])) == n_row
      if (valid) {
        res[[length(res) + 1]] <- names(data)[comb]
        if (length(res) == max_res) return(res)
      }
    }
    # if we found matches for n cols, no need to check for n+1
    if (length(res) > 0) return(res)
  }
  res
}

# testing on mtcars, which doesn't actually have a proper pk if we ignore row names :

guess_pk(mtcars, max_res = 1)
#> Testing 11 combinations of 1
#> Testing 55 combinations of 2
#> [[1]]
#> [1] "qsec" "wt"

guess_pk(mtcars)
#> Testing 11 combinations of 1
#> Testing 55 combinations of 2
#> [[1]]
#> [1] "qsec" "wt"  
#> 
#> [[2]]
#> [1] "qsec" "disp"
#> 
#> [[3]]
#> [1] "qsec" "mpg" 
#> 
#> [[4]]
#> [1] "qsec" "hp"  
#> 
#> [[5]]
#> [1] "qsec" "drat"
#> 
#> [[6]]
#> [1] "qsec" "carb"
#> 
#> [[7]]
#> [1] "qsec" "cyl" 
#> 
#> [[8]]
#> [1] "qsec" "am"  
#> 
#> [[9]]
#> [1] "wt"  "mpg"

^{Created on 2023-03-01 with reprex v2.0.2}

krlmlr commented 1 year ago

We have dm::enum_pk_candidates(), but this only looks at simple keys.

A more efficient strategy for compound keys might be to look at the columns from the left (first column, then the first two, etc..)

moodymudskipper commented 1 year ago

More efficient assuming some common practice but less robust, we'll might get a 5 variables compound key where variables 3 and 5 would suffice. Or maybe we do a first pass as you suggest, we find our 5 variable compound key, and do a second pass trying to remove variables one by one, maybe starting from before last and going back to the first ? and these would eliminate in turn variables 4, 2, and 1 in this case. This will find a single solution but that seems a reasonable compromise. The function might do the second pass optionally.

cynkra / dm

guess possible primary keys #1824