AndriSignorell / DescTools

Tools for Descriptive Statistics and Exploratory Data Analysis
http://andrisignorell.github.io/DescTools/
86 stars 18 forks source link

Permn with repeated values could be much faster #58

Open DarwinAwardWinner opened 4 years ago

DarwinAwardWinner commented 4 years ago

I found the Permn function in this package, which handles repeated values by only returning unique permutations. However, I looked at the implementation, and it still expands the full N! permutations before deduplicating them, which makes it very slow even in certain cases where the return value is not very large, e.g. Permn(c(rep(1, 10), 2)). I made an implementation of this that runs in time approximately proportional to the size of the output, but it doesn't currently support the bells and whistles of Permn, and the interface and implementation are a bit rough. If you are interested in using this, please let me know and I will clean it up and create a pull request. Or you can adapt the algorithm on your own.

library(dplyr)
all_unique_perms <- function(
    x, max_allowed_perms = 1e6, 
    # This argument is used only for recursion
    x_table
) {
    if (!missing(x)) {
        assert_that(!any(is.na(x)))
        if (!missing(x_table)) {
            stop("Cannot accept both x and x_table arguments")
        }
        x_table <- tibble(x=x) %>%
            group_by(x) %>% 
            summarise(count = n(), .groups = "drop") %>%
            arrange(desc(count)) %>%
            filter(count > 0)
        if (is.finite(max_allowed_perms)) {
            assert_that(factorial_product_ratio(sum(x_table$count), x_table$count) <= max_allowed_perms)
        }
    }
    if (nrow(x_table) == 0) {
        # x is empty
        return(matrix(x_table$x, nrow = 0, ncol = 0))
    } else if (nrow(x_table) == 1) {
        # Base case: x is all identical, so only 1 unique permutation
        return(rbind(rep(x_table$x, x_table$count), deparse.level = 0))
    } else if (all(x_table$count == 1)) {
        # Base case: all values distinct, so just do regular permutation
        return(Permn(x_table$x))
    } else if (nrow(x_table) == 2) {
        # Base case: x has 2 distinct values, so permutation is equivalent to
        # finding all unique ways to thread one of the values between the other.
        combs <- combn(sum(x_table$count), x_table$count[1])
        # Initialize with all 2nd value, then fill in 1st value wherever the
        # above combinations say to
        result <- matrix(x_table$x[2], nrow = ncol(combs), ncol=sum(x_table$count))
        i1 <- cbind(as.vector(col(combs)), as.vector(combs))
        result[i1] <- x_table$x[1]
        return(result)
    } else {
        # Recursive case: x has 3 or more distinct values and contains repeated
        # values.
        FirstVsRest_table <- x_table %>% 
            group_by(x = x != x[1]) %>% 
            summarise(count = sum(count), .groups = "drop")
        rest_table <- x_table[-1,]
        FirstVsRest_perms <- all_unique_perms(x_table = FirstVsRest_table)
        rest_perms <- all_unique_perms(x_table = rest_table)
        full_perms <- matrix(
            x_table$x[1], 
            ncol = ncol(FirstVsRest_perms), 
            nrow = nrow(FirstVsRest_perms) * nrow(rest_perms))
        for (FvR_row in seq_len(nrow(FirstVsRest_perms))) {
            for (rest_row in seq_len(nrow(rest_perms))) {
                dest_row <- (FvR_row-1) * nrow(rest_perms) + rest_row 
                full_perms[dest_row, FirstVsRest_perms[FvR_row,]] <- rest_perms[rest_row,]
            }
        }
        return(full_perms)
    } 
}
AndriSignorell commented 4 years ago

This could be a good approach, would you "clean it up"? (possibly in base R...) ;-)