beniaminogreen / zoomerjoin

Superlatively-fast fuzzy-joins in R
https://beniamino.org/zoomerjoin/
GNU General Public License v3.0

Pass weights to `jaccard_string_group()` (or speed up the function) #126

Open aidanhorn opened 1 month ago

aidanhorn commented 1 month ago

Is your feature request related to a problem? Please describe. jaccard_string_group() takes too long on 25 million rows containing about 1000 dirty categories, which should pare down to roughly 200 clean categories. It can, however, process the vector of unique dirty strings within minutes. The problem is that jaccard_string_group() does not pass a weights vector through to cluster_fast_greedy(), so every dirty string in the unique vector gets equal weight.

Describe the solution you'd like Please include an option to pass weights to jaccard_string_group().

Describe alternatives you've considered I have copied the function and tried to include this option, but I do not have Rust installed and I'm not sure how to compile everything using Rust.

Additional context

jaccard_string_group <- function(string,
                                 n_gram_width = 2,
                                 n_bands = 45,
                                 band_width = 8,
                                 threshold = .7,
                                 progress = TRUE,
                                 # new argument: weights forwarded to cluster_fast_greedy()
                                 cluster_weights = NULL) {
  if (!requireNamespace("igraph")) {
    stop("library 'igraph' must be installed to run this function")
  }

  pairs <- rust_jaccard_join(string,
    string,
    ngram_width = n_gram_width,
    n_bands = n_bands,
    band_size = band_width,
    threshold = threshold,
    progress = progress,
    seed = round(stats::runif(1, 0, 2^64))
  )

  graph <- igraph::graph_from_edgelist(pairs)
  if (packageVersion("igraph") < "2.0.0") {
    fc <- igraph::fastgreedy.community(igraph::as.undirected(graph))
  } else {
    # forward the user-supplied weights to the clustering step
    fc <- igraph::cluster_fast_greedy(igraph::as.undirected(graph), weights = cluster_weights)
  }
  groups <- igraph::groups(fc)
  lookup_table <- vapply(groups, "[[", integer(1), 1)
  membership <- igraph::membership(fc)
  return(string[lookup_table[membership]])
}
beniaminogreen commented 1 month ago

Hi, and thanks for this suggestion.

It seems like you are suggesting that we would weight each unique string by how frequently it appears in the corpus (is this correct?). When I read the documentation for the cluster_fast_greedy function, though, it looks like the weights argument is used for edge weights (weights on the connections between strings), not node weights (weights applied to each string). Are you sure that this is the right way to incorporate the weighting you describe into the clustering procedure? I may have misunderstood something.
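For illustration, here is a toy example of what cluster_fast_greedy() expects (made-up graph and weights); the weights vector has one entry per edge, not one per string:

library(igraph)

# four strings (vertices) connected by three similarity edges
g <- graph_from_literal(A - B, B - C, C - D)

# 'weights' must have length ecount(g) == 3 (one value per edge),
# not vcount(g) == 4 (one value per string)
fc <- cluster_fast_greedy(g, weights = c(1, 5, 1))
membership(fc)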

More broadly, I appreciate the insight that the function runs too slowly when we pass in inputs with exact duplicates. Would it make sense to simply remove duplicates as a pre-processing step and then run the algorithm as normal, applying no weighting?

Best, Ben

aidanhorn commented 1 month ago

Hi Ben 🙂 Yes, I want to "weight each unique string by how frequently it appears in the corpus". Thanks for the clarification that the cluster_fast_greedy function takes edge weights, not node weights.

I have sped up my processing a lot by using only the unique string vector, as you suggested. Additionally, I am dropping the least-common dirty categories, cleaning the top half with zoomerjoin, taking the unique values of that output, and mapping them back to the original unique dirty vector with a distance matrix. I then loop through each of the clean strings, mutating the big dataset with an ifelse().
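Roughly, the workflow looks like this (just a sketch; big_df, the dirty column, the top-half rule, and the use of adist() for the distance matrix are placeholders for my actual data and choices):

library(dplyr)
library(zoomerjoin)

# keep only the most common dirty categories (here: top half of categories by count)
counts <- big_df %>% count(dirty, sort = TRUE)
common <- head(counts$dirty, ceiling(nrow(counts) / 2))

# clean the common categories with zoomerjoin
clean_levels <- unique(jaccard_string_group(common))

# map every unique dirty string to its nearest clean string via an edit-distance matrix
dirty_levels <- unique(big_df$dirty)
d <- utils::adist(dirty_levels, clean_levels)
lookup <- setNames(clean_levels[apply(d, 1, which.min)], dirty_levels)

# recode the big dataset in one pass (instead of looping over ifelse() calls)
big_df <- big_df %>% mutate(clean = unname(lookup[dirty]))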

Do you think you should modify your function to strip down the vector to the unique strings, thereby improving performance?

beniaminogreen commented 1 month ago

Thanks for the suggestion - I'll edit the function so that it operates on the unique version of the inputs, as you recommend.
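The idea, roughly (a sketch only, not the actual patch; the wrapper name is just illustrative):

# cluster only the unique strings, then expand the result back to the original length
dedup_jaccard_string_group <- function(string, ...) {
  unique_strings <- unique(string)
  cleaned <- zoomerjoin::jaccard_string_group(unique_strings, ...)
  cleaned[match(string, unique_strings)]
}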

aidanhorn commented 1 month ago

Maybe we should rather keep a proportion of the duplicated strings; that would still speed up the function but bring weighting back in. This proportion could even be exposed as an option of the function. For example, if $x$ is the length of the unique vector and $y$ is the length of the dirty vector, the final length of the filtered vector could be set to $x + 0.01\times (y-x)$.
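A rough sketch of what that could look like with dplyr (tibble_in and dirty are placeholder names; 0.01 would be the exposed proportion):

library(dplyr)

# keep one copy of every distinct string plus ~1% of the remaining duplicates,
# so the filtered length is roughly x + 0.01 * (y - x)
thinned <- tibble_in %>%
    group_by(dirty) %>%
    mutate(.is_dup = row_number() > 1) %>%
    ungroup() %>%
    filter(!.is_dup | runif(n()) < 0.01) %>%
    select(-.is_dup)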

aidanhorn commented 1 month ago

Or rather, if the length of the dirty string vector exceeds a minimum threshold, then only a portion of each duplicate group should be kept. A simple rule would be to keep $\mathrm{round}\left(\log_{10}(n)+1\right)$ observations within each group, where $n$ is the group size.

aidanhorn commented 1 month ago

After grouping the strings, perhaps you could leave the logarithmic base as a parameter for the user? The concept is

library(dplyr)

tibble_in %>%
    group_by(variable) %>%
    # keep round(log_base(group size) + 1) rows from each duplicate group
    filter(row_number() <= round(log(n(), base) + 1))