aidanhorn opened this issue 1 month ago
Hi, and thanks for this suggestion.
It seems like you are suggesting that we weight each unique string by how frequently it appears in the corpus (is this correct?). When I read the documentation for the `cluster_fast_greedy` function, it looks like the `weights` argument is used for edge weights (weights for the connections between strings), not node weights (weights applied to each string). Are you sure that this is the right way to incorporate the weighting you describe into the clustering procedure? I may have misunderstood something.
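To make the edge/node distinction concrete, here is a minimal sketch (not zoomerjoin's internals) of how igraph's `cluster_fast_greedy()` consumes weights: one weight per similarity link between strings, not one per string. The strings and similarity values are invented for illustration.

```r
library(igraph)

# Hypothetical similarity graph over five strings; each edge weight is a
# pairwise similarity, NOT a string's corpus frequency.
g <- graph_from_data_frame(
  data.frame(
    from   = c("acme inc", "acme inc.", "beta llc"),
    to     = c("acme inc.", "acme incorporated", "beta l.l.c."),
    weight = c(0.9, 0.7, 0.8)
  ),
  directed = FALSE
)

# cluster_fast_greedy() accepts *edge* weights via the weights argument
communities <- cluster_fast_greedy(g, weights = E(g)$weight)
membership(communities)  # one community id per vertex (string)
```

Because the weights are attached to edges, a per-string frequency weight has no direct slot in this call, which is the point of the question above.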
More broadly, I appreciate the insight that the function runs too slowly when we pass in inputs with exact duplicates. Would it make sense to just remove duplicates as a pre-processing step and then run the algorithm as normal, applying no weighting?
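The pre-processing idea above can be sketched as follows. `cluster_strings()` here is a hypothetical stand-in for the expensive cleaning step (e.g. `jaccard_string_group()`); `toupper` is used only as a placeholder so the example runs.

```r
# Hypothetical stand-in for the expensive cleaner: any function that
# maps each unique dirty string to a clean label.
cluster_strings <- function(x) toupper(x)  # placeholder "cleaner"

dedup_then_cluster <- function(dirty, cleaner = cluster_strings) {
  uniq  <- unique(dirty)   # shrink e.g. 25M rows to ~1000 unique strings
  clean <- cleaner(uniq)   # expensive step runs once per unique string
  names(clean) <- uniq
  unname(clean[dirty])     # vectorised map back to every original row
}

dedup_then_cluster(c("acme", "acme", "beta"))
# "ACME" "ACME" "BETA" with the placeholder cleaner
```

The named-vector lookup at the end restores the original length in one pass, so the clustering cost depends only on the number of unique strings.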
Best, Ben
Hi Ben 🙂
Yes, I want to "weight each unique string by how frequently it appears in the corpus". Thanks for the clarification that the `cluster_fast_greedy` function takes edge weights, not node weights.
I have sped up my processing a lot by only using the unique string vector, as you suggested. Additionally, I am dropping the least-common dirty categories, cleaning the top half with zoomerjoin, then taking the unique values of that output and mapping them back to the original unique dirty vector with a distance matrix. I then loop through each of the clean strings, mutating the big dataset with an `ifelse()`.
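The distance-matrix map-back step described above might look roughly like this, assuming the stringdist package; the vocabulary, the `"jaccard"` method, and the `maxDist` value are illustrative choices, not the author's exact code.

```r
library(stringdist)

dirty_uniq    <- c("acme inc", "acme  inc.", "beta llc")  # original unique dirty strings
clean_strings <- c("acme inc", "beta llc")                # cleaned vocabulary

# index of the nearest clean string for each dirty string
# (Jaccard distance on character 2-grams)
idx <- amatch(dirty_uniq, clean_strings,
              method = "jaccard", q = 2, maxDist = 0.9)
mapping <- setNames(clean_strings[idx], dirty_uniq)

# a single vectorised lookup can then replace the per-string ifelse() loop:
# big_data$clean <- mapping[big_data$dirty]
```

Replacing the loop of `ifelse()` calls with one lookup against a named vector touches the big dataset only once, which matters at 25 million rows.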
Do you think you should modify your function to strip down the vector to the unique strings, thereby improving performance?
Thanks for the suggestion - I have edited the function so that it operates on the unique version of the inputs, as you recommended.
Maybe we should rather keep a proportion of the duplicated strings, which would still speed up the function, but bring weighting back in. This parameter can even be included in the options of the function. For example, if $x$ is the length of the unique vector, and $y$ is the length of the dirty vector, the final length of the filtered vector can be set to be $x + 0.01\times (y-x)$.
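The proposed target length can be computed directly from the formula above; `prop` stands in for the suggested user-facing option (0.01 in the example), and the function name is illustrative.

```r
# Keep all x unique strings plus a proportion `prop` of the (y - x)
# duplicates, so frequency information survives the filtering.
target_length <- function(dirty, prop = 0.01) {
  y <- length(dirty)          # length of the dirty vector
  x <- length(unique(dirty))  # length of the unique vector
  round(x + prop * (y - x))
}

target_length(rep(c("a", "b"), times = c(9990, 10)))  # x = 2, y = 10000 -> 102
```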
Or rather, if the length of the dirty string vector exceeds a minimum threshold, then only a proportion of each group should be kept. A simple solution would be to keep $\mathrm{round}\left(\log_{10}(n)+1\right)$ observations within each group, where $n$ is the group size.
After grouping the strings, perhaps you could leave the logarithmic base as a parameter for the user? The concept is:

```r
tibble_in %>%
  group_by(variable) %>%
  filter(row_number() <= round(log(n(), base) + 1))
```
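A tiny worked example of that rule, assuming dplyr and base 10: a group of 1000 duplicates keeps $\mathrm{round}(\log_{10}(1000)+1)=4$ rows while a group of 5 keeps 2, so common strings still carry more weight than rare ones after filtering. The data are invented.

```r
library(dplyr)

tibble_in <- tibble(variable = rep(c("acme", "beta"), times = c(1000, 5)))

kept <- tibble_in %>%
  group_by(variable) %>%
  filter(row_number() <= round(log(n(), base = 10) + 1)) %>%
  summarise(n = n())

kept  # acme keeps 4 rows, beta keeps 2
```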
**Is your feature request related to a problem? Please describe.**
`jaccard_string_group()` takes too long on 25 million rows with about 1000 dirty categories, paring down to about 200 clean categories. But it can process the unique dirty string vector within minutes. However, `jaccard_string_group()` does not pass through a weights vector to `cluster_fast_greedy()`, so all the dirty strings in the unique vector would have an equal weight.

**Describe the solution you'd like**
Please include an option to pass weights to `jaccard_string_group()`.

**Describe alternatives you've considered**
I have copied the function and tried to include this option, but I do not have Rust installed and I'm not sure how to compile everything using Rust.

**Additional context**