Open etiennebacher opened 7 months ago
This is a very interesting feature request, but I think that it could be tough to implement. It might also be too far outside what I see as the main goals for the package, namely to accelerate big-data joins with locality sensitive hashing techniques.
I am ambivalent about the current blocking functionality in the package and if I could go back in time, I might not have included it as it creates extra code to maintain. If it is useful to people (and it sounds like some are using it), I would also be happy to change my mind about it.
The reason I originally included the blocking is that it's easy to implement / add on to the existing Locality Sensitive Hashing algorithms. Internally the blocking variables are called "salts" because they are literally used to salt hashes. This ensures that only units with the same salting / blocking variables can produce hash collisions, which is a prerequisite for being compared as a potential match.
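As a rough illustration of the salting idea (this is a sketch, not the package's actual Rust internals; the helper name and bucket count are made up):

```python
import hashlib

def salted_bucket(salt, key, n_buckets=1024):
    # Hypothetical helper: prepend the blocking ("salt") value to the hashed
    # input, so records with different blocking values are hashed from
    # different inputs and cannot be forced into the same candidate set.
    digest = hashlib.sha256(f"{salt}|{key}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

# Records blocked on year: the same salt and key always share a bucket,
# so identically-blocked units remain candidates for comparison...
assert salted_bucket(1990, "Ben") == salted_bucket(1990, "Ben")
# ...while a different salt changes the hashed input entirely, so a pair
# with mismatched blocking values is never considered a potential match.
```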
If we want to add an inequality constraint, it seems much tougher to do this within the locality sensitive hashing framework. If we can find a way to do it, I would love to have a try at implementing it. I can ask a professor I know about this possibility.
If we can't find a way to implement the inequality blocking as part of the locality sensitive hashing, we would likely have to add the inequality as a filter step once the matching rows have been returned from the LSH algorithm. In this case, I think we shouldn't call it blocking so much as we should include it as part of the `by` argument (i.e. `by = by(name_1 == name_2, year_1 > year_2)`). I think I would be very slightly less enthusiastic about this approach, but I would also be happy to implement it if you think there is sufficient demand for this kind of functionality.
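For concreteness, the filter-step variant amounts to a post-hoc pass over the candidate pairs (a sketch with made-up data, not the package's API):

```python
# Candidate pairs as they might come back from the equality-blocked LSH step
# (made-up data for illustration; each record is a (name, year) tuple).
candidate_pairs = [
    (("Ben", 1990), ("Ben", 1980)),
    (("Ben", 1990), ("Ben", 2010)),
    (("Jack", 2011), ("Jack", 2020)),
]

# Apply the inequality afterwards, mirroring the proposed year_1 > year_2 term.
kept = [(a, b) for a, b in candidate_pairs if a[1] > b[1]]
print(kept)  # only the ("Ben", 1990) / ("Ben", 1980) pair survives
```

The downside, as noted above, is that the LSH step still generates and scores every pair that the inequality would later discard.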
Best, Ben
Thanks for your detailed answer! I have to think more about this (also curious to know what the prof you mention would say).
> If we can't find a way to implement the inequality blocking as part of the locality sensitive hashing, we would likely have to add the inequality as a filter step once the matching rows have been returned from the LSH algorithm.
I don't think this is worth implementing in the package. Blocking on inequality conditions would be a nice feature of the package IMO, because it reduces the work that the LSH has to do, but filtering afterwards based on some condition is a simple operation and is not related to LSH anymore.
Gotcha. Sounds like we are totally of the same mind about this. I will continue searching to see if there's a way to integrate the inequality constraints with the LSH and will keep you posted if I find anything. Until then, I won't try and add any filters to the package.
I just realized that you can do this by ordering the rows you hash and performing the hashing in sequence, but unfortunately the sequential operations mean that this can't be parallelized (so it may not be faster than the filtering approach on a powerful setup).
Basically, if you want to ensure that a field `year.x` in dataset A is smaller than another field `year.y` in dataset B, you can order the rows of A and B by the year fields, and then hash in order. Take the following datasets as an example (I have ordered them already for simplicity):
Dataset A:

| Name | Year.x |
| --- | --- |
| Ben | 1990 |
| Ben | 2000 |
| Jack | 2011 |
Dataset B:

| Name | Year.y |
| --- | --- |
| Ben | 1980 |
| Ben | 2010 |
| Jack | 2020 |
If you hash the keys in order, this works because at each point, only rows in A with a value of `year` less than the current value are stored in the hash table. We are only ever comparing rows of B to rows of A with a smaller value of `year`, enforcing the inequality constraint.
As I mentioned before, I don't think this can be easily parallelized because the order in which we do the hashing now matters. If there are many rows in each dataset with the same value of `year`, then it might be possible to parallelize over these observations.
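A minimal sketch of this ordered-hashing idea (exact-name lookup stands in for the LSH buckets here; the data comes from the toy tables above):

```python
from collections import defaultdict

A = [("Ben", 1990), ("Ben", 2000), ("Jack", 2011)]  # (name, year.x), pre-sorted
B = [("Ben", 1980), ("Ben", 2010), ("Jack", 2020)]  # (name, year.y), pre-sorted

table = defaultdict(list)  # stands in for the LSH hash table, keyed on name
matches = []
i = 0
for name_b, year_b in B:
    # Insert every A row whose year is strictly below the current B row's year.
    while i < len(A) and A[i][1] < year_b:
        table[A[i][0]].append(A[i])
        i += 1
    # Probe: only A rows already inserted (i.e. with year.x < year.y) can match.
    for name_a, year_a in table[name_b]:
        matches.append(((name_a, year_a), (name_b, year_b)))

print(matches)
# [(('Ben', 1990), ('Ben', 2010)), (('Ben', 2000), ('Ben', 2010)),
#  (('Jack', 2011), ('Jack', 2020))]
```

Because inserts and probes interleave in year order, every emitted pair satisfies the inequality by construction; no filtering step is needed afterwards.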
Thank you for the potential solution. I must say I don't have a use case for this anymore, so I won't try to benchmark the two approaches.
I'm wondering if using `polars` as a backend would make things faster. There is a mechanism to write extensions in Rust that can be applied directly on `polars` DataFrames and LazyFrames (so far this is only available in Python Polars, but I hope those extensions will soon be possible in r-polars).
I would be very interested in trying this out! My only concern right now is that we are very close to the upper limit (5MB) that CRAN imposes on vendored dependencies, and I worry that adding the `polars` crate could push us over the edge. Do you have any experience distributing packages with large vendored dependencies on CRAN? Does the 5MB limit pose a serious constraint in practice, or are there ways to work around it?
Best, Ben
> My only concern right now is that we are very close to the upper limit (5MB) that CRAN imposes on vendored dependencies, and I worry that adding the `polars` crate could push us over the edge.
True, this could be an issue, although I'm not sure how many of `polars`' several "subcrates" would be needed for this. Let's see in a few weeks / months if we have a working mechanism for extensions in r-polars.
> Do you have any experience distributing packages with large vendored dependencies on CRAN? Does the 5MB limit pose a serious constraint in practice, or are there ways to work around it?
I personally don't, but I know some people who have packages > 5MB on CRAN. I think you have to argue on submission why this is reasonable (provided that you don't go over 20MB, for instance). I also know there were some discussions about downloading crates from a "reliable source" (according to CRAN), and I think Zenodo was mentioned as such a source. I believe the discussion is on the R-devel list from the last few months.
**Is your feature request related to a problem? Please describe.**
It is nice to be able to use `block_by` to filter out some comparisons before computing the string similarity. Currently, it is limited to equality conditions (e.g. rows must have the same year to be considered for matching). I have a setting in which I don't want to compare rows if `year.y` is larger than `year.x`, i.e. I'd like to block matching by the condition `year.x > year.y`. I don't know how hard that would be to implement. I could track the usage of `block_by` (renamed `salt` in the internals) up to the `new()` method for the `Shingleset` struct in Rust, but I don't know how that works exactly.

**Describe the solution you'd like**
Not sure of the syntax this should take. It could be something similar to `dplyr::join_by()`, so that one could do `block_by = block_by(year.x > year.y)`. Also, that only works if the variable has a different name in each dataset.

**Describe alternatives you've considered**
Currently I don't use `block_by`. Instead I match on the full data and filter out by my condition afterwards.

**Additional context** /