beniaminogreen / zoomerjoin

Superlatively-fast fuzzy-joins in R
https://beniamino.org/zoomerjoin/
GNU General Public License v3.0

[FR] Allow inequality conditions in `block_by` #100

Open etiennebacher opened 7 months ago

etiennebacher commented 7 months ago

**Is your feature request related to a problem? Please describe.**
It is nice to be able to use `block_by` to filter out some comparisons before computing the string similarity. Currently, it is limited to equality conditions (e.g. rows must have the same year to be considered for matching). I have a setting in which I don't want to compare rows if `year.y` is larger than `year.x`, i.e. I'd like to block matching by the condition `year.x > year.y`.

I don't know how hard that would be to implement. I could trace the usage of `block_by` (renamed `salt` in the internals) up to the `new()` method for the `Shingleset` struct in Rust, but I don't know how that works exactly.

**Describe the solution you'd like**
I'm not sure of the syntax this should take. It could be something similar to `dplyr::join_by()`, so that one could do `block_by = block_by(year.x > year.y)`. Also, that only works if the variable has a different name in each dataset.
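For concreteness, here is a rough sketch of what such a call could look like. The `block_by()` helper with an inequality is purely hypothetical, and the data frames (`df_x`, `df_y`) and the `name` column are assumptions used only for illustration:

```r
# Hypothetical syntax sketch: inequality blocking does not exist today.
library(zoomerjoin)

matches <- jaccard_inner_join(
  df_x, df_y,                           # assumed example data frames
  by = "name",                          # assumed fuzzy-matched column present in both tables
  block_by = block_by(year.x > year.y)  # proposed inequality condition, not current API
)
```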

**Describe alternatives you've considered**
Currently I don't use `block_by`. Instead, I match on the full data and filter out rows by my condition afterwards.
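A minimal sketch of this workaround, assuming the fuzzy-matched column is called `name` in both tables and the year columns come out of the join as `year.x` and `year.y`:

```r
# Current workaround (sketch): fuzzy-join on the name column only, then
# drop the pairs that violate the inequality with dplyr::filter().
library(zoomerjoin)
library(dplyr)

matched <- jaccard_inner_join(df_x, df_y, by = "name")  # df_x / df_y: assumed inputs

matched_filtered <- matched |>
  filter(year.x > year.y)  # keep only pairs satisfying the condition
```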

**Additional context**
/

beniaminogreen commented 6 months ago

This is a very interesting feature request, but I think that it could be tough to implement. It might also be too far outside what I see as the main goals for the package, namely to accelerate big-data joins with locality sensitive hashing techniques.

I am ambivalent about the current blocking functionality in the package and if I could go back in time, I might not have included it as it creates extra code to maintain. If it is useful to people (and it sounds like some are using it), I would also be happy to change my mind about it.

The reason I originally included the blocking is that it's easy to implement / add on to the existing Locality Sensitive Hashing algorithms. Internally the blocking variables are called "salts" because they are literally used to salt hashes. This ensures that only units with the same salting / blocking variables produce hash collisions, which is a prerequisite for being compared as a potential match.
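For intuition, here is a toy illustration of the salting idea in R (this is not the package's Rust internals, and the `digest` package is used only to get a concrete hash function): mixing the blocking value into the key before hashing means two records can only share a hash bucket when their blocking values agree.

```r
# Toy illustration of salting a hash; not zoomerjoin's actual implementation.
library(digest)

salted_hash <- function(key, salt) {
  digest(paste(salt, key, sep = "|"), algo = "xxhash32")
}

salted_hash("Ben", salt = 1990) == salted_hash("Ben", salt = 1990)  # TRUE: same salt, can collide
salted_hash("Ben", salt = 1990) == salted_hash("Ben", salt = 2000)  # FALSE: different salt, never in the same bucket
```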

If we want to add an inequality constraint, it seems much tougher to do this within the locality sensitive hashing framework. If we can find a way to do it, I would love to have a try at implementing it. I can ask a professor I know about this possibility.

If we can't find a way to implement the inequality blocking as part of the locality sensitive hashing, we would likely have to add on the inequality as a filter step once the matching rows have been returned from the LSH algorithm. In this case, I think we shouldn't call it blocking as much as we should include it as part of the by argument (i.e. by = by(name_1 == name_2, year_1 > year_2)). I think that I would be very slightly less enthusiastic about this approach, but I would also be happy to implement it if you think there is sufficient demand for this kind of functionality.

Best, Ben

etiennebacher commented 6 months ago

Thanks for your detailed answer! I have to think more about this (also curious to know what the prof you mention would say).

> If we can't find a way to implement the inequality blocking as part of the locality sensitive hashing, we would likely have to add on the inequality as a filter step once the matching rows have been returned from the LSH algorithm.

I don't think this is worth implementing in the package. Blocking on inequality conditions would be a nice feature of the package IMO, because it reduces the work that the LSH has to do, but filtering afterwards based on some condition is a simple operation and is not related to LSH anymore.

beniaminogreen commented 6 months ago

Gotcha. Sounds like we are totally of the same mind about this. I will continue searching to see if there's a way to integrate the inequality constraints with the LSH and will keep you posted if I find anything. Until then, I won't try and add any filters to the package.

beniaminogreen commented 2 months ago

I just realized that you can do this by ordering the rows you hash and performing the hashing in sequence, but unfortunately the sequential operations mean that this can't be parallelized (so it may not be faster than the filtering approach on a powerful setup).

Basically, if you want to ensure that a field year.x in dataset A is smaller than another field, year.y, in dataset B, you can order the rows of A and B by the year fields and then hash in order. Take the following datasets as an example (I have ordered them already for simplicity):

Dataset A:

| Name | Year.x |
|------|--------|
| Ben  | 1990   |
| Ben  | 2000   |
| Jack | 2011   |

Dataset B:

| Name | Year.y |
|------|--------|
| Ben  | 1980   |
| Ben  | 2010   |
| Jack | 2020   |

If you hash the keys in order, the following happens:

  1. We hash row one of dataset B. It collides with no rows of A and can be immediately discarded (it isn't stored in a hash table).
  2. We hash row one of dataset A. It has no collisions yet, but it could match a later row of dataset B, so we store it in a hash table to be retrieved later.
  3. We hash row two of dataset A. It has no collisions yet, but it could match a later row of dataset B, so we store it in the hash table.
  4. We hash row two of dataset B. It collides with rows one and two of dataset A, so we record these pairs as matches.
  5. We hash row three of dataset A. It has no collisions yet, but it could match a later row of dataset B, so we store it in the hash table.
  6. We hash row three of dataset B. It collides with row three of dataset A, so we record this as a matching pair.

This works because at each point, only rows in A with a value of year less than the current value are stored in the hash table. We are only ever comparing rows of B to rows of A with a smaller value of year, enforcing the inequality constraint.
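Here is a small R sketch of this ordered-scan idea, using exact equality on `Name` as a stand-in for the LSH hash and enforcing `Year.x < Year.y` as in the example above. The real version would live in the Rust internals; this is only meant to show the sequencing.

```r
# Ordered-scan sketch: exact equality on Name stands in for the LSH hash,
# and the enforced inequality is Year.x < Year.y.
a <- data.frame(Name = c("Ben", "Ben", "Jack"), Year.x = c(1990, 2000, 2011))
b <- data.frame(Name = c("Ben", "Ben", "Jack"), Year.y = c(1980, 2010, 2020))

# Interleave the rows of both datasets in increasing order of year.
queue <- rbind(
  data.frame(src = "A", row = seq_len(nrow(a)), key = a$Name, year = a$Year.x),
  data.frame(src = "B", row = seq_len(nrow(b)), key = b$Name, year = b$Year.y)
)
# On ties, process B before A so that equal years never match (strict <).
queue <- queue[order(queue$year, queue$src == "A"), ]

seen_a  <- list()  # the "hash table": key -> indices of A rows hashed so far
matches <- list()

for (i in seq_len(nrow(queue))) {
  key <- as.character(queue$key[i])
  if (queue$src[i] == "A") {
    # Store A rows so that later (larger-year) B rows can collide with them.
    seen_a[[key]] <- c(seen_a[[key]], queue$row[i])
  } else {
    # A row of B can only collide with A rows already stored, i.e. rows of A
    # with a strictly smaller year, so the inequality holds by construction.
    hits <- seen_a[[key]]
    if (!is.null(hits)) {
      matches[[length(matches) + 1]] <- data.frame(a_row = hits, b_row = queue$row[i])
    }
  }
}

do.call(rbind, matches)
# Expected: A rows 1 and 2 match B row 2; A row 3 matches B row 3.
```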

As I mentioned before, I don't think this can be easily parallelized because the order in which we do the hashing now matters. If there are many rows in each dataset with the same value of year, then it might be possible to parallelize over these observations.

etiennebacher commented 2 months ago

Thank you for the potential solution. I must say I don't have a use case for this anymore, so I won't try to benchmark the two approaches.

I'm wondering if using polars as a backend would make things faster. There is a mechanism to write extensions in Rust that can be applied directly to polars DataFrames and LazyFrames (so far this is only available in Python Polars, but I hope such extensions will soon be possible in r-polars).

beniaminogreen commented 2 months ago

I would be very interested in trying this out! My only concern right now is that we are very close to the upper limit (5MB) that CRAN imposes on vendored dependencies, and I worry that adding the polars crate could push us over the edge. Do you have any experience distributing packages with large vendored dependencies on CRAN? Does the 5MB limit pose a serious constraint in practice, or are there ways to work around it?

best, Ben

etiennebacher commented 2 months ago

> My only concern right now is that we are very close to the upper limit (5MB) that CRAN imposes on vendored dependencies, and I worry that adding the polars crate could push us over the edge

True, this could be an issue, although I'm not sure how many of polars' several "subcrates" would be needed for this. Let's see in a few weeks / months if we have a working mechanism for extensions in r-polars.

> Do you have any experience distributing packages with large vendored dependencies on CRAN? Does the 5MB limit pose a serious constraint in practice, or are there ways to work around it?

I personally don't, but I know some people who have packages > 5MB on CRAN. I think you have to argue on submission why this is reasonable (provided that you don't go over 20MB, for instance). I also know there were some discussions about downloading crates from a "reliable source" (according to CRAN), and I think Zenodo was mentioned as such a source. I think the discussion is on the R-devel list from the last few months.