JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/
Other
1.72k stars 367 forks source link

Add arbitrary matching function for joins #3445

Open Gregstrq opened 2 months ago

Gregstrq commented 2 months ago

I have seen the discussion in #3363, and it was mentioned that adding arbitrary matching functions is part of 1.7 milestone. I would like to ask, what is the status of this feature request, are there any plans to implement it soon?

In principle, I could try to cook up a PR on my own, but I am not that familiar with DataFrames internals. Also the details of the syntax should be discussed in advance.

So, what is the principial difficulty with adding this feature? The way I see it, we just need to overload the equality comparison, which is used to relate the rows from the two DataFrames. Everything else, including the validation can stay without changes.

Now, there is a separate matter of the transformation of the columns, that was requested in #3363. I think, it makes sense to distinguish transformation and matching. We should really only match the columns of the same type between two dataframes, because we also need to match different rows inside the same dataframe to do the validation. And transformation should only provide the means to bring the columns to the same type: we first transform columns in one or both dataframes to the same type, and then run matching on the results of the transforms.

Transformaton can be already done by the users themselves simply by adding extra column. On the other hand, it is the arbitrary matching that requires tinkering with internals, but treating it separately from transformation simplifies the realization.

What do you think about it?

bkamins commented 2 months ago

See also https://github.com/JuliaData/DataFrames.jl/issues/2738 for a related discussion.

Two key things are:

  1. syntax
  2. in general this is O(n^2) operation, but we could improve the performance in many cases; however, to allow for this detection we need syntax (see point 1. above) that would in the future allow to distinguish slow from fast operations.
Gregstrq commented 2 months ago

in general this is O(n^2) operation, but we could improve the performance in many cases; however, to allow for this detection we need syntax (see point 1. above) that would in the future allow to distinguish slow from fast operations.

I don't understand this part. You need to do equality check to match the rows in two dataframes or to do validation. Equality check is a binary operation. What difference would it make to swap one binary operation with another one?

Could you may be provide the examples of slow and fast joins in the existing implementation?

  1. syntax

Why not simply pass :column_left=>binary_function=>:column_right to the on keyword?

bkamins commented 1 month ago

Could you may be provide the examples of slow and fast joins in the existing implementation?

Existing implementation is always fast because it uses equality that is hashable (in short, as we do not always actually hash things). If you can hash you can use dictionary for mapping.

However, for a general equality operator it does not have to be hashable, as e.g. it might not be transitive. In that case you need O(n^2) comparisons. An example of such equality operator is isapprox (a quite common case) or approximate distance matching of strings.

Why not simply pass :column_left=>binary_function=>:column_right to the on keyword?

Because, in general, it does not allow us to check if binary_function is a proper equality (i.e. if it transitive, reflexive and symmetric). We would need some syntax that would signal this and allow to e.g. hash :column_left to e.g. a dict and then do lookup o :column_right in this dict (this is just one example strategy that could be used, other are sorting, but then things would need to be sortable).