JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/
Other
1.71k stars 360 forks source link

Anti join not working as intended for joins on multiple keys? #821

Closed amellnik closed 9 years ago

amellnik commented 9 years ago

This may be some confusion due to my inexperience with database style joins, but it appears that the anti join is not working as advertised. The doc says that

Anti: The output contains rows for values of the key that exist in the first (left) but not the second (right) argument to join. As with semi joins, output is restricted to columns from the first (left) argument.

And it works as expected for single key joins:

df1 = DataFrame(A = [1,2,3,4,5], B = ["a", "b", "c", "d", "e"])
df2 = DataFrame(A = [1,3,4], B = ["a", "b", "y"])
join(df1, df2, on=[:A], kind=:anti)
Row A B
1 2 "b"
2 5 "e"

For joins on multiple keys I would expect it would act as if I was joining on a single key which is every possible linear combination of component keys. For the example here I would expect this to return a single row, but that's not the case:

df1 = DataFrame(A = [1,2,3,4,5], B = ["a", "b", "c", "d", "e"])
df2 = DataFrame(A = [1,3,4], B = ["a", "b", "y"])
println(join(df1, df2, on=[:A,:B], kind=:anti))
Row A B
1 2 "b"
2 3 "c"
3 4 "d"
4 5 "e"

Is this a bug or is there something I'm missing?

johnmyleswhite commented 9 years ago

I kind of suspect every join on multiple keys is broken in some circumstances. Would be good to know if the problem is specific to anti-join or not.

garborg commented 9 years ago

The current behavior seems right to me.

Let's look at the matching semi-join:

join(df1, df2, on=[:A,:B], kind=:semi)
1x2 DataFrame
| Row | A | B   |
|-----|---|-----|
| 1   | 1 | "a" |

That's the only match between the the dfs on columns A & B, so so far so good, and I'm not familiar with relational algebra, so there may be better terms, but the results of corresponding semi- and anti- joins are disjoint and the union of the two result sets is identical (w/o row order) to the original table.

Does that make sense, or am I missing something in the example?

amellnik commented 9 years ago

@garborg -- Sorry, I screwed up and confused semi and anti -- I just realized that my example is showing the expected behavior, but my actual data set appears to not be working. Let me take another look and see if I can reproduce or if I'm just misunderstanding the data.

garborg commented 9 years ago

Excellent, please post your findings.

Be on the lookout for this pretty bad bug (maybe the same underlying machinery) if your data set has many unique keys: https://github.com/JuliaStats/DataFrames.jl/pull/815#issuecomment-112977331. (It would be great to merge a fix for the linked bug as soon as fixes are in place to get CI passing. Unfortunately I'm rather busy with unrelated work, and don't have cycles for the context switch now, so I can't drive the fix.)

amellnik commented 9 years ago

To follow up, this whole thing was based on my misunderstanding of the data set I was using -- the duplicates I was trying to work on were already being filtered by a flag that wasn't working as advertised. I'll keep an eye out for that bug. One thing I have noticed is sometimes my kernel dies when I run complex by's. It's only happened three times so far and I assumed it was related to the absurdly-large size of this data set, but I'll keep an eye out and see if I can figure out anything more specific. Thanks -A

garborg commented 9 years ago

Thanks for the followup, @amellnik.