Closed rdboyes closed 1 year ago
Thank you so much for working on this. I'm going to work on fixing the parsing engine next then will come back to this.
There actually is a way to implement join_by as a "fake" function in a similar way to how across()
and desc()
are implemented. Once I remove all the string parsing, this will become more clear. We will do the same thing for the join macros.
One general design philosophy I'm trying to follow is that top-level code consists of macros, but helper code is all functions.
I'll take a look at how you did across()
/ desc()
- I'm still learning the ropes, Julia-wise :)
A more general organization question - should the join macros live in their own file? I can see the main Tidier.jl file getting unmanageably long very soon if more functionality continues to be added to the package
You're doing great. I would suggest waiting until I refactor. Hoping to finish up refactoring this weekend. Should make everything easier to read, including my implementation of across()
and desc()
.
Missed this, but everything works now with a function version of join_by
!
Thanks for your patience. I wanted to rewrite the parsing engine before looking at your code.
This looks wonderful. I haven't tested it on my machine yet but did get a chance to read through it and the tests you added. Looks good, and I think there are a few minor tweaks I may make before merging.
Had one question: this code doesn't handle multiple explicitly provided columns for joins, right?
For example, if someone wants to do join_by(a == b, c == d), or @left_join(df1, df2, (a, b)), I don't think that would work. I realize I left this out of my initial issue comment but is something I want to support.
I'm happy to add this to your PR if you're okay with that since I just added this functionality to across(). If you prefer to add it, happy to let you have a crack at it.
Once left join is finished, should be straightforward to add the rest!
Thanks again.
For example, if someone wants to do join_by(a == b, c == d), or @left_join(df1, df2, (a, b)), I don't think that would work. I realize I left this out of my initial issue comment but is something I want to support.
The parse_join_by
function in #12 addresses the multiple columns join. Not sure if it's the way we wish to approach but happy to work together!
Thanks @zzhaozheUM. I have never had to deal with competing PRs before, but I suppose this is a good problem to have.
My suggestion would be to put further commits on hold: I'm going to look through both PRs and meld them together in a way that makes them consistent with the rest of the package (following my rewrite).
Thank you to @rdboyes and @zzhaozheUM.
I'll plan to work off of this PR and will borrow parts of the other PR to make this work.
I may also build in support for FlexiJoins.jl, which I just came across, which appears to already support non-equi-joins (similar in many ways to the newest dplyr): https://aplavin.github.io/FlexiJoins.jl/test/examples.html
Sounds good! You’re correct about the code not currently supporting multiple specified columns. That was the next thing I was going to add, but figured I’d put it on hold in case the parsing rewrite meant that the code would have to change significantly. I’ll wait for your merge with the other PR - happy to work together @zzhaozheUM , I just wanted to work in my own branch so I could get familiar with how the code works.
Implemented a different way of doing this in PR #30.
This adds a first pass at left_join macros to enable the behaviour mentioned in https://github.com/kdpsingh/Tidier.jl/issues/11 along with unit tests that verify that the macros work as expected for these initial cases. Exports two new macros,
@left_join
and@join_by
- if we want to be able to use syntax likejoin_by("id1" == "id2")
I think join_by needs to be a macro (or at least, I don't know how to make it work as a function).Still to be added/tested: