TidierOrg / TidierData.jl

Tidier data transformations in Julia, modeled after the dplyr/tidyr R packages.
MIT License

request - @chainWithMutate #114

Open Lincoln-Hannah opened 2 months ago

Lincoln-Hannah commented 2 months ago

Would you consider creating a @chainWithMutate macro that differs from the standard @chain macro in one way? If a line begins with variablename = and a DataFrame is passed from the line above, the line is treated as if it started with @mutate. So instead of writing:

@chain begin
    DataFrame(a=1:10)

    @mutate  b = 2a
    @mutate  c = 3b
end

one could just write

@chainWithMutate begin
    DataFrame(a=1:10)

    b = 2a
    c = 3b
end

kdpsingh commented 1 month ago

Hi @Lincoln-Hannah, sorry for the delay in getting back to you. This is a solid idea - I want to share some initial thoughts on why @mutate() currently functions the way it does, and how we might get closer to what you are looking for.

Right now, @mutate() supports the multi-line syntax you propose here, but it doesn't support cases where one argument relies on a variable created by a previous argument. In the example above, c = 3b relies on b, which was created in the previous argument. This limitation is inherited from DataFrames.transform() and is intentional: it exists for performance, since DataFrames assumes that the arguments can be parallelized and thus run faster.
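
To make the limitation concrete, here is a minimal sketch that calls DataFrames.transform() directly, outside of TidierData. The variable names (single_call_failed, step1, step2) are made up for illustration. It only shows that a single call cannot see a column created earlier in the same call, while two separate calls can.

```julia
using DataFrames

df = DataFrame(a = 1:10)

# One call: every transformation reads from the original `df`,
# so column `b` does not exist yet when the second pair is evaluated.
single_call_failed = try
    transform(df, :a => ByRow(x -> 2x) => :b,
                  :b => ByRow(x -> 3x) => :c)
    false
catch
    true
end

# Two calls: the second call sees the column added by the first.
step1 = transform(df, :a => ByRow(x -> 2x) => :b)
step2 = transform(step1, :b => ByRow(x -> 3x) => :c)
```

This is the same split that writing two separate @mutate lines produces in the original example.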

There are 2 ways that we could fix this:

  1. Implement the @chainWithMutate macro you propose above. I don't like the name (because it would be used inside an existing @chain macro), but we could consider an alternative name like @mutates(), where the s makes it look plural and stands for "sequential".
  2. The second approach, which I would strongly prefer, is for the @mutate() macro to analyze the variables being created (e.g., b and c) and the variables being used (e.g., a and b), and to automatically run the arguments sequentially in separate calls to DataFrames.transform() if a dependency is detected.

This would be more of a new feature than a bug-fix, so it's slightly lower priority, but I think that option 2 is doable and is something we should pursue.
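
As a rough illustration of option 2, the dependency check could be sketched on plain Julia expressions using only Base. The helper names used_symbols! and split_into_batches are hypothetical and are not part of TidierData.jl. The sketch assumes each argument has the simple form name = expression, and it starts a new batch whenever an argument uses a name created earlier in the current batch. Each batch would then map to one call to DataFrames.transform().

```julia
# Hypothetical helpers for illustration; not part of TidierData.jl.

# Collect every symbol used inside an expression.
function used_symbols!(acc::Set{Symbol}, ex)
    if ex isa Symbol
        push!(acc, ex)
    elseif ex isa Expr
        # In a call such as `2 * a`, the first argument is the function name,
        # not a column, so it is skipped.
        args = ex.head === :call ? ex.args[2:end] : ex.args
        for arg in args
            used_symbols!(acc, arg)
        end
    end
    return acc
end

# Group `name = expr` assignments into batches that can each run in one call.
function split_into_batches(assignments::Vector{Expr})
    batches = Vector{Vector{Expr}}()
    current = Expr[]
    created = Set{Symbol}()   # names created in the current batch
    for ex in assignments
        lhs, rhs = ex.args[1], ex.args[2]
        uses = used_symbols!(Set{Symbol}(), rhs)
        if !isempty(intersect(uses, created))
            # Dependency on the current batch: close it and start a new one.
            push!(batches, current)
            current = Expr[]
            created = Set{Symbol}()
        end
        push!(current, ex)
        push!(created, lhs)
    end
    isempty(current) || push!(batches, current)
    return batches
end

batches = split_into_batches([:(b = 2a), :(c = 3b), :(d = a + 1)])
```

Here b = 2a ends up alone in the first batch, while c = 3b and d = a + 1 share the second, because neither depends on the other.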

Lincoln-Hannah commented 1 month ago

Option 2 is fine. Thank you for considering it.

@chainWithMutate would be more difficult to implement. The idea is that it can be used instead of a @chain macro (not sit within one). All the other macros would work within it, but if a line started with variable =, it would be treated as a @mutate line. I found that 2/3 of the lines I write within a @chain block are @mutate lines, and often they are interspersed with @filter and other macros. It would just be cleaner if I didn't have to keep repeating @mutate.

@chainWithMutate begin
    DataFrame(a=1:10)

    b = 2a

    @filter b > 10

    c = 2b
end
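
The rewriting described above can be sketched with Base Julia alone. The helper with_implicit_mutate is a hypothetical name and is not part of TidierData.jl or Chain.jl. It assumes that only lines of the simple form name = expression should become @mutate lines, and it leaves every other line untouched. The macro at the end only works where @chain and @mutate are already in scope at the call site.

```julia
# Hypothetical helper for illustration; not part of TidierData.jl or Chain.jl.
# Rewrites each bare `name = expr` line of a block into `@mutate name = expr`
# and wraps the result in a `@chain` call.
function with_implicit_mutate(block::Expr)
    block.head === :block || throw(ArgumentError("expected a begin ... end block"))
    new_lines = map(block.args) do line
        if line isa Expr && line.head === :(=) && line.args[1] isa Symbol
            Expr(:macrocall, Symbol("@mutate"), LineNumberNode(0), line)
        else
            line   # other macros, plain calls, and line-number nodes pass through
        end
    end
    return Expr(:macrocall, Symbol("@chain"), LineNumberNode(0),
                Expr(:block, new_lines...))
end

# Assumed wrapper: expands in the caller's scope, where `@chain` and
# `@mutate` must already be available.
macro chainWithMutate(block)
    esc(with_implicit_mutate(block))
end

# Inspect the rewritten expression without running it.
rewritten = with_implicit_mutate(quote
    DataFrame(a = 1:10)
    b = 2a
    @filter b > 10
    c = 2b
end)
```

Printing rewritten shows the two assignment lines prefixed with @mutate and the @filter line unchanged. Broadcast assignments, destructuring, and other assignment forms are not handled by this sketch.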

kdpsingh commented 1 month ago

Ah, I see what you mean. We probably won't add this macro to the package, but it's definitely doable. I can try to put together a code snippet as a starting point if that would be of interest.

Lincoln-Hannah commented 1 month ago

Very much so. I really think if people used it, they would like it. There are so many @chain blocks I've written with lots of @mutate lines interspersed with @filter, @pivot, and @join lines. Not having to write @mutate every time would save a lot of code.