JuliaData / DataFramesMeta.jl

Metaprogramming tools for DataFrames
https://juliadata.github.io/DataFramesMeta.jl/stable/

More examples in docstrings #321

Open mahiki opened 2 years ago

mahiki commented 2 years ago

It would be very helpful if each macro/command had an example in the docstring.

For example, I've been having a lot of trouble using @rtransform to make a new column based on conditional aspects of other columns at the row level. The docstring in the REPL is:

help?> @rtransform
  @rtransform(x, args...)

  Row-wise version of @transform, i.e. all operations use @byrow by default. See @transform for details.

I didn't find the help I needed here; a good example would have kept me moving along with my work.

I'm not a computer scientist, and it's time-intensive to unravel how these excellent tools are implemented. I've invested time to apply Julia in my work instead of the Python ecosystem because of its great qualities; however, the most consistent hurdle for me is the lack of examples.

Perhaps the usage is obvious to package developers, but for more pedestrian types like me nothing could be more illuminating than a good example.

bkamins commented 2 years ago

We should improve docstrings, but in the meantime maybe this https://bkamins.github.io/julialang/2021/11/19/dfm.html would help you?

mahiki commented 2 years ago

Ah yes, these are good, thank you @bkamins. To be fair, the referenced @transform and @byrow docstrings are helpful; they just require more synthesis at an inconvenient time.

I guess the docstring for @rtransform could be like the following, should I submit a PR?

"""
    @rtransform(x, args...)

Row-wise version of @transform, i.e. all operations use @byrow by default. See @transform for details.

### Example
```jldoctest
julia> df = DataFrame(x=1:5, y=11:15)
5×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14
   5 │     5     15

julia> @rtransform(df, :a = 2 * :x, :b = :x * :y ^ 2)
5×4 DataFrame
 Row │ x      y      a      b
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1     11      2    121
   2 │     2     12      4    288
   3 │     3     13      6    507
   4 │     4     14      8    784
   5 │     5     15     10   1125
```
"""

mahiki commented 2 years ago

I should definitely submit a PR. This is something I run into all the time, and it would be fantastic to be in the habit of making contributions that people in the data engineering (DE) community would enjoy.

bkamins commented 2 years ago

Looks good. Thank you!

pdeffebach commented 2 years ago

Yes, this would be appreciated. But to clarify the problem some more, did you run ? @transform to look at the docstring for @transform after being pointed there by the docstring for @rtransform?

bkamins commented 2 years ago

Even if the @transform docstring is OK, I think it is worth improving @rtransform (and others).

mahiki commented 2 years ago

@pdeffebach I did ? @rtransform, and then saw the reference. I followed with @byrow and realized I had a lot of study to do before using this operation. Unfortunately I was working and didn't have an extra 1/2 hour and managed to solve my problem inelegantly.

So, yes I couldn't be bothered but at least it was for work reasons.

A little more about my workflow: I come from a SQL and Scala Spark background in my work, and a couple years ago I decided to 1) expunge all usage of excel and 2) incorporate julia into my work. This is swimming against the tide in a big way, since colleagues and the industry are fairly well completely locked into the python ecosystem.

I've had success in 2021 incorporating Julia at my job, developing workflows in production with containerized environments. It's really a pleasure, especially the DataFrames syntax but additionally the package management. It comes with a lot of up-front cost, like this process of learning DataFramesMeta commands, but it is totally worth it in my view.

It would have been so convenient to see an example right in the REPL docs, so I'll contribute by filling in those gaps where I hit them.

pdeffebach commented 2 years ago

Thanks for the background.

Yes, makes sense. No use making people play Zork for docs. Please submit a PR adding examples!

mahiki commented 2 years ago

PR created!

mahiki commented 2 years ago

I don't understand the cause of the doctest failure, I'll read up on this. Probably a bit of missing syntax?

mahiki commented 2 years ago

I'm very happy to see the REPL examples in there, I think they are more effective than googling, reading doc pages, etc because of the diverted attention.

Now that I've spent some time figuring things out, I made personal notes of the following equivalent DataFrame transformations. I needed column value assignments conditional on other columns in the same row. This shows how convenient and readable DataFramesMeta can be:

using DataFramesMeta, Dates

df = DataFrame(flag = [0, 1, 0, 1, 0, 1]
    , amt = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99]
    , qty = [1, 4, 1, 3, 21, 109]
    , item = ["B001", "B001", "B020", "B020", "BX00", "BX00"]
    , day = Date.(["2021-01-01", "2021-01-01", "2112-12-12", "2020-10-20", "2021-05-04", "1984-07-04"])
    )

6×5 DataFrame
 Row │ flag   amt      qty    item    day        
     │ Int64  Float64  Int64  String  Date       
─────┼───────────────────────────────────────────
   1 │     0    19.0       1  B001    2021-01-01
   2 │     1    11.0       4  B001    2021-01-01
   3 │     0    35.5       1  B020    2112-12-12
   4 │     1    32.5       3  B020    2020-10-20
   5 │     0     5.99     21  BX00    2021-05-04
   6 │     1     5.99    109  BX00    1984-07-04

# 1. DataFramesMeta: row-wise assignment syntax
@rtransform(df
    , :Tax = :flag * 0.11 * :amt
    , :Discount = :item == "B020" ? -0.25 * :amt : 0
    )
# 2. DataFrames: source columns => ByRow(function) => destination column
transform(df
    , [:flag, :amt] => ByRow((x,y) -> x * 0.11 * y) => :Tax
    , [:item, :amt] => ByRow((x,y) -> x == "B020" ? -0.25 * y :  0) => :Discount
    )
# 3. DataFrames: whole-column functions using broadcasting
transform(df
    , [:flag, :amt] => ((x,y) -> x * 0.11 .* y) => :Tax
    , [:item, :amt] => ((x,y) -> (x .== "B020") * -0.25 .* y ) => :Discount
    )

6×7 DataFrame
 Row │ flag   amt      qty    item    day         Tax      Discount 
     │ Int64  Float64  Int64  String  Date        Float64  Float64  
─────┼──────────────────────────────────────────────────────────────
   1 │     0    19.0       1  B001    2021-01-01   0.0       -0.0
   2 │     1    11.0       4  B001    2021-01-01   1.21      -0.0
   3 │     0    35.5       1  B020    2112-12-12   0.0       -8.875
   4 │     1    32.5       3  B020    2020-10-20   3.575     -8.125
   5 │     0     5.99     21  BX00    2021-05-04   0.0       -0.0
   6 │     1     5.99    109  BX00    1984-07-04   0.6589    -0.0

# OK I haven't figured out the broadcast operation with ternary operator, however the dfs pass `==` test.
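
On the broadcast question in the comment above: the ternary `? :` is control-flow syntax rather than a function, so it cannot take a broadcasting dot. A minimal sketch of one common alternative is to broadcast Base's `ifelse` function instead. The three-row `df` below is a shortened, illustrative stand-in for the table above, and `by_row` and `by_col` are names made up for this sketch.

```julia
using DataFrames

# Shortened, illustrative stand-in for the table above
df = DataFrame(item = ["B001", "B020", "BX00"], amt = [11.0, 32.5, 5.99])

# Row-wise: the ternary runs once per row inside ByRow
by_row = transform(df,
    [:item, :amt] => ByRow((x, y) -> x == "B020" ? -0.25 * y : 0.0) => :Discount)

# Whole-column: ifelse is an ordinary function, so it can be broadcast with a dot
by_col = transform(df,
    [:item, :amt] => ((x, y) -> ifelse.(x .== "B020", -0.25 .* y, 0.0)) => :Discount)

by_row == by_col
```

Unlike the ternary, `ifelse` evaluates both branches for every element, which is fine for cheap arithmetic like this but matters if one branch can throw or is expensive.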

I wonder if this example of comparative constructions would be useful in the DataFramesMeta documentation page? I really struggled to figure this out, but it looks so obvious now.

pdeffebach commented 2 years ago

This is mentioned in certain places. Check out the first code block here.

A PR on this section would be welcomed. I don't want to make the translations too prominent at the beginning because I don't want new users to get too intimidated. My ideal user is probably a first year masters student in the social sciences who is programming for the first time. It would be great to work on a PR for this in detail, but with those constraints in mind.

Additionally, remember MacroTools.@macroexpand, which is super useful for understanding DataFramesMeta.jl, albeit only for advanced users.

julia> using DataFramesMeta, Dates;

julia> df = DataFrame(flag = [0, 1, 0, 1, 0, 1]
           , amt = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99]
           , qty = [1, 4, 1, 3, 21, 109]
           , item = ["B001", "B001", "B020", "B020", "BX00", "BX00"]
           , day = Date.(["2021-01-01", "2021-01-01", "2112-12-12", "2020-10-20", "2021-05-04", "1984-07-04"])
           );

julia> @rtransform(df
           , :Tax = :flag * 0.11 * :amt
           , :Discount = :item == "B020" ? -0.25 * :amt : 0
           );

julia> using MacroTools

julia> MacroTools.@macroexpand(@rtransform(df
           , :Tax = :flag * 0.11 * :amt
           , :Discount = :item == "B020" ? -0.25 * :amt : 0
           )) |> MacroTools.prettify
:((DataFrames).transform(df, DataFramesMeta.make_source_concrete([:flag, :amt]) => (ByRow(((waterbuffalo, gaur)->waterbuffalo * 0.11 * gaur)) => :Tax), DataFramesMeta.make_source_concrete([:item, :amt]) => (ByRow(((cod, fish)->if cod == "B020"
                          -0.25 * fish
                      else
                          0
                      end)) => :Discount)))

mahiki commented 2 years ago

This is great. The more time I spend in Julia the better I like it.

I think what the DataFramesMeta docs pages are missing is a simple front page that shows clear examples of how easy the syntax is to formulate for common tasks. Also a clear message about the mission of the package.

The difficulty from the new user's perspective:

Here's the first sentence of the Introduction in the repo README:

Metaprogramming tools for DataFrames.jl objects to provide more convenient syntax.

As a non-expert user, especially not knowing much about meta-programming, this already looks too advanced for me.

I recommend something more immediately obvious by saying something like:

Simplifies column and row transformations with natural syntax in column and row value assignments. For example, compare these two equivalent formulations:

using DataFramesMeta

df = DataFrame(x=1:5, y=11:15)

# DataFramesMeta syntax via assignment
@rtransform(df, :y = :x == 1 ? true : false)

# DataFrames typical pairs selector syntax with the ByRow() helper and anonymous function
transform(df, :x => ByRow(x -> x == 1 ? true : false) => :y)

This would make it easy to see what the purpose of this package is, I think.

mahiki commented 2 years ago

Pushed commit for @rorderby and @rsubset examples.

pdeffebach commented 2 years ago

Thanks! I'll take a look later today.