TidierOrg / TidierData.jl

Tidier data transformations in Julia, modeled after the dplyr/tidyr R packages.
MIT License

Implement R `tidylog` package style of output #13

Open kdpsingh opened 1 year ago

kdpsingh commented 1 year ago

R has a wonderful tidylog package that outputs a log of how an operation modified a dataframe (e.g., "filter: 300 rows were removed (10%) of the data, with 2,700 rows remaining.")

I would like to implement this capability. I don't think that using TableMetadataTools.jl is necessarily the approach I want to take because this metadata should be printed (using @info or println) but does not need to be permanently stored as part of the data frame.

This will probably be implemented either using @aside or simply by wrapping the DataFrames.jl functions with a tidylog function that captures the state of the data frame before and after the operation and prints out the difference.
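A minimal sketch of the wrapping idea, assuming DataFrames.jl is installed. The name `log_filter` and the message wording are hypothetical, chosen only for illustration; they are not part of TidierData.jl or tidylog.

```julia
using DataFrames
using Printf

# Hypothetical helper (not an existing API): run `f` on `df`, then log how the
# row count changed between the input and the result.
function log_filter(f, df::AbstractDataFrame)
    n_before = nrow(df)              # state before the operation
    result = f(df)                   # the wrapped DataFrames.jl call
    n_after = nrow(result)           # state after the operation
    removed = n_before - n_after
    pct = n_before == 0 ? 0.0 : 100 * removed / n_before
    @info @sprintf("filter: removed %d rows (%.0f%%), %d rows remaining",
                   removed, pct, n_after)
    return result
end

df = DataFrame(id = 1:10, value = 1:10)

# The do-block is passed as `f`; it keeps rows where value > 3.
kept = log_filter(df) do d
    filter(:value => >(3), d)
end
```

Nothing is stored on the data frame here: the message goes through Julia's logging system via `@info` and the result is returned unchanged.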

bkamins commented 1 year ago

Yes, if you do not want to store it in metadata, then it is easier to just do logging. However, you may want to consider logging the changes in metadata as an opt-in; some users might find it useful when doing lineage analysis.

kdpsingh commented 1 year ago

This is a great point. I may consider adding this later. In my mental model, the logging is tied to operations rather than data frames. For example, a join is a single operation and it's not clear that either data frame would "own" that metadata.

I may first implement this in a logging style and then think through the implications of storing some or all of the results as metadata.

bkamins commented 1 year ago

> a join is a single operation and it's not clear that either data frame would "own" that metadata.

I was thinking about it. The produced data frame "owns" the metadata as you need to know how it got created. Of course this is just food for thought for the future.
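One way the opt-in could look, as a rough sketch only: the function name `filter_with_lineage`, the `store` keyword, and the `"tidylog"` metadata key are all assumptions made for illustration. The only real API used is `metadata!` / `metadata` from DataFrames.jl.

```julia
using DataFrames

# Hypothetical opt-in: always log the message, and attach it to the *produced*
# data frame only when `store = true`.
function filter_with_lineage(pred, df::AbstractDataFrame; store::Bool = false)
    result = filter(pred, df)
    msg = "filter: $(nrow(df) - nrow(result)) rows removed, $(nrow(result)) rows remaining"
    @info msg
    # "tidylog" is an assumed key name; :note-style metadata is propagated by
    # most DataFrames.jl operations.
    store && metadata!(result, "tidylog", msg; style = :note)
    return result
end

df = DataFrame(x = 1:5)
out = filter_with_lineage(:x => >(2), df; store = true)
metadata(out, "tidylog")   # retrieves the stored message
```

A single key holds only the latest message in this sketch; recording a full lineage would need some convention for accumulating entries.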

kdpsingh commented 1 year ago

Confirmed that tidylog is MIT License: https://github.com/elbersb/tidylog/issues/61

Will aim for a mostly line-by-line translation of tidylog in R.

While we could consider autodetecting changes in the data frames (and treating all verbs the same), I think tidylog's approach of customizing the output for each verb feels more natural and is probably more efficient.
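To illustrate the difference between the two approaches, here is a small sketch assuming DataFrames.jl. The names `describe_change` and `verb_message` are hypothetical, as is the message wording; this is not how tidylog or TidierData.jl is implemented.

```julia
using DataFrames

# Generic approach: compare shapes only, with the same message for every verb.
describe_change(before, after) =
    "rows: $(nrow(before)) -> $(nrow(after)), columns: $(ncol(before)) -> $(ncol(after))"

# Per-verb approach: each verb computes and reports only what matters to it.
verb_message(::Val{:filter}, before, after) =
    "filter: removed $(nrow(before) - nrow(after)) rows, $(nrow(after)) rows remaining"

function verb_message(::Val{:select}, before, after)
    dropped = setdiff(names(before), names(after))
    return "select: dropped $(length(dropped)) column(s): $(join(dropped, ", "))"
end

df = DataFrame(a = 1:4, b = 5:8, c = 9:12)

@info verb_message(Val(:filter), df, filter(:a => >(2), df))
@info verb_message(Val(:select), df, select(df, :a, :b))
@info describe_change(df, select(df, :a, :b))
```

The per-verb methods can skip comparisons that are irrelevant to the operation (a `filter` never needs to diff column names), which is one plausible reading of "more efficient" above.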