TidierOrg / Tidier.jl

Meta-package for data analysis in Julia, modeled after the R tidyverse.
MIT License

Purrr.jl or similar? #138

Closed: vituri closed this issue 1 month ago

vituri commented 5 months ago

I've been using TidierDB to query databases recently (this was the last feature that I was missing in Julia compared to R), and things are going great. Thanks for all the work on Tidier!

Now I am writing some pipelines in Julia using REST APIs, and I am dealing more with JSON files. I was wondering whether there will be a "purrr" analogue in Julia to manipulate lists (dicts? arrays? any sort of nested object). In R, I use functions like "list_flatten" and "modify_if" all the time to extract non-relational data from JSON and write it to a relational database, but they are very slow compared to more "loop-based" manipulations. In Julia we won't have this problem.
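
For illustration, a "loop-based" flattening of a nested JSON-like object might look like the sketch below. It uses only Base Julia; the helper name `flatten_record`, the `_` key separator, and the sample data are hypothetical choices for this example, not functions from purrr or from any Tidier package.

```julia
# Hypothetical helper (not part of any Tidier package): recursively flatten
# a nested Dict into a single-level Dict whose keys join the path with "_".
function flatten_record(d::AbstractDict, prefix::String = "")
    out = Dict{String,Any}()
    for (k, v) in d
        key = isempty(prefix) ? string(k) : string(prefix, "_", k)
        if v isa AbstractDict
            merge!(out, flatten_record(v, key))  # descend into nested objects
        else
            out[key] = v                         # leaf value becomes one column
        end
    end
    return out
end

# A record shaped like a parsed JSON response (made-up data).
record = Dict(
    "id" => 1,
    "user" => Dict("name" => "Ana", "address" => Dict("city" => "Recife")),
)

row = flatten_record(record)
```

Each flattened record has one level of keys (`id`, `user_name`, `user_address_city`), which is the shape a relational table row needs.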

I believe that many purrr functions will be """easy""" to translate to Julia without using macros, and I will be happy to contribute.

What do you think?

kdpsingh commented 5 months ago

Thanks for using Tidier and for sharing thoughts on the functionality from the purrr package.

Although I have worked a bit with ragged JSON-like data, I always have to look up how to do things in purrr mainly because I never firmed up my mental model for the full purrr package. Because Julia has performant loops, this is less of an issue (as you point out), but the purrr syntax is still much more concise.

We don't have anyone working on a purrr-equivalent in Tidier.jl. I'd be open to adding such a repo to the Tidier family.

If you decide to lead this, I would suggest starting a repo on your own GitHub account. When it has sufficient functionality that it's ready for a first release (doesn't have to be feature complete but should have enough functionality to be useful), let me know and I'll review. At that point, we would transfer the repo to TidierOrg (with you as a member) so that you could continue to lead work on it with others hopefully pitching in to help with documentation.

Keep us posted on whether you decide to take this on!

vituri commented 5 months ago

Hi, @kdpsingh !

Here is a first version of "JPurrr", the Julia version of purrr:

https://vituri.github.io/JPurrr.jl/dev/

I will write the examples later, but all functions are documented and trivial to use if you know the purrr package. It took me more time to understand Documenter than to write the package; I am used to Quarto/RMarkdown.

I implemented the majority of the map_*, imap_* and map2_* functions, along with modify and keep, but I will need to think more about list_flatten. As you know, in R named lists and dataframes are all lists, so we can always iterate as usual. But what does it mean in Julia to "flatten" a "list"? Maybe adding more methods to cover Julia dicts, arrays and dataframes in a nice way? I don't know yet.
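
As a rough point of comparison, the purrr families mentioned above have close counterparts in Base Julia. The sketch below uses only Base functions and does not show the package's actual API; the variable names are illustrative.

```julia
xs = [1, 2, 3]
ys = [10, 20, 30]

# map_*: apply a function to each element
doubled = map(x -> 2x, xs)

# map2_*: iterate over two collections in parallel
sums = map(+, xs, ys)

# imap_*: the function also receives the index of each element
labeled = map(((i, x),) -> "$i: $x", enumerate(xs))

# keep: retain only the elements that satisfy a predicate
odds = filter(isodd, xs)

# modify_if-like: transform only the elements that satisfy a predicate
modified = map(x -> isodd(x) ? -x : x, xs)
```

Flattening has no equally direct counterpart, because the result depends on whether the container is a dict, an array, or a tuple.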

There is a very nice cheat sheet here that we can try to mimic:

https://github.com/rstudio/cheatsheets/blob/main/purrr.pdf

kdpsingh commented 5 months ago

This is great! Amazing that you whipped this up so quickly. Haven't played with it yet but the docs look familiar. Excited to try it out soon when I get some downtime.

Can we modify the package name to fit the Tidier naming scheme? My goal has been to make the package names more obvious with respect to their purpose, and rooted in Julia.

How about one of these:

Or

Thoughts?

Once we finalize the name, I will have you transfer the repo to TidierOrg and will invite you to be a member of the org, if that is something you're okay with.

vituri commented 5 months ago

TidierMap sounds really good! According to the purrr docs, its goal is to provide "a complete and consistent set of tools for working with functions and vectors", and the word "map" conveys this idea. There are even some "adverbs" which are very useful: https://purrr.tidyverse.org/reference/index.html#adverbs .

Here is the new doc: https://vituri.github.io/TidierMap.jl/dev/

and the package address: https://github.com/vituri/TidierMap.jl

I'll be happy to help work on this package within your org!

rdboyes commented 5 months ago

It's hard to get a single name that makes sense for all of purrr, since it's more "three packages in a trenchcoat" than it is one single package, but I would suggest TidierFunctions.jl - the core functionality of purrr to me is the ability to work with functions: the various map and walk variants, quietly, safely, etc.

In the other TidierX names, the X is the thing we're working with (Plots, Data, etc.)

kdpsingh commented 5 months ago

Thanks @rdboyes. The main issue I have with the name TidierFunctions is that in R, the two main functions (map and walk) are intended to operate on R's lists (using functions). I agree that safely/possibly do work on functions, but safely/possibly are essentially there to make it possible to apply map/walk without a single bad iteration breaking the whole thing. So I still view map/walk as the main functionality.
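
To make the role of safely/possibly concrete, here is a minimal sketch of a `possibly`-style wrapper in Base Julia. The name `possibly` and its `otherwise` keyword mirror purrr's naming, but this is a hypothetical illustration, not code from TidierMap.

```julia
# Hypothetical analogue of purrr's `possibly`: wrap `f` so that any error
# returns `otherwise` instead of propagating to the caller.
function possibly(f; otherwise = missing)
    return function (args...)
        try
            f(args...)
        catch
            otherwise
        end
    end
end

inputs = ["1", "2", "oops", "4"]

# Without the wrapper, parse(Int, "oops") would throw and abort the whole map.
safe_parse = possibly(s -> parse(Int, s))
parsed = map(safe_parse, inputs)
```

The bad element becomes `missing` while the remaining elements are still processed.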

You could view map/walk as applying functions, but the lists are the star of the show.

Julia doesn't have a structure equivalent to R's list -- named lists in R are kind of like nested named tuples in Julia. But non-named lists could be represented by either tuples or arrays in Julia, or any nested collection. So any "ragged array"-type object is the object of interest for this package (like JSON objects). This is why TidierCollections feels a bit closer to the mark.
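
A small sketch of that correspondence, using only Base Julia and made-up data:

```julia
# A named R list, e.g. list(name = "Ana", address = list(city = "Recife")),
# could be modeled as a nested NamedTuple.
person = (name = "Ana", address = (city = "Recife", zip = "50000"))
city = person.address.city
field_names = keys(person)

# An unnamed, heterogeneous R list could be a Tuple or a Vector{Any}.
as_tuple = (1, "two", [3.0, 4.0])
as_vector = Any[1, "two", [3.0, 4.0]]

# A "ragged array": inner collections of different lengths.
ragged = [[1, 2, 3], [4], Int[]]
lengths = map(length, ragged)
```

All of these are iterable, so the same `map` call works on each of them, which is what makes "collections" a reasonable umbrella term.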

Agree that TidierMap doesn't quite do this concept justice but it's closer IMO than TidierFunctions.

Open to discussion and other suggestions. Happy to move this discussion to our Julia Slack channel.

kdpsingh commented 5 months ago

@vituri, would love to have you join the #tidier channel on Julia Slack (https://julialang.org/slack/). It's a great place for discussion on all things Tidier.