easystats / datawizard

Magic potions to clean and transform your data 🧙
https://easystats.github.io/datawizard/
Other
212 stars 16 forks source link

Improve docs for data_to_wide #506

Closed strengejacke closed 4 months ago

strengejacke commented 4 months ago

Background: https://x.com/stephenjwild/status/1792294171924967920 And I recently wanted to use data_to_wide() and couldn't get it work at first, took some time to figure out the logic behind reshaping to wide again. Especially, the documentation of id_cols was confusing.

strengejacke commented 4 months ago

but to me those pivot_*() functions made pivoting/unpivoting so much easier to understand

True, but I think by even makes it even clearer. by refers across our easystats-packages to a variable/column that indicates "grouping". Reshaping to wide format commonly, but not in general, identifies "repeated measurements" by an "ID", or group. And id_cols suggests plural, although often only one column is selected (that was also one of the criticism on Twitter). Thus, id_cols is not even the best choice from a grammar perspective, too ;-)

etiennebacher commented 4 months ago

By @strengejacke in a review comment:

I personally prefer by, both because it's in line with renaming the other arguments across functions into by. Especially, since by refers to a "grouping" variable, and imho, by makes clearer that this specified variables is used for gathering those rows ("groups") that are spread as new columns.

However, since we already allow select (easystats-name) and cols (tidyverse-name), we could probably use both argument names here, too?

etiennebacher commented 4 months ago

However, since we already allow select (easystats-name) and cols (tidyverse-name), we could probably use both argument names here, too?

I think we should deprecate and then remove one of the two. Those pivot functions already have a lot of args so I suggest we avoid adding bloat with aliases.

Thus, id_cols is not even the best choice from a grammar perspective, too ;-)

I don't think there is a best choice for this, one could use id_col but it would suggest this must be a single column while it's not always the case. Just for comparison with other packages:

Considering a by argument would be quite unique (doesn't mean that other packages always make things right, but it's quite indicative).

Curious about what others think about renaming id_cols to by @easystats/core-team

strengejacke commented 4 months ago

We slice and reshape by the levels of that variable - it's still my personal favorite, and we use this argument name now aggressively consistent in our packages. I don't mind deviating from other packages, especially since users are free to use those other packages if they prefer that syntax.

I just realize that when you teach R to beginners, it's really good if you can repeat yourself: "use select for..." or "use by for...". Easier to understand and remember for newbies

strengejacke commented 4 months ago

bump @DominiqueMakowski @IndrajeetPatil @bwiernik @rempsyc @mattansb

DominiqueMakowski commented 4 months ago

I think I agree with @etiennebacher; id_cols' purpose was to eventually specify an index, but i don't think it is conceptually different from what by is expected to do

Sorry @strengejacke 😅 - but I'm not convinced it should be replaced or made an alias

bwiernik commented 4 months ago

I don't quite understand the use of by as an alias for id_cols. Could you talk about the similarity there?

strengejacke commented 4 months ago

but i don't think it is conceptually different from what by is expected to do

Yes, that's why I thought we could use by, because we use it in conceptually similar ways across other functions now.

But I'm ok with sticking to id_cols, though I don't see the need to use that name. The id_cols refers to a variable, whose "levels" is used for grouping/stratification. In every other instance we now use by, not here though.

I think "re-thinking" reshaping data in this way makes it rather easy to use, especially since it would be in line with our other functions.

strengejacke commented 4 months ago

Btw, for data_to_long(), we have select for selecting columns, but also its alias cols for compatibility with pivot_longer()

bwiernik commented 4 months ago

I think the conceptual similarity of by to id_cols is very tenuous. I suggest we drop it

strengejacke commented 4 months ago

Ok, if nobody likes my suggestion (😢), let's stick to id_cols then.

strengejacke commented 4 months ago

by:

exterminador-do