Closed strengejacke closed 4 months ago
but to me those
pivot_*()
functions made pivoting/unpivoting so much easier to understand
True, but I think by
even makes it even clearer. by
refers across our easystats-packages to a variable/column that indicates "grouping". Reshaping to wide format commonly, but not in general, identifies "repeated measurements" by an "ID", or group. And id_cols
suggests plural, although often only one column is selected (that was also one of the criticism on Twitter). Thus, id_cols
is not even the best choice from a grammar perspective, too ;-)
By @strengejacke in a review comment:
I personally prefer by, both because it's in line with renaming the other arguments across functions into by. Especially, since by refers to a "grouping" variable, and imho, by makes clearer that this specified variables is used for gathering those rows ("groups") that are spread as new columns.
However, since we already allow select (easystats-name) and cols (tidyverse-name), we could probably use both argument names here, too?
However, since we already allow select (easystats-name) and cols (tidyverse-name), we could probably use both argument names here, too?
I think we should deprecate and then remove one of the two. Those pivot functions already have a lot of args so I suggest we avoid adding bloat with aliases.
Thus, id_cols is not even the best choice from a grammar perspective, too ;-)
I don't think there is a best choice for this, one could use id_col
but it would suggest this must be a single column while it's not always the case. Just for comparison with other packages:
base::reshape()
: idvardata.table::melt()
: id.varspandas
and polars
: indexConsidering a by
argument would be quite unique (doesn't mean that other packages always make things right, but it's quite indicative).
Curious about what others think about renaming id_cols
to by
@easystats/core-team
We slice and reshape by
the levels of that variable - it's still my personal favorite, and we use this argument name now aggressively consistent in our packages. I don't mind deviating from other packages, especially since users are free to use those other packages if they prefer that syntax.
I just realize that when you teach R to beginners, it's really good if you can repeat yourself: "use select for..." or "use by for...". Easier to understand and remember for newbies
bump @DominiqueMakowski @IndrajeetPatil @bwiernik @rempsyc @mattansb
I think I agree with @etiennebacher; id_cols' purpose was to eventually specify an index, but i don't think it is conceptually different from what by
is expected to do
Sorry @strengejacke 😅 - but I'm not convinced it should be replaced or made an alias
I don't quite understand the use of by
as an alias for id_cols
. Could you talk about the similarity there?
but i don't think it is conceptually different from what
by
is expected to do
Yes, that's why I thought we could use by
, because we use it in conceptually similar ways across other functions now.
But I'm ok with sticking to id_cols
, though I don't see the need to use that name. The id_cols
refers to a variable, whose "levels" is used for grouping/stratification. In every other instance we now use by
, not here though.
I think "re-thinking" reshaping data in this way makes it rather easy to use, especially since it would be in line with our other functions.
Btw, for data_to_long()
, we have select
for selecting columns, but also its alias cols
for compatibility with pivot_longer()
I think the conceptual similarity of by
to id_cols
is very tenuous. I suggest we drop it
Ok, if nobody likes my suggestion (😢), let's stick to id_cols
then.
by
:
Background: https://x.com/stephenjwild/status/1792294171924967920 And I recently wanted to use
data_to_wide()
and couldn't get it work at first, took some time to figure out the logic behind reshaping to wide again. Especially, the documentation ofid_cols
was confusing.