Closed blset closed 3 months ago
Hi @blset, I believe we are simply mirroring the defaults in Polars here, right @philss? So that's a discussion that could be taken up there. But perhaps there is something else you can do.
names_from
is pretty essential IMO. If that option was removed from polars
that's a pretty big blow and I would be very keen to rectify it. IIUC we're doing a fair bit of work in that function anyway to accommodate the tidyr
-style API from polars
's pivot
. So yeah let's hear from @philss what the state is and why that was removed and then we can figure out what to do.
To be clear, we do have names_from
still, but it is not part of the name_prefix
by default. @blset, can you please include code snippets and their results before and after, so we get everyone on the same page. I believe you can get the previous result by setting name_prefix
explicitly.
acc
+--------------------------------------------------------+
| Explorer DataFrame: [rows: 17, columns: 5] |
+----------+------------+-------+-------------+----------+
| t3 | lsb | refus | entreprises | grouping |
| <string> | <string> | <s64> | <s64> | <s64> |
+==========+============+=======+=============+==========+
| [1] 0 | | 0 | 3487 | 0 |
+----------+------------+-------+-------------+----------+
| [1] 0 | enquêteurs | 2 | 103 | 0 |
+----------+------------+-------+-------------+----------+
| [1] 0 | total lsb | 2 | 3590 | 1 |
+----------+------------+-------+-------------+----------+
| [2] 1-10 | | 0 | 16569 | 0 |
+----------+------------+-------+-------------+----------+
| [2] 1-10 | enquêteurs | 4 | 1098 | 0 |
+----------+------------+-------+-------------+----------+
indicateurs_names #=> ["refus", "entreprises", "grouping"]
# fix to cope with one / many indicator column
x = "lsb"
# first_indicateur below is a dummy indicator name in case n==1 to be consistent with case n >1
# where the name of the variable that goes in column is part of the new name (in 0.8)
prefix =
case length(indicateurs_names) do
n when n > 1 -> "#{x}@"
n when n == 1 -> "#{x}@#{first_indicateur}_#{x}_"
end
# pass x="lsb" in column
0.8
acc.names #=> ["t3", "lsb", "refus", "entreprises", "grouping"]
acc = DF.pivot_wider(acc, x, indicateurs_names, names_prefix: prefix)
acc.names #=> ["t3", "lsb@refus_lsb_manquant", "lsb@refus_lsb_enquêteurs",
"lsb@refus_lsb_total#lsb", "lsb@refus_lsb_entreprises",
"lsb@entreprises_lsb_manquant", "lsb@entreprises_lsb_enquêteurs",
"lsb@entreprises_lsb_total#lsb", "lsb@entreprises_lsb_entreprises",
"lsb@grouping_lsb_manquant", "lsb@grouping_lsb_enquêteurs",
"lsb@grouping_lsb_total#lsb", "lsb@grouping_lsb_entreprises"]
0.9.1
acc.names #=> ["t3", "lsb", "refus", "entreprises", "grouping"]
acc = DF.pivot_wider(acc, x, indicateurs_names, names_prefix: prefix)
acc.names #=> ["t3", "lsb@refus_manquant", "lsb@refus_enquêteurs", "lsb@refus_total#lsb",
"lsb@refus_entreprises", "lsb@entreprises_manquant",
"lsb@entreprises_enquêteurs", "lsb@entreprises_total#lsb",
"lsb@entreprises_entreprises", "lsb@grouping_manquant",
"lsb@grouping_enquêteurs", "lsb@grouping_total#lsb",
"lsb@grouping_entreprises"]
the prefix lsb_ which was added automatically in case many indicator columns and manually in this code in case on indicator column, is now in 0.9 absent in the many indicator case the fix is probabely not difficult in this code but it's annoying to no to rely on not informative naming anyway, if you ant to further compute a table layout with merged columns based on names
From my understanding, the behaviour you are seeing is consistent with pivot_wider, which is generally our guide for these kind of decisions. Or am I missing something?
If that's the case, it is unlikely we will change it.
current polar implementation does not follow tidy R pivot_wider and pivot_wider is less complete than polars for what concerns completeness of generated names
the ideal totally non ambiguous pattern is prefix name of variable in column modality of the variable in column_ name of value with prefix an optional global prefix, and name of value maybe just after prefix instead of at the end
tidy R pivot wider does not show the name of the variable in column, only its modalities polar is more precise but not in the one value case
data2 =
data.frame(
weekday=c(
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday"
),
team= c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A"),
hour= c(10, 9, 10, 10, 11, 15, 14, 16, 14, 16),
score= c(0, 9, 0, 0, 1, 5, 4, 6, 4, 6),
league = c("L", "L", "L", "L", "L", "L", "L", "L", "L", "L")
)
one value case (hour), no name of weekday in generated column names !!
pivot_wider(data2, names_from="weekday", values_from = c("hour"))
team score league Monday Tuesday Wednesday Thursday Friday
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0 L 10 NA NA 10 NA
2 B 9 L NA 9 NA NA NA
in polars weekday@Monday,| weekday@Tuesday missing hour somewhere
many values case (hour, score), names generated completely
pivot_wider(data2, names_from="weekday", values_from = c("hour", "score")) |> names()
[1] "team" "league" "hour_Monday" "hour_Tuesday"
[5] "hour_Wednesday" "hour_Thursday" "hour_Friday" "score_Monday"
[9] "score_Tuesday" "score_Wednesday" "score_Thursday" "score_Friday"
in polar hour_weekday_Monday, score_weekday_Monday, etc...
many values (hour, score) and one modality for variable in column (league), names generated completely
pivot_wider(data2, names_from="league", values_from = c("hour", "score")) |> names()
[1] "weekday" "team" "hour_L" "score_L"
in polar hour_weekday_Monday etc..
one value (hour) , and one modality (league), names not generated completely !!
pivot_wider(data2, names_from="league", values_from = c("hour")) |> names()
[1] "weekday" "team" "score" "L"
in polar league@L
current polar implementation does not follow tidy R pivot_wider
Yes, that's what I am saying. We tend to expose tidy/dplyr semantics on the Explorer.DataFrame side of things, and your examples above reflect that. Especially because we see Polars as one of the possible backends.
I understand that you would prefer to have Polars semantics, but that's not our goal for now, so I think we can close this.
In any case, I really appreciate your time and extensive explanation here, I have definitely learned new things.
Something we could also do is to provide other configuration values, such as names_joiner/names_sep, which you could use to attach the column name. But as per the default behaviour, this matches our expectations!
1/ one useful configuration is then to be able to prefix the modality with the name of the variable having this modality
so with names_from VAR1 (modalities A and B) and VAR2 (modalities A B C with, gosh!, same names happening with no relation between VAR1 and VAR2)
and values_from HOUR and SCORE
without prefixing the modality (name clash)
HOUR_A HOUR_B HOUR_A HOUR_B HOUR_C
SCORE_A SCORE_B SCORE_C SCORE_B SCORE_C
with prefixing the modality, no name clash ever possible
HOUR_VAR1_A HOUR_VAR1_B HOUR_VAR2_A HOUR_VAR2_B HOUR_VAR2_C
SCORE_VAR1_A SCORE_VAR1_B SCORE_VAR2_A SCORE_VAR2_B SCORE_VAR2_C
2/ the other useful configuration is to be able to keep the name of the value_from variable even if it is alone
HOUR_A HOUR_B instead of A and B alone
if HOUR is the only value_from
why is it useful ?
maybe the names are disambiguated automatically in case of repeated names using suffix 1 2 3 etc... but this is not usable programmatically
suppose you you need to know what are the column positions of all the columns coming from modalities of VAR1 you are stuck without prefixing the modalities of VAR1 with VAR1
moreover if the name of the value_from variable are mentioned in the generated names in pivot_wider it does mean you need to know where it is in the result,
well, same for names_from, but mentioning only the modality is incomplete, without prefixing the modality with its original variable name
Polars is maybe one possible backend but it is more precise on this point and R pivot_wider as Elixir explorer should get inspired in my opinion.
and the recent inlining of Polars pivot_wider as in the case of one value_from that happened in 0.9 is objectively a loss of information
thanks
this is a change in v 0.9 which breaks my code based on a regular pattern in column names after pivot wider
ok it's my code, but wouldn't have it been more consistent to do the reverse ?
add names_from when values_from is a single column also ?
because having names_from in the generated name adds valuable information on how the column was composed, keeping the pattern very regular, this can be very useful to compute a pretty header with merged columns computed thanks to the regular names...
at minimum would it be possible to leave as it was in v0.8 ?