elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License
1.12k stars 123 forks source link

v0.9 names_from no longer included means loss of valuable information #967

Closed blset closed 3 months ago

blset commented 3 months ago

Explorer.Series.pivot_wider/4 no longer includes the names_from column name in the new columns when values_from is a list of columns. This is more consistent with its behaviour when values_from is a single column.

this is a change in v 0.9 which breaks my code based on a regular pattern in column names after pivot wider

ok it's my code, but wouldn't have it been more consistent to do the reverse ?

add names_from when values_from is a single column also ?

because having names_from in the generated name adds valuable information on how the column was composed, keeping the pattern very regular, this can be very useful to compute a pretty header with merged columns computed thanks to the regular names...

at minimum would it be possible to leave as it was in v0.8 ?

josevalim commented 3 months ago

Hi @blset, I believe we are simply mirroring the defaults in Polars here, right @philss? So that's a discussion that could be taken up there. But perhaps there is something else you can do.

cigrainger commented 3 months ago

names_from is pretty essential IMO. If that option was removed from polars that's a pretty big blow and I would be very keen to rectify it. IIUC we're doing a fair bit of work in that function anyway to accommodate the tidyr-style API from polars's pivot. So yeah let's hear from @philss what the state is and why that was removed and then we can figure out what to do.

josevalim commented 3 months ago

To be clear, we do have names_fromstill, but it is not part of the name_prefix by default. @blset, can you please include code snippets and their results before and after, so we get everyone on the same page. I believe you can get the previous result by setting name_prefix explicitly.

blset commented 3 months ago
acc
+--------------------------------------------------------+
|       Explorer DataFrame: [rows: 17, columns: 5]       |
+----------+------------+-------+-------------+----------+
|    t3    |    lsb     | refus | entreprises | grouping |
| <string> |  <string>  | <s64> |    <s64>    |  <s64>   |
+==========+============+=======+=============+==========+
| [1] 0    |            | 0     | 3487        | 0        |
+----------+------------+-------+-------------+----------+
| [1] 0    | enquêteurs | 2     | 103         | 0        |
+----------+------------+-------+-------------+----------+
| [1] 0    | total lsb  | 2     | 3590        | 1        |
+----------+------------+-------+-------------+----------+
| [2] 1-10 |            | 0     | 16569       | 0        |
+----------+------------+-------+-------------+----------+
| [2] 1-10 | enquêteurs | 4     | 1098        | 0        |
+----------+------------+-------+-------------+----------+

indicateurs_names #=> ["refus", "entreprises", "grouping"]

# fix to cope with one / many indicator column

x = "lsb"
# first_indicateur below is a dummy indicator name in case n==1  to be consistent with case n >1
# where the name of the variable that goes in column is part of the new name (in 0.8)

  prefix =
   case length(indicateurs_names) do
     n when n > 1 -> "#{x}@"
     n when n == 1 -> "#{x}@#{first_indicateur}_#{x}_" 
   end

# pass x="lsb" in column

0.8

acc.names #=> ["t3", "lsb", "refus", "entreprises", "grouping"]
acc = DF.pivot_wider(acc, x, indicateurs_names, names_prefix: prefix)

acc.names #=> ["t3", "lsb@refus_lsb_manquant", "lsb@refus_lsb_enquêteurs",
 "lsb@refus_lsb_total#lsb", "lsb@refus_lsb_entreprises",
 "lsb@entreprises_lsb_manquant", "lsb@entreprises_lsb_enquêteurs",
 "lsb@entreprises_lsb_total#lsb", "lsb@entreprises_lsb_entreprises",
 "lsb@grouping_lsb_manquant", "lsb@grouping_lsb_enquêteurs",
 "lsb@grouping_lsb_total#lsb", "lsb@grouping_lsb_entreprises"]

0.9.1
acc.names #=> ["t3", "lsb", "refus", "entreprises", "grouping"]
acc = DF.pivot_wider(acc, x, indicateurs_names, names_prefix: prefix)

acc.names #=> ["t3", "lsb@refus_manquant", "lsb@refus_enquêteurs", "lsb@refus_total#lsb",
 "lsb@refus_entreprises", "lsb@entreprises_manquant",
 "lsb@entreprises_enquêteurs", "lsb@entreprises_total#lsb",
 "lsb@entreprises_entreprises", "lsb@grouping_manquant",
 "lsb@grouping_enquêteurs", "lsb@grouping_total#lsb",
 "lsb@grouping_entreprises"]

the prefix lsb_ which was added automatically in case many indicator columns and manually in this code in case on indicator column, is now in 0.9 absent in the many indicator case the fix is probabely not difficult in this code but it's annoying to no to rely on not informative naming anyway, if you ant to further compute a table layout with merged columns based on names

blset commented 3 months ago

see also https://github.com/elixir-explorer/explorer/issues/838

josevalim commented 3 months ago

From my understanding, the behaviour you are seeing is consistent with pivot_wider, which is generally our guide for these kind of decisions. Or am I missing something?

If that's the case, it is unlikely we will change it.

blset commented 3 months ago

current polar implementation does not follow tidy R pivot_wider and pivot_wider is less complete than polars for what concerns completeness of generated names

the ideal totally non ambiguous pattern is prefix name of variable in column modality of the variable in column_ name of value with prefix an optional global prefix, and name of value maybe just after prefix instead of at the end

tidy R pivot wider does not show the name of the variable in column, only its modalities polar is more precise but not in the one value case

data2 =
  data.frame(
    weekday=c(
      "Monday",
      "Tuesday",
      "Wednesday",
      "Thursday",
      "Friday",
      "Monday",
      "Tuesday",
      "Wednesday",
      "Thursday",
      "Friday"
    ),
    team= c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A"),
    hour= c(10, 9, 10, 10, 11, 15, 14, 16, 14, 16),
    score= c(0, 9, 0, 0, 1, 5, 4, 6, 4, 6),
    league = c("L", "L", "L", "L", "L", "L", "L", "L", "L", "L")
  )

one value case (hour), no name of weekday in generated column names !!

pivot_wider(data2, names_from="weekday", values_from = c("hour"))

team  score league Monday Tuesday Wednesday Thursday Friday
  <chr> <dbl> <chr>   <dbl>   <dbl>     <dbl>    <dbl>  <dbl>
1 A         0 L          10      NA        NA       10     NA
2 B         9 L          NA       9        NA       NA     NA

in polars weekday@Monday,| weekday@Tuesday missing hour somewhere

many values case (hour, score), names generated completely

pivot_wider(data2, names_from="weekday", values_from = c("hour", "score")) |> names()

 [1] "team"            "league"          "hour_Monday"     "hour_Tuesday"   
 [5] "hour_Wednesday"  "hour_Thursday"   "hour_Friday"     "score_Monday"   
 [9] "score_Tuesday"   "score_Wednesday" "score_Thursday"  "score_Friday"   

in polar hour_weekday_Monday, score_weekday_Monday, etc...

many values (hour, score) and one modality for variable in column (league), names generated completely


pivot_wider(data2, names_from="league", values_from = c("hour", "score")) |> names()
[1] "weekday" "team"    "hour_L"  "score_L"

in polar hour_weekday_Monday etc..

one value (hour) , and one modality (league), names not generated completely !!


pivot_wider(data2, names_from="league", values_from = c("hour")) |> names()
[1] "weekday" "team"    "score"   "L"      

in polar league@L

josevalim commented 3 months ago

current polar implementation does not follow tidy R pivot_wider

Yes, that's what I am saying. We tend to expose tidy/dplyr semantics on the Explorer.DataFrame side of things, and your examples above reflect that. Especially because we see Polars as one of the possible backends.

I understand that you would prefer to have Polars semantics, but that's not our goal for now, so I think we can close this.

In any case, I really appreciate your time and extensive explanation here, I have definitely learned new things.

josevalim commented 3 months ago

Something we could also do is to provide other configuration values, such as names_joiner/names_sep, which you could use to attach the column name. But as per the default behaviour, this matches our expectations!

blset commented 3 months ago

1/ one useful configuration is then to be able to prefix the modality with the name of the variable having this modality

so with names_from VAR1 (modalities A and B) and VAR2 (modalities A B C with, gosh!, same names happening with no relation between VAR1 and VAR2)

and values_from HOUR and SCORE

without prefixing the modality (name clash)


HOUR_A HOUR_B HOUR_A HOUR_B HOUR_C 

SCORE_A SCORE_B SCORE_C SCORE_B SCORE_C 

with prefixing the modality, no name clash ever possible


HOUR_VAR1_A HOUR_VAR1_B HOUR_VAR2_A HOUR_VAR2_B HOUR_VAR2_C

SCORE_VAR1_A SCORE_VAR1_B SCORE_VAR2_A SCORE_VAR2_B SCORE_VAR2_C

2/ the other useful configuration is to be able to keep the name of the value_from variable even if it is alone

HOUR_A HOUR_B instead of A and B alone

if HOUR is the only value_from

why is it useful ?

maybe the names are disambiguated automatically in case of repeated names using suffix 1 2 3 etc... but this is not usable programmatically

suppose you you need to know what are the column positions of all the columns coming from modalities of VAR1 you are stuck without prefixing the modalities of VAR1 with VAR1

moreover if the name of the value_from variable are mentioned in the generated names in pivot_wider it does mean you need to know where it is in the result,

well, same for names_from, but mentioning only the modality is incomplete, without prefixing the modality with its original variable name

Polars is maybe one possible backend but it is more precise on this point and R pivot_wider as Elixir explorer should get inspired in my opinion.

and the recent inlining of Polars pivot_wider as in the case of one value_from that happened in 0.9 is objectively a loss of information

thanks