Closed blset closed 3 months ago
Hi @blset,
I feel like what the names_prefix
option is doing is reasonable. In the case where pivot_wider
receives a single column for the values_from
argument, there is no need to include a reference to the original column name in the new column names by default. It's only when you pass a list of columns for values_from
that the new column names need to be disambiguated.
I also didn't follow what you meant by this:
this is a problem when you want to dynamically manipulate the names afterwards, for instance to generate a header with span.
If there's a specific issue you're having, could you post an example of that?
In the case pivot_wider receives a single column for the values_from, there is indeed no need to include a reference to the original column name, for what concerns the naming of columns without any ambiguity.
But for what concerns semantics, you loose a name, and you don't formally know anymore what is inside the table. In my example d2 above, you don't know that the table contains hours.
the same for a variable with only one modality that gets in column.
I'm using pivot_wider to build a dynamic table constructor from dynamic sql. the sql computes everything with regular group by, then with pivot_wider i can afterwards position the group variables either in row (where there are by default) or in column.
then for the html rendering of the table, when there is one or more "in column" variables, I need to compute the span for each floor of the hierarchy and other things, I need to reorder the columns, etc.. and for that I need the naming to be consistent, because it is then a simple function of the names of the dataframe.
the names after pivot_wider are consistent unless there is only one value_for or if there is a variable in column with one modality only (that disappears from the naming also in some circumstances). All these edge cases highly complicate post processing the names of the table
it would be so much easier if the naming was consistent.
And if you think about it, with two values_from, ambiguity could be removed by a simple 1 and 2 suffix. I f you use the names, that is because it helps remember what's inside the table. Well it's the same with one value_from, we need a clue for what's in the table that is encoded in the name.
thanks
I understand what you mean by losing information. But I feel like no prefix is a reasonable default for the single column case since you can always use the names_prefix
option if you want. That's what it's there for:
require Explorer.DataFrame, as: DF
df = DF.new(
weekday: ["Mon", "Tue", "Wed", "Thu", "Fri", "Mon", "Tue", "Wed", "Thu", "Fri"],
team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16],
score: [0, 9, 0, 0, 1, 5, 4, 6, 4, 6],
league: ["L", "L", "L", "L", "L", "L", "L", "L", "L", "L"]
)
Without names_prefix
:
df |> DF.pivot_wider("weekday", "hour") |> DF.print(limit: :infinity)
# +---------------------------------------------------------------------+
# | Explorer DataFrame: [rows: 9, columns: 8] |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# | team | score | league | Mon | Tue | Wed | Thu | Fri |
# | <string> | <s64> | <string> | <s64> | <s64> | <s64> | <s64> | <s64> |
# +==========+=======+==========+=======+=======+=======+=======+=======+
# | A | 0 | L | 10 | | | 10 | |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# | B | 9 | L | | 9 | | | |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# | C | 0 | L | | | 10 | | |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# | B | 1 | L | | | | | 11 |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# | C | 5 | L | 15 | | | | |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# | A | 4 | L | | 14 | | | |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# | B | 6 | L | | | 16 | | |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# | C | 4 | L | | | | 14 | |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# | A | 6 | L | | | | | 16 |
# +----------+-------+----------+-------+-------+-------+-------+-------+
vs. with names_prefix
df |> DF.pivot_wider("weekday", "hour", names_prefix: "hour_") |> DF.print(limit: :infinity)
# +------------------------------------------------------------------------------------+
# | Explorer DataFrame: [rows: 9, columns: 8] |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# | team | score | league | hour_Mon | hour_Tue | hour_Wed | hour_Thu | hour_Fri |
# | <string> | <s64> | <string> | <s64> | <s64> | <s64> | <s64> | <s64> |
# +==========+=======+==========+==========+==========+==========+==========+==========+
# | A | 0 | L | 10 | | | 10 | |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# | B | 9 | L | | 9 | | | |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# | C | 0 | L | | | 10 | | |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# | B | 1 | L | | | | | 11 |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# | C | 5 | L | 15 | | | | |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# | A | 4 | L | | 14 | | | |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# | B | 6 | L | | | 16 | | |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# | C | 4 | L | | | | 14 | |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# | A | 6 | L | | | | | 16 |
# +----------+-------+----------+----------+----------+----------+----------+----------+
I'm not totally opposed to a default that works like:
DF.pivot_wider(df, "a", "b", prefix_names: "a_")
# or
DF.pivot_wider(df, "a", "b", prefix_names: "a_b_")
# or
DF.pivot_wider(df, "a", "b", prefix_names: "b_when_a_is_")
Since, if present, you could always achieve the current default by doing:
DF.pivot_wider(df, "a", "b", prefix_names: "")
But this behavior doesn't seem to be the norm across other libraries.
names_prefix is ok, as it is, the problem happens after the names_prefix fragment.
the desired pattern is well visible when you iterate several pivot_wider in a row, at stage 1
weekday@Monday
should be weekday@hour_weekday_Monday
because that is the pattern for two successive pivot_wider in a row, fully revealed starting from 2 values_from
league@weekday@hour_weekday_Monday_league_L
league@weekday@score_weekday_Monday_league_L
the hour_weekday_value pattern, that is to say
originalvalueFromName_originalColumnName_originalColumnNameValue1
originalvalueFromName_originalColumnName_originalColumnNameValue2
etc..
does not seem to be in elixir, only the name prefix seems to be which is league@weekday@
in my examples
weekday@Monday
should beweekday@hour_weekday_Monday
If you really need them to match, I think you can do this:
DF.pivot_wider(df, "weekday", "hour", prefix_names: "weekday@hour_weekday_")
Should be straightforward to do dynamically as well:
a = "weekday"
b = "hour"
DF.pivot_wider(df, a, b, prefix_names: "#{a}@#{b}_#{a}_")
I cannot do it manually because it is already done automatically, but not consistently... moreover I don't know the values that are pulled from the row var and in what order they are used for the new names of columns
forget about the prefix compare the names generated when there are at least two vars that go in column and only one compare the names when there are at least two values_from and only one compare the names when the variable that goes in column has at least two modalities and only one
and you will see that it is not consistent and not easely predictable the simplest approach would be to generate the names consistently but it's maybe in the polar implementation
here is a resume without any resort to names_prefix which is not important for the consistency
in short to get consistent names, there is only one requirement: when values_from is only one column, put the name of the values_from column in the resultant column name (even if not necessary to remove ambiguity) and put the name of the variable column as prefix to variable value all the time
valueFromName_columnName_columnValue etc..
this requirement is automatic when there is at least two values from columns. Ok the end programmer could detect the one value column and tweak the names but it is tricky when the variable in column has only one modality and you don't know it and you iterate further other variables in column
so for sanity it would be better to fix it from the source the pivot_wider names generating algorithm
here are some illustrations
d1 original table
note that league has only one modality
+-----------------------------------------------------------+
| Explorer DataFrame: [rows: 2, columns: 6] |
+----------+----------+-------+-------+----------+----------+
| weekday | team | hour | score | league | id |
| <string> | <string> | <s64> | <s64> | <string> | <string> |
+==========+==========+=======+=======+==========+==========+
| Monday | A | 10 | 0 | L | row1 |
+----------+----------+-------+-------+----------+----------+
| Tuesday | B | 9 | 9 | L | row2 |
+----------+----------+-------+-------+----------+----------+
d2 = Explorer.DataFrame.pivot_wider(d1, "weekday", "hour")
d2 weekday in column / hour: one variable 2 modalities 1 value
here the names generated should be hour_weekday_Monday hour_weekday_Tuesday for consistency
+ -----------------------------------------------------------+
| Explorer DataFrame: [rows: 2, columns: 6] |
+----------+-------+----------+----------+--------+---------+
| team | score | league | id | Monday | Tuesday |
| <string> | <s64> | <string> | <string> | <s64> | <s64> |
+==========+=======+==========+==========+========+=========+
| A | 0 | L | row1 | 10 | |
+----------+-------+----------+----------+--------+---------+
| B | 9 | L | row2 | | 9 |
+----------+-------+----------+----------+--------+---------+
d3 league in column/ hour: one variable 1 modality 1 value
d3 = Explorer.DataFrame.pivot_wider(d1, "league", "hour")
same as above hour_league_L
so the problem is not in the number of modalities of the variable (same pattern as with weekday)
+------------------------------------------------+
| Explorer DataFrame: [rows: 2, columns: 5] |
+----------+----------+-------+----------+-------+
| weekday | team | score | id | L |
| <string> | <string> | <s64> | <string> | <s64> |
+==========+==========+=======+==========+=======+
| Monday | A | 0 | row1 | 10 |
+----------+----------+-------+----------+-------+
| Tuesday | B | 9 | row2 | 9 |
+----------+----------+-------+----------+-------+
d4 weakday in column / hour score: one variable 2 modalities 2 values
d4 = Explorer.DataFrame.pivot_wider(d1, "weekday", ["hour", "score"])
desired consistency, as soon as there are more than one value column
+----------------------------------------------------------------------------------------------------------------------------+
| Explorer DataFrame: [rows: 2, columns: 7] |
+----------+----------+----------+---------------------+----------------------+----------------------+-----------------------+
| team | league | id | hour_weekday_Monday | hour_weekday_Tuesday | score_weekday_Monday | score_weekday_Tuesday |
| <string> | <string> | <string> | <s64> | <s64> | <s64> | <s64> |
+==========+==========+==========+=====================+======================+======================+=======================+
| A | L | row1 | 10 | | 0 | |
+----------+----------+----------+---------------------+----------------------+----------------------+-----------------------+
| B | L | row2 | | 9 | | 9 |
+----------+----------+----------+---------------------+----------------------+----------------------+-----------------------+
d4b league in column / hour score: one variable 2 modalities 2 values
d4b = Explorer.DataFrame.pivot_wider(d1, "league", ["hour", "score"])
desired consistency, as soon as there are more than one value column
+-----------------------------------------------------------------+
| Explorer DataFrame: [rows: 2, columns: 5] |
+----------+----------+----------+---------------+----------------+
| weekday | team | id | hour_league_L | score_league_L |
| <string> | <string> | <string> | <s64> | <s64> |
+==========+==========+==========+===============+================+
| Monday | A | row1 | 10 | 0 |
+----------+----------+----------+---------------+----------------+
| Tuesday | B | row2 | 9 | 9 |
+----------+----------+----------+---------------+----------------+
Now what happens when you iterate pivot_wider
d5 one variable 1 modality, then another variable 2 modalities, 2 values
d5 = Explorer.DataFrame.pivot_wider(d3, "weekday", ["L"])
should be hour_league_L from d3 then hour_league_L_weekday_Monday etc...
+------------------------------------------------+
| Explorer DataFrame: [rows: 2, columns: 5] |
+----------+-------+----------+--------+---------+
| team | score | id | Monday | Tuesday |
| <string> | <s64> | <string> | <s64> | <s64> |
+==========+=======+==========+========+=========+
| A | 0 | row1 | 10 | |
+----------+-------+----------+--------+---------+
| B | 9 | row2 | | 9 |
+----------+-------+----------+--------+---------+
d6 one variable 2 modalities, then another variable 1 modality, many values
values_from = (d4.names -- d1.names) -- ["weekday"]
value_from = ["hour_weekday_Monday", "hour_weekday_Tuesday", "score_weekday_Monday",
"score_weekday_Tuesday"]
d6 = Explorer.DataFrame.pivot_wider(d4, "league", values_from)
values_from ok from d4 so it's ok because many values_from
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
| Explorer DataFrame: [rows: 2, columns: 6] |
+----------+----------+------------------------------+-------------------------------+-------------------------------+--------------------------------+
| team | id | hour_weekday_Monday_league_L | hour_weekday_Tuesday_league_L | score_weekday_Monday_league_L | score_weekday_Tuesday_league_L |
| <string> | <string> | <s64> | <s64> | <s64> | <s64> |
+==========+==========+==============================+===============================+===============================+================================+
| A | row1 | 10 | | 0 | |
+----------+----------+------------------------------+-------------------------------+-------------------------------+--------------------------------+
| B | row2 | | 9 | | 9 |
+----------+----------+------------------------------+-------------------------------+-------------------------------+--------------------------------+
I think I better understand what you're saying. Your use case is that you have to infer the original column from the new column names?
That seems like it would be a little error prone. However, I think the team would consider a PR that changes what names_prefix
defaults to so that it's consistent for the single and multi-column case.
And I apologize but I still don't follow the part about what you think should happen when you call pivot_wider
multiple times. But perhaps some tests in the PR would help clarify.
Sorry for not being clear enough
when values from contains more than one variable (eg hour and score), all is ok, in the names generating process for weekday in column
hour_weekday_Monday
score_weekday_Monday
, etc...
what would be more consistent in my opinion is that even with a single variable in values_from you would have
hour_league_L
instead of only L
for a league variable with a single modality L and values_from hour
to pass several variables in column, you iterate, and put all new columns from previous step in the values_from list, which concatenates variable names
so starting from this
+-----------------------------------------------------------+
| Explorer DataFrame: [rows: 2, columns: 6] |
+----------+----------+-------+-------+----------+----------+
| weekday | team | hour | score | league | id |
| <string> | <string> | <s64> | <s64> | <string> | <string> |
+==========+==========+=======+=======+==========+==========+
| Monday | A | 10 | 0 | L | row1 |
+----------+----------+-------+-------+----------+----------+
| Tuesday | B | 9 | 9 | L | row2 |
+----------+----------+-------+-------+----------+----------+
league in column with values from hour would give names
weekday
team
score
id
hour_league_L
(instead of only L because there is only one variable)
then iteration with weekday in column with values from hour_league_L
would give names
team
score
id
hour_League_L_weekday_Monday
hour_League_L_weekday_Tuesday
(instead of only weekday_Monday ``weekday_Tuesday
currently because there is only one values_from variable)
you see that in the iteration process, if for some reason at one stage you get only a single values_from (because you ask a single one or because the previous step gave only one through a one modality variable) currently the names will be missing information.
what's more insidious, if you start with a two or more modality variable , since this gives many values_from variables, you will not get any problem if further in the iteration you get a one modality variable, because you already carry several values_from variables and it cannot decrease.
But if you start with a one modality variable, you get stucked unless you also start with many values_from variables.
the remedy for all this is:
generate the name indicator_colvar_colvalue1 indicator_colvar_colvalue2 ...
for pivot_wider(df, colvar, indicator) even in the case indicator is a single name and colvar has only one modality
the benefit is that you never loose track of what's in the table from the names, and for instance you can generate nice hierarchical html table of the dataframe instead of using the flat lengthy names
can you point the name generating process for pivot wider in the code ? it seems it's in polar.
def pivot_wider(df, id_columns, names_from, values_from, names_prefix) do
names_prefix_optional = unless names_prefix == "", do: names_prefix
Shared.apply(:df_pivot_wider, [
df.data,
id_columns,
names_from,
values_from,
names_prefix_optional
])
|> Shared.create_dataframe()
end
explorer/lib/explorer/polars_backend/shared.ex
def apply(fun, args \\ []) do
case apply(Native, fun, args) do
{:ok, value} -> value
{:error, error} -> raise runtime_error(error)
end
end
Ok, the iterated calls part was just another example of the inconsistency. Got it.
can you point the name generating process for pivot wider in the code ?
https://github.com/elixir-explorer/explorer/blob/main/native/explorer/src/dataframe.rs#L593-L648
Note that there's some post-processing done in our Rust code.
For what I can guess, not being familiar with rust, the name post processing in the Explorer Rust code is not at the root of the inconsistency one can see the names_prefix in action but precisely, the inconsistency happens independently of names_prefix
moreover from the example in the polar doc one can see the one value_column case giving bare names
Closing this, as per #967.
Hello,
with version 0.7.2, but I think there is no change in 0.8
when using pivot wider and the names_prefix option the names generated are not consistent in at least two situations : 1 - a row variable going to column with only one modality is not always found in the name 2 - an indicator variable alone in values_from list is not found in the generated name
this is a problem when you want to dynamically manipulate the names afterwards, for instance to generate a header with span.