Closed mooreryan closed 4 weeks ago
What other aspects of the Access implementation did you maybe want to highlight?
Mainly some of the aspects from this discussion: https://elixirforum.com/t/access-behaviour-for-explorer-dataframe/66691. (Though, looking at the username, I think you might already be in that discussion!)
To summarize that thread, I was surprised by the fact that DataFrame had a different Access
behavior than what the docs for the Access
behavior would imply. E.g., raising rather than returning nil
, fetch
raising rather than returning :error
, Access.fetch!
raising ArgumentError
rather than KeyError
, etc.
I wasn't sure if these doc change was the spot to get into all of that though.
Yep that's me!
I wanted to clarify because some points (like not needing to implement fetch!
) were cleared up in the post but others weren't. I figured it'd be a good idea to make a fresh list all the stuff we might want to document here.
If we do want to include more than what's ["not_a_column"]
raising vs. returning nil
, I think I'd prefer to include a brief overview of the reasoning behind the design choices. Something like:
### Missing columns
Explorer will generally raise an `ArgumentError` if the argument to `[]` is not
found, but not always. For example:
...your example...
This deviates from other `Access` implementations which return `nil` when the
key is not found. We raise early to help users catch errors quicker since most
DataFrame operations could not meaningfully continue if the result was `nil`.
We also raise an `ArgumentError` rather than a `KeyError`. This is because `[]`
is designed to take a wider collection of arguments than a single key, e.g.
lists and regexes.
However, note that some `[]` calls will not raise if their argument is not
found. This occurs when argument to `[]` is more of a search criteria like
a regex:
iex> df = Explorer.DataFrame.new(col_1: [1], col_2: [2], other: ["a"])
iex> col_n_regex = ~r/^col_\d$/
iex> df[col_n_regex]
#Explorer.DataFrame<
Polars[1 x 2]
col_1 s64 [1]
col_2 s64 [2]
>
iex> nothing_will_match_regex = ~r/not_a_pattern/
iex> df[nothing_will_match_regex]
#Explorer.DataFrame<
Polars[0 x 0]
>
I'm not married to any of that language. But explaining the "why" might be nice (if the team agrees!).
But explaining the "why" might be nice
I like the idea of adding some clarification of the reasoning of why things were designed this way. For me, it was pretty surprising to see the same action (trying to access a column) have different result depending on the argument, (sometimes raising, sometimes returning an empty dataframe, etc.).
(note: this got a bit rambling, there is a summary at the bottom)
To be honest, I don't understand the reasoning behind sometimes raising and sometimes returning an empty dataframe. E.g., for this dataframe:
iex(1)> df = DataFrame.new(a: [1, 2])
#Explorer.DataFrame<
Polars[2 x 1]
a s64 [1, 2]
>
This raises:
iex(2)> df["b"]
** (ArgumentError) could not find column name "b". The available columns are: ["a"].
If you are attempting to interpolate a value, use ^b.
(explorer 0.9.2) lib/explorer/shared.ex:252: Explorer.Shared.maybe_raise_column_not_found/3
(explorer 0.9.2) lib/explorer/shared.ex:187: anonymous fn/4 in Explorer.Shared.to_existing_columns/3
(elixir 1.17.2) lib/enum.ex:1829: Enum."-map_reduce/3-lists^mapfoldl/2-0-"/3
(explorer 0.9.2) lib/explorer/shared.ex:175: Explorer.Shared.to_existing_columns/3
(explorer 0.9.2) lib/explorer/data_frame.ex:3984: Explorer.DataFrame.pull/2
(explorer 0.9.2) lib/explorer/data_frame.ex:373: Explorer.DataFrame.fetch/2
(elixir 1.17.2) lib/access.ex:322: Access.get/3
iex:2: (file)
But this returns an empty data frame:
iex(2)> df[~r/b/]
#Explorer.DataFrame<
Polars[0 x 0]
>
Asking for the 2nd column raises:
iex(3)> df[1]
** (ArgumentError) no column exists at index 1
(explorer 0.9.2) lib/explorer/shared.ex:240: Explorer.Shared.fetch_column_at!/2
(explorer 0.9.2) lib/explorer/shared.ex:178: anonymous fn/4 in Explorer.Shared.to_existing_columns/3
(elixir 1.17.2) lib/enum.ex:1829: Enum."-map_reduce/3-lists^mapfoldl/2-0-"/3
(explorer 0.9.2) lib/explorer/shared.ex:175: Explorer.Shared.to_existing_columns/3
(explorer 0.9.2) lib/explorer/data_frame.ex:3984: Explorer.DataFrame.pull/2
(explorer 0.9.2) lib/explorer/data_frame.ex:373: Explorer.DataFrame.fetch/2
(elixir 1.17.2) lib/access.ex:322: Access.get/3
iex:3: (file)
But this returns a data frame with just column a
even though I'm asking for the first both the 1st and 2nd columns:
iex(3)> df[0..1]
#Explorer.DataFrame<
Polars[2 x 1]
a s64 [1, 2]
>
I'm not making a value judgement on this, rather trying to explain what I found confusing. Consider this R tidyverse example. Here is a data frame.
> df <- tibble(a = 1:2)
> df
# A tibble: 2 × 1
a
<int>
1 1
2 2
And here attempting to access a column that doesn't exist:
> df["b"]
Error in `df["b"]`:
! Can't subset columns that don't exist.
✖ Column `b` doesn't exist.
That gives an error similar to Explorer. And the result of selecting columns from a data frame with regex also behaves like Explorer, by giving an "empty" dataframe.
> df |> select(matches("b"))
# A tibble: 2 × 0
Same behavior when accessing a column by index when that index is out of bounds (R using 1-based indexing):
> df[2]
Error in `df[2]`:
! Can't subset columns past the end.
ℹ Location 2 doesn't exist.
ℹ There is only 1 column.
But doing the equivalent of the df[0..1]
example from Explorer:
> df[1:2]
Error in `df[1:2]`:
! Can't subset columns past the end.
ℹ Location 2 doesn't exist.
ℹ There is only 1 column.
That raises an error in R, whereas Explorer gives the data frame with just column a
returned.
Explorer's access behavior is surprising to me in some ways. Not making a judgement on how it should be, since I don't have enough experience with Explorer, just trying to think of ways that docs could make it clearer.
I think some of this stuff is out of the scope for this small PR though. In fact, it would probably make sense as part of a cheat sheet or something like Explorer for Tidyverse users. I may have a go at putting something like that together, but it can be a big task.
(There was a discussion here about cheat sheets too.)
To summarize and expand a little, here are your uses in the three relevant libraries:
Use | Explorer | Tidy | Polars |
---|---|---|---|
Specific column | df["b"] |
df["b"] |
df["b"] |
String match | df[~r/b/] |
select(df, matches("b")) |
df.select(cs.matches("b")) |
Range past end | df[0..1] |
df[1:2] |
df[0:2] |
And here are their outputs:
Use | Explorer | Tidy | Polars | Agreement? |
---|---|---|---|---|
Specific column | ArgumentError |
Error |
ColumnNotFoundError |
✅ Yes |
String match | DataFrame | DataFrame | DataFrame | ✅ Yes |
Range past end | DataFrame | Error |
DataFrame | ❌ No |
So there's only a mismatch on the range one if I'm following. And I agree it's a bit borderline (and admit I was surprised ranges didn't raise). But when designing the API, the Explorer team is balancing a few tensions: what makes sense in Elixir, what data scientists are used to, what's feasible to implement today via Polars, etc. In this case, though I can't say for sure, I think we just defaulted to how Polars handles ranges rather than the tidyverse. Both seem somewhat reasonable to me.
But regardless of the reasoning behind individual design decisions, clearer/better docs are almost always the right call.
And as for the cheatsheet stuff, it would be appreciated! Even a start that can be improved upon later.
I think we should raise for ranges out of bounds, for consistency within DataFrames and also with Nx.
Issue for raising with ranges: https://github.com/elixir-explorer/explorer/issues/1005
Thanks for pointing it out, @mooreryan!
But when designing the API, the Explorer team is balancing a few tensions: what makes sense in Elixir, what data scientists are used to, what's feasible to implement today via Polars, etc. ... But regardless of the reasoning behind individual design decisions, clearer/better docs are almost always the right call.
This sort of info would be pretty cool to have in docs somewhere--I'm just not sure where they would ideally go.
Yeah I'm not sure. Maybe our contributor's guidelines?
:green_heart: :blue_heart: :purple_heart: :yellow_heart: :heart:
To start, I added one example that shows what happens when trying to access a data frame column that does not exist.
I'm not sure how much to go into some of the comments in #1001, e.g., the fact that an error is raised instead of returning
nil
and other aspects of theAccess
behaviour that might be surprising.What do you think, should it be kept simple like this or should I add something specifically about Access behaviour?