elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License
1.12k stars 123 forks source link

Add access example to DataFrame docs (#1001) #1004

Closed mooreryan closed 4 weeks ago

mooreryan commented 1 month ago

To start, I added one example that shows what happens when trying to access a data frame column that does not exist.

I'm not sure how much to go into some of the comments in #1001, e.g., the fact that an error is raised instead of returning nil and other aspects of the Access behaviour that might be surprising.

What do you think, should it be kept simple like this or should I add something specifically about Access behaviour?

mooreryan commented 1 month ago

What other aspects of the Access implementation did you maybe want to highlight?

Mainly some of the aspects from this discussion: https://elixirforum.com/t/access-behaviour-for-explorer-dataframe/66691. (Though, looking at the username, I think you might already be in that discussion!)

To summarize that thread, I was surprised by the fact that DataFrame had a different Access behavior than what the docs for the Access behavior would imply. E.g., raising rather than returning nil, fetch raising rather than returning :error, Access.fetch! raising ArgumentError rather than KeyError, etc.

I wasn't sure if these doc change was the spot to get into all of that though.

billylanchantin commented 1 month ago

Yep that's me!

I wanted to clarify because some points (like not needing to implement fetch!) were cleared up in the post but others weren't. I figured it'd be a good idea to make a fresh list all the stuff we might want to document here.

If we do want to include more than what's ["not_a_column"] raising vs. returning nil, I think I'd prefer to include a brief overview of the reasoning behind the design choices. Something like:

### Missing columns

Explorer will generally raise an `ArgumentError` if the argument to `[]` is not
found, but not always. For example:

...your example...

This deviates from other `Access` implementations which return `nil` when the
key is not found. We raise early to help users catch errors quicker since most
DataFrame operations could not meaningfully continue if the result was `nil`.

We also raise an `ArgumentError` rather than a `KeyError`. This is because `[]`
is designed to take a wider collection of arguments than a single key, e.g.
lists and regexes.

However, note that some `[]` calls will not raise if their argument is not
found. This occurs when argument to `[]` is more of a search criteria like
a regex:

    iex> df = Explorer.DataFrame.new(col_1: [1], col_2: [2], other: ["a"])
    iex> col_n_regex = ~r/^col_\d$/
    iex> df[col_n_regex]
    #Explorer.DataFrame<
      Polars[1 x 2]
      col_1 s64 [1]
      col_2 s64 [2]
    >
    iex> nothing_will_match_regex = ~r/not_a_pattern/
    iex> df[nothing_will_match_regex]
    #Explorer.DataFrame<
      Polars[0 x 0]
    >

I'm not married to any of that language. But explaining the "why" might be nice (if the team agrees!).

mooreryan commented 1 month ago

But explaining the "why" might be nice

I like the idea of adding some clarification of the reasoning of why things were designed this way. For me, it was pretty surprising to see the same action (trying to access a column) have different result depending on the argument, (sometimes raising, sometimes returning an empty dataframe, etc.).

(note: this got a bit rambling, there is a summary at the bottom)

More examples

To be honest, I don't understand the reasoning behind sometimes raising and sometimes returning an empty dataframe. E.g., for this dataframe:

iex(1)> df = DataFrame.new(a: [1, 2])
#Explorer.DataFrame<
  Polars[2 x 1]
  a s64 [1, 2]
>

This raises:

iex(2)> df["b"]
** (ArgumentError) could not find column name "b". The available columns are: ["a"].
If you are attempting to interpolate a value, use ^b.

    (explorer 0.9.2) lib/explorer/shared.ex:252: Explorer.Shared.maybe_raise_column_not_found/3
    (explorer 0.9.2) lib/explorer/shared.ex:187: anonymous fn/4 in Explorer.Shared.to_existing_columns/3
    (elixir 1.17.2) lib/enum.ex:1829: Enum."-map_reduce/3-lists^mapfoldl/2-0-"/3
    (explorer 0.9.2) lib/explorer/shared.ex:175: Explorer.Shared.to_existing_columns/3
    (explorer 0.9.2) lib/explorer/data_frame.ex:3984: Explorer.DataFrame.pull/2
    (explorer 0.9.2) lib/explorer/data_frame.ex:373: Explorer.DataFrame.fetch/2
    (elixir 1.17.2) lib/access.ex:322: Access.get/3
    iex:2: (file)

But this returns an empty data frame:

iex(2)> df[~r/b/]
#Explorer.DataFrame<
  Polars[0 x 0]
>

Asking for the 2nd column raises:

iex(3)> df[1]
** (ArgumentError) no column exists at index 1
    (explorer 0.9.2) lib/explorer/shared.ex:240: Explorer.Shared.fetch_column_at!/2
    (explorer 0.9.2) lib/explorer/shared.ex:178: anonymous fn/4 in Explorer.Shared.to_existing_columns/3
    (elixir 1.17.2) lib/enum.ex:1829: Enum."-map_reduce/3-lists^mapfoldl/2-0-"/3
    (explorer 0.9.2) lib/explorer/shared.ex:175: Explorer.Shared.to_existing_columns/3
    (explorer 0.9.2) lib/explorer/data_frame.ex:3984: Explorer.DataFrame.pull/2
    (explorer 0.9.2) lib/explorer/data_frame.ex:373: Explorer.DataFrame.fetch/2
    (elixir 1.17.2) lib/access.ex:322: Access.get/3
    iex:3: (file)

But this returns a data frame with just column a even though I'm asking for the first both the 1st and 2nd columns:

iex(3)> df[0..1]
#Explorer.DataFrame<
  Polars[2 x 1]
  a s64 [1, 2]
>

Comparing to R's tidyverse

I'm not making a value judgement on this, rather trying to explain what I found confusing. Consider this R tidyverse example. Here is a data frame.

 > df <- tibble(a = 1:2)
> df
# A tibble: 2 × 1
      a
  <int>
1     1
2     2

And here attempting to access a column that doesn't exist:

 > df["b"]
Error in `df["b"]`:
! Can't subset columns that don't exist.
✖ Column `b` doesn't exist.

That gives an error similar to Explorer. And the result of selecting columns from a data frame with regex also behaves like Explorer, by giving an "empty" dataframe.

> df |> select(matches("b"))
# A tibble: 2 × 0

Same behavior when accessing a column by index when that index is out of bounds (R using 1-based indexing):

> df[2]
Error in `df[2]`:
! Can't subset columns past the end.
ℹ Location 2 doesn't exist.
ℹ There is only 1 column.

But doing the equivalent of the df[0..1] example from Explorer:

> df[1:2]
Error in `df[1:2]`:
! Can't subset columns past the end.
ℹ Location 2 doesn't exist.
ℹ There is only 1 column.

That raises an error in R, whereas Explorer gives the data frame with just column a returned.

Summary

Explorer's access behavior is surprising to me in some ways. Not making a judgement on how it should be, since I don't have enough experience with Explorer, just trying to think of ways that docs could make it clearer.

I think some of this stuff is out of the scope for this small PR though. In fact, it would probably make sense as part of a cheat sheet or something like Explorer for Tidyverse users. I may have a go at putting something like that together, but it can be a big task.

(There was a discussion here about cheat sheets too.)

billylanchantin commented 1 month ago

To summarize and expand a little, here are your uses in the three relevant libraries:

Use Explorer Tidy Polars
Specific column df["b"] df["b"] df["b"]
String match df[~r/b/] select(df, matches("b")) df.select(cs.matches("b"))
Range past end df[0..1] df[1:2] df[0:2]

And here are their outputs:

Use Explorer Tidy Polars Agreement?
Specific column ArgumentError Error ColumnNotFoundError ✅ Yes
String match DataFrame DataFrame DataFrame ✅ Yes
Range past end DataFrame Error DataFrame ❌ No

So there's only a mismatch on the range one if I'm following. And I agree it's a bit borderline (and admit I was surprised ranges didn't raise). But when designing the API, the Explorer team is balancing a few tensions: what makes sense in Elixir, what data scientists are used to, what's feasible to implement today via Polars, etc. In this case, though I can't say for sure, I think we just defaulted to how Polars handles ranges rather than the tidyverse. Both seem somewhat reasonable to me.


But regardless of the reasoning behind individual design decisions, clearer/better docs are almost always the right call.

And as for the cheatsheet stuff, it would be appreciated! Even a start that can be improved upon later.

josevalim commented 1 month ago

I think we should raise for ranges out of bounds, for consistency within DataFrames and also with Nx.

billylanchantin commented 1 month ago

Issue for raising with ranges: https://github.com/elixir-explorer/explorer/issues/1005

Thanks for pointing it out, @mooreryan!

mooreryan commented 1 month ago

But when designing the API, the Explorer team is balancing a few tensions: what makes sense in Elixir, what data scientists are used to, what's feasible to implement today via Polars, etc. ... But regardless of the reasoning behind individual design decisions, clearer/better docs are almost always the right call.

This sort of info would be pretty cool to have in docs somewhere--I'm just not sure where they would ideally go.

billylanchantin commented 1 month ago

Yeah I'm not sure. Maybe our contributor's guidelines?

josevalim commented 4 weeks ago

:green_heart: :blue_heart: :purple_heart: :yellow_heart: :heart: