data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
https://data-apis.org/dataframe-api/draft/index.html
MIT License

Mutability #10

Open TomAugspurger opened 4 years ago

TomAugspurger commented 4 years ago

On the call yesterday, the topic of mutability came up in the vaex demo.

The short version is that it may be difficult or impossible for some systems to implement inplace mutation of dataframes. For example, I believe that neither vaex nor Dask implement the following:

In [8]: df = pd.DataFrame({"A": [1, 2]})

In [9]: df
Out[9]:
   A
0  1
1  2

In [10]: df.loc[0, 'A'] = 0

In [11]: df
Out[11]:
   A
0  0
1  2

I think in the name of simplicity, the API standard should just not define any methods that mutate existing data inplace.

There is one mutation-adjacent area that might be considered: using DataFrame.__setitem__ to add an additional column

In [12]: df['B'] = [1, 2]

In [13]: df
Out[13]:
   A  B
0  0  1
1  2  2

Or perhaps to update the contents of an entire column

In [14]: df['B'] = [3, 4]

In [15]: df
Out[15]:
   A  B
0  0  3
1  2  4

In these cases, no values are actually being mutated inplace. Is that acceptable?
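
To make the distinction concrete, here is a minimal pandas sketch (illustrative only): column assignment expands the frame object itself, while a copy-returning spelling such as assign leaves the original untouched.

import pandas as pd

df = pd.DataFrame({"A": [0, 2]})

# Mutation-adjacent style: df itself gains a column, though no
# existing values are overwritten.
df["B"] = [1, 2]

# Copy-returning alternative: assign() returns a new DataFrame and
# leaves df untouched.
df2 = df.assign(B=[3, 4])
assert list(df["B"]) == [1, 2]   # original column is unchanged
assert list(df2["B"]) == [3, 4]  # new frame carries the update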

devin-petersohn commented 4 years ago

I think in the name of simplicity, the API standard should just not define any methods that mutate existing data inplace.

I agree completely. It is possible to simulate inplace behavior for legacy systems, but moving forward we should recommend that users make their code more functional.

There is one mutation-adjacent area that might be considered: using DataFrame.__setitem__ to add an additional column

Not to derail, but I would argue against __setitem__ in the API spec (more discussion in #6). Adding a column should be something like insert, and updating a column should perhaps be update. If we do not support inplace mutations, we should do what we can in the API spec to avoid giving the idea that things are being updated inplace.
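
To illustrate (a sketch only: insert and update are just the names suggested above, not settled API, and the wrapper class is purely hypothetical):

import pandas as pd

class ImmutableFrame:
    # Toy illustration only: wraps a pandas DataFrame and exposes
    # copy-returning operations, never mutating the wrapped frame.
    def __init__(self, df: pd.DataFrame):
        self._df = df

    def insert(self, name, values) -> "ImmutableFrame":
        # Add a column; the receiver is untouched.
        return ImmutableFrame(self._df.assign(**{name: values}))

    def update(self, name, values) -> "ImmutableFrame":
        # Replace a column's contents; also returns a new frame.
        return ImmutableFrame(self._df.assign(**{name: values}))

df = ImmutableFrame(pd.DataFrame({"A": [1, 2]}))
df2 = df.insert("B", [1, 2])   # df unchanged
df3 = df2.update("B", [3, 4])  # df2 unchanged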

TomAugspurger commented 4 years ago

Sure, let's avoid the __setitem__ discussion for now and only focus on inplace expansion of columns.

I didn't state it explicitly, but I think inplace expansion of rows is out of scope, given the columnar focus.

saulshanabrook commented 4 years ago

I think in the name of simplicity, the API standard should just not define any methods that mutate existing data inplace.

One possible issue with this approach is that if the DataFrame API uses the Array API as its column interface, then this could also mean that arrays must be immutable, which would mean that this kind of existing imperative code in sklearn would not work with the standard.

It seems like there are at least a couple of different options here, assuming that we do want to keep supporting an imperative array API for things like sklearn to use and a functional dataframe API for things like Modin or Vaex to implement.

  1. Keep dataframes functional and arrays imperative. As @devin-petersohn put it: "An API for going from dataframe API standard -> array API standard might solve this issue."
  2. Have two layers/levels for each API: a higher-level, messier version that allows mutation and non-functional operations, and a lower-level one where all mutation is functional in nature instead. The high-level dataframe would return a high-level array that allows mutation; the low-level dataframe would return the low-level array. This is like what Travis said: "One that is more dev-centric and another that is more user-centric. For clarity, and now that packaging is in a better situation, two libraries are often more effective than one."
  3. Instead of separating into high/low level, have separate mutable and immutable APIs. This would be like how collections.abc has Mutable versions of some of its interfaces (a sketch follows below).
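
As a rough illustration of option 3 (all names hypothetical), the split could mirror collections.abc, where the mutable interface extends the immutable one:

from abc import ABC, abstractmethod

class Frame(ABC):
    # Immutable interface: every operation returns a new frame.
    @abstractmethod
    def assign(self, **columns) -> "Frame": ...

class MutableFrame(Frame):
    # Mutable extension, layered the way collections.abc layers
    # MutableMapping on top of Mapping.
    @abstractmethod
    def __setitem__(self, name, values): ...
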
devin-petersohn commented 4 years ago

@TomAugspurger agreed. Taking it a step further, I think any inplace-related updates should not be supported: expansion/destruction of rows/columns, inplace point mutations, etc.

Back to your original question:

In these case, no values are actually being mutated inplace. Is that acceptable?

I believe that all APIs should result in a new dataframe, so I would argue that we don't want an API for creating a new column on an existing dataframe or for modifying the data underlying an existing dataframe object. (Implementations are free to offer this individually, but it will be outside the spec.)

Often users chain operations together. Ideally every operation can be chained with another in the API. Example from pandas:

df.dropna(inplace=True)
df.groupby("c").count()

vs

df.dropna().groupby("c").count()

For general use, it is better not to make the user choose between the inplace and copy-returning variants. The bottom example also has the nice benefit of being easier to debug.
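
For example (a sketch; the column name "c" is just the placeholder from the example above): because each step returns a value, the chain can be split at any point and the intermediate frame inspected.

import pandas as pd

df = pd.DataFrame({"c": ["x", "x", None], "v": [1, 2, 3]})
cleaned = df.dropna()                   # inspect `cleaned` here if needed
counts = cleaned.groupby("c").count()   # then continue the pipeline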

I am generally against the idea of tightly coupling the array and dataframe APIs.

Is there a reason, other than trying to optimize for performance or memory, to make anything inplace?

TomAugspurger commented 4 years ago

I don't think performance should be a primary concern here, just matching (hypothetical) user expectations that something like df['A'] = ... would work.

I personally am completely fine leaving that out of the initial version, in favor of something that's easily supported across implementations like df.assign(A=...).
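
For instance (a sketch, assuming Dask is installed; Dask mirrors pandas' copy-returning assign):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"A": [1, 2]})
ddf = dd.from_pandas(pdf, npartitions=1)

pdf2 = pdf.assign(B=pdf["A"] * 10)   # new pandas frame; pdf is unchanged
ddf2 = ddf.assign(B=ddf["A"] * 10)   # new lazy Dask frame; ddf is unchanged
print(ddf2.compute())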

TomAugspurger commented 4 years ago

Apologies for missing this @saulshanabrook

One possible issue with this approach is that if the DataFrame API uses the Array API as its column interface, then this could also mean that arrays must be immutable,

I'm still having trouble thinking through the implications of that, though; I'll give it some more thought.


In general, it sounds like people prefer to avoid mutation of existing data, whether that's mutating individual values inplace or expanding an existing dataframe by adding new rows / columns. So I'll propose that we explicitly leave the pandas-style APIs discussed above (inplace point mutation via .loc, column addition and replacement via __setitem__, and inplace= keyword arguments) out of the spec.

There might still be a mutable level / extension to the standard, once we figure out how to handle the various levels.

datapythonista commented 4 years ago

Let me try to summarize the discussion, and see if there is agreement to make some decisions. The preference so far seems to be a functional, copy-returning API where operations chain, along these lines:

import pandas

(pandas.read_csv('countries.csv')
       .rename(columns={'name': 'country'})
       .assign(area_km2=lambda df: df['area_m2'].astype(float) / 1e6)  # m^2 -> km^2
       .query('(continent.str.lower() != "antarctica") | (population < area_km2)')
       .groupby('continent')
       [['population', 'area_km2']]
       .sum()
       .assign(density=lambda df: df['population'] / df['area_km2'])
       [['density']]
       .to_parquet('continent_density.parquet', index=True))


I think most operations are quite simple and non-controversial (feel free to disagree, of course). I think the main changes (from pandas to the standard API) would be:
- Remove the syntax that is not consistent with this behaviour
- Consider special cases, mainly `__getitem__` (do we want to replace it with `.select()` and `.filter()` instead? — see the sketch after this list)
- When creating new columns and in conditions, how to refer to the columns: is using `lambda` as pandas does the best way, or are there alternatives?
- Obviously, the specific methods to implement and their signatures are open to discussion
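
For the `__getitem__` question, a toy sketch of what explicit spellings could look like (`select` and `filter` are only the candidate names from the list above; note pandas' own `DataFrame.filter` selects labels, not rows, so the name would be overloaded):

import pandas as pd

class Frame:
    # Toy wrapper, not proposed spec code.
    def __init__(self, df: pd.DataFrame):
        self._df = df

    def col(self, name):
        # Column access for building conditions.
        return self._df[name]

    def select(self, *names) -> "Frame":
        # Column projection: an explicit spelling of df[['a', 'b']].
        return Frame(self._df[list(names)])

    def filter(self, mask) -> "Frame":
        # Row filtering: an explicit spelling of df[boolean_mask].
        return Frame(self._df[mask])

f = Frame(pd.DataFrame({"population": [10, 2000], "area_km2": [1.0, 2.0]}))
result = f.filter(f.col("population") > 100).select("population")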

Is there agreement on the above points, or are there things that should be discussed further?