TomAugspurger opened this issue 4 years ago
> I think in the name of simplicity, the API standard should just not define any methods that mutate existing data inplace.
I agree completely. It is possible to simulate inplace behavior for legacy systems, but moving forward we should recommend that users make their code more functional.
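For instance, here is a hypothetical sketch of how an implementation could simulate an inplace spelling on top of a purely functional core; none of these names come from the spec, and the data model is deliberately simplistic:

```python
class DataFrame:
    """Hypothetical wrapper that fakes inplace semantics for legacy callers."""

    def __init__(self, data):
        self._data = data

    def _dropna_pure(self):
        # Purely functional core: always builds and returns a new object.
        new_data = [row for row in self._data if None not in row]
        return DataFrame(new_data)

    def dropna(self, inplace=False):
        new = self._dropna_pure()
        if inplace:
            # "Mutation" is just rebinding internals to the new result.
            self._data = new._data
            return None
        return new

df = DataFrame([[1, 2], [None, 3]])
df.dropna(inplace=True)   # legacy spelling still "works"
df2 = df.dropna()         # functional spelling, preferred going forward
```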
> There is one mutation-adjacent area that might be considered: using `DataFrame.__setitem__` to add an additional column
Not to derail, but I would argue against `__setitem__` in the API spec (more discussion in #6). Adding a column should be something like `insert`, and updating a column should perhaps be `update`. If we do not support inplace mutations, we should do what we can in the API spec to avoid giving the idea that things are updated inplace.
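A rough sketch of what that could look like; the names `insert` and `update` come from the comment above, while the pandas-based bodies are assumptions for illustration only:

```python
import pandas as pd

def insert(df: pd.DataFrame, name: str, values) -> pd.DataFrame:
    """Return a new dataframe with an added column; `df` is left untouched."""
    if name in df.columns:
        raise ValueError(f"column {name!r} already exists")
    return df.assign(**{name: values})

def update(df: pd.DataFrame, name: str, values) -> pd.DataFrame:
    """Return a new dataframe with the contents of `name` replaced."""
    if name not in df.columns:
        raise KeyError(name)
    return df.assign(**{name: values})

df = pd.DataFrame({"a": [1, 2, 3]})
df2 = insert(df, "b", [4, 5, 6])   # df still has only column "a"
```

Either way, both operations return a new dataframe, so nothing in the spelling suggests inplace mutation.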
Sure, let's avoid the `__setitem__` discussion for now and only focus on inplace expansion of columns.
I didn't state it explicitly, but I think inplace expansion of rows is out of scope, given the columnar focus.
> I think in the name of simplicity, the API standard should just not define any methods that mutate existing data inplace.
One possible issue with this approach is that if the DataFrame API uses the Array API as its column interface, then this could also mean that arrays must be immutable, which would mean that existing imperative code in sklearn (which mutates arrays in place) would not work with the standard.
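For concreteness, this is the general shape of such imperative code (an illustrative snippet, not actual scikit-learn source):

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, np.nan]])

# In-place point mutation: impute missing entries.
X[np.isnan(X)] = 0.0

# In-place column update: center a feature column.
X[:, 0] -= X[:, 0].mean()
```

Under an immutable-array requirement, both assignments would have to become operations that return new arrays.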
It seems like there are at least a couple of different options here, assuming that we do want to keep supporting an imperative array API for things like sklearn to use and a functional dataframe API for things like Modin or Vaex to implement.
`collections.abc` has `Mutable` versions of some of its ABCs, as one option.
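For reference, this is the standard-library pattern being referred to: the immutable and mutable interfaces are split into separate ABCs (a minimal sketch):

```python
from collections.abc import Mapping, MutableMapping, Sequence, MutableSequence

# Sequence defines read-only behavior (__getitem__, __len__, ...);
# MutableSequence extends it with mutation (__setitem__, insert, ...).
assert issubclass(MutableSequence, Sequence)
assert issubclass(MutableMapping, Mapping)

# A dataframe spec could mirror this: a base (immutable) DataFrame interface,
# plus an optional MutableDataFrame extension for implementations that want it.
```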
@TomAugspurger agree. Taking it a step further, I think any inplace-related updates should not be supported: expansion/destruction of rows/columns, inplace point mutations, etc.
Back to your original question:
> In these cases, no values are actually being mutated inplace. Is that acceptable?
I believe that all APIs should result in a new dataframe, so I would argue that we don't want an API for creating a new column on an existing dataframe, i.e. for modifying the data underlying an existing dataframe object. (Implementations are free to offer this individually, but it will be outside the spec.)
Often users chain operations together. Ideally every operation can be chained with another in the API. Example from pandas:
```python
df.dropna(inplace=True)
df.groupby("c").count()
```

vs

```python
df.dropna().groupby("c").count()
```
For general use, it is better to not make the user choose between inplace vs not. The bottom example has the nice benefit of being easier to debug.
I am generally against the idea of tightly coupling the array and dataframe APIs.
Is there a reason, other than trying to optimize for performance or memory, to make anything inplace?
I don't think performance should be a primary concern here, just matching (hypothetical) user expectations that something like `df['A'] = ...` would work. I personally am completely fine leaving that out of the initial version, in favor of something that's easily supported across implementations like `df.assign(A=...)`.
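For reference, the two spellings side by side (standard pandas):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# __setitem__ spelling: mutates the existing dataframe in place.
df["A"] = [4, 5, 6]

# assign spelling: returns a new dataframe and leaves its input untouched,
# so it composes naturally with method chaining.
df2 = df.assign(A=[7, 8, 9])
assert list(df["A"]) == [4, 5, 6]   # df unchanged by assign
```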
Apologies for missing this, @saulshanabrook.
> One possible issue with this approach is that if the DataFrame API uses the Array API as its column interface, then this could also mean that arrays must be immutable,
I'm still thinking this through... I'm having trouble working out the implications, though. I'll give it some more thought.
In general, it sounds like people prefer to avoid mutation of existing data, whether that's mutating individual values inplace or expanding an existing dataframe by adding new rows / columns. So I'll propose that we explicitly leave the following (pandas-style) APIs out of the spec:

- `df.iloc[0, 0] = 0` -- mutate an existing value inplace with `__setitem__`
- `df.method(inplace=True)` -- mutate an existing dataframe inplace with a method
- `df['A'] = [1, 2, 3]` -- expand a dataframe columnwise
- `df.append([1, 2])` -- expand a dataframe rowwise

With the option that there might be a mutable level / extension to the standard, once we figure out how to handle the various levels. (Functional counterparts for each are sketched below.)
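A sketch of functional counterparts for each excluded API, using pandas spellings for illustration only; nothing here is proposed spec text:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, None, 3.0]})

# Instead of df.iloc[0, 0] = 0: any operation returning a new frame,
# e.g. replacing missing values via mask.
df2 = df.mask(df.isna(), 0)

# Instead of df.method(inplace=True): call the method and rebind the result.
df3 = df.dropna()

# Instead of df['B'] = [...]: assign returns a new, expanded frame.
df4 = df.assign(B=[7, 8, 9])

# Instead of df.append(...): concat builds a new, row-expanded frame.
df5 = pd.concat([df, pd.DataFrame({"A": [4.0]})], ignore_index=True)
```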
Let me try to summarize the discussion, and see if there is agreement to make some decisions.
We don't want the API to provide ways to explicitly mutate information in place. My understanding is that we're talking about the API syntax, and it would still be possible for implementations to perform operations in place if they want to. For example:
```python
df2 = df1.fillna(0.)
```

This doesn't force the implementation to operate on `df1` in place, but it also doesn't prevent `df1` from being mutated, with `df2` being just a reference to `df1`. The syntax also doesn't prevent `df1` and `df2` from being lazy objects, with no memory actually allocated. So, for now, we want a syntax that allows copies in all operations, and we leave for later (or out of scope) what's really going on.
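As an illustration of that freedom, here is a hypothetical implementation sketch (not spec text) in which the copy-like syntax is satisfied while no data is copied or computed at all:

```python
# Hypothetical: a lazy dataframe that records operations instead of executing
# them. The user-facing syntax still looks like "every call returns a new df".
class LazyFrame:
    def __init__(self, source, ops=()):
        self.source = source          # underlying (shared, unmutated) data
        self.ops = ops                # recorded operations, not yet run

    def fillna(self, value):
        # Returns a NEW LazyFrame; nothing is copied or computed here.
        return LazyFrame(self.source, self.ops + (("fillna", value),))

df1 = LazyFrame(source="countries.csv")
df2 = df1.fillna(0.0)   # df1 untouched; df2 shares df1's source data
```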
We want the syntax to consistently return dataframes, so methods can be chained.
If we want to build a friendlier API (as discussed here), we'll leave that for later (or out of scope). For now, let's focus on a developer (more low-level) API, as agreed in #18.
pandas mostly implements an API that allows method chaining. An example:

```python
import pandas

(pandas.read_csv('countries.csv')
       .rename(columns={'name': 'country'})
       .assign(area_km2=lambda df: df['area_m2'].astype(float) / 1e6)  # m2 -> km2
       .query('(continent.str.lower() != "antarctica") | (population < area_km2)')
       .groupby('continent')
       [['population', 'area_km2']]
       .sum()
       .assign(density=lambda df: df['population'] / df['area_km2'])
       [['density']]
       .to_parquet('continent_density.parquet', index=True))
```
I think most operations are quite simple and non-controversial (feel free to disagree, of course). I think the main changes (from pandas, to the standard API) would be:
- Remove the syntax that is not consistent with this behaviour
- Consider special cases, mainly `__getitem__` (do we want to replace it with `.select()` and `.filter()` instead?)
- When creating new columns and in conditions, how do we refer to the columns? Is using `lambda` as pandas does the best way, or are there alternatives (one alternative is sketched below)?
- Obviously, the specific methods to implement and their signatures are open to discussion
Is there agreement on the above points, or are there things that should be discussed further?
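To make the `lambda` question concrete, here are the two styles side by side. The expression-based style is a hypothetical sketch loosely modeled on expression APIs such as `polars.col`, not a proposal from this thread; `col` is an assumed helper, defined minimally here so the comparison runs:

```python
import pandas as pd

df = pd.DataFrame({"population": [10.0, 20.0], "area_km2": [1.0, 4.0]})

# Style 1: pandas-style lambda; `df` inside the lambda is the intermediate frame.
r1 = df.assign(density=lambda df: df["population"] / df["area_km2"])

# Style 2 (hypothetical): a column-expression object, so no lambda is needed.
class col:
    def __init__(self, name):
        self.name = name

    def __truediv__(self, other):
        # Build a callable that pandas' assign will evaluate against the frame.
        return lambda df: df[self.name] / df[other.name]

r2 = df.assign(density=col("population") / col("area_km2"))
```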
On the call yesterday, the topic of mutability came up in the vaex demo.
The short version is that it may be difficult or impossible for some systems to implement inplace mutation of dataframes. For example, I believe that neither vaex nor Dask implement inplace assignment along the lines of `df.iloc[0, 0] = 0`.
I think in the name of simplicity, the API standard should just not define any methods that mutate existing data inplace.
There is one mutation-adjacent area that might be considered: using `DataFrame.__setitem__` to add an additional column, or perhaps to update the contents of an entire column. In these cases, no values are actually being mutated inplace. Is that acceptable?
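To illustrate the "mutation-adjacent" distinction (standard pandas; the point is that no existing element value is overwritten in place, only whole columns change):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Adds a column: the dataframe object changes shape, but no existing
# value in column "a" is touched.
df["b"] = [4, 5, 6]

# Replaces a whole column's contents: the old values are swapped out
# wholesale rather than mutated element by element.
df["a"] = [7, 8, 9]
```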