data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
https://data-apis.org/dataframe-api/draft/index.html
MIT License

Signature for a standard `from_dataframe` constructor function #42

Open rgommers opened 3 years ago

rgommers commented 3 years ago

One of the "to be decided" items at https://github.com/data-apis/dataframe-api/blob/dataframe-interchange-protocol/protocol/dataframe_protocol_summary.md#to-be-decided is:

_Should there be a standard `from_dataframe` constructor function? This isn't completely necessary, however it's expected that a full dataframe API standard will have such a function. The array API standard also has such a function, namely `from_dlpack`. Adding at least a recommendation on syntax for this function would make sense, e.g., `from_dataframe(df, stream=None)`. Discussion at https://github.com/data-apis/dataframe-api/issues/29#issuecomment-685903651 is relevant._

In the announcement blog post draft I tentatively answered that with "yes", and added an example. The question is what the desired signature should be. The Pandas prototype currently has the most basic signature one can think of:

def from_dataframe(df : DataFrameObject) -> pd.DataFrame:
    """
    Construct a pandas DataFrame from ``df`` if it supports ``__dataframe__``
    """
    if isinstance(df, pd.DataFrame):
        return df

    if not hasattr(df, '__dataframe__'):
        raise ValueError("`df` does not support __dataframe__")

    return _from_dataframe(df.__dataframe__())

The above just takes any dataframe supporting the protocol, and turns the whole thing into the "library-native" dataframe. Now of course, it's possible to add functionality to it, to extract only a subset of the data. Most obviously, named columns:

def from_dataframe(df : DataFrameObject, *, colnames : Optional[Iterable[str]]= None) -> pd.DataFrame:
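
A minimal sketch of how such a `colnames` keyword could behave, assuming a toy producer. `ToyFrame` and its dict-of-lists storage are hypothetical stand-ins, and the "native" result here is a plain dict rather than a real `pd.DataFrame`. Only the method names `__dataframe__`, `column_names`, `select_columns_by_name` and `get_column_by_name` are taken from the protocol:

```python
from typing import Dict, Iterable, List, Optional


class ToyFrame:
    """Hypothetical producer that stores its columns as a dict of lists."""

    def __init__(self, columns: Dict[str, list]):
        self._columns = dict(columns)

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
        # Simplification: a real producer returns a separate interchange
        # object; the toy just returns itself.
        return self

    def column_names(self) -> List[str]:
        return list(self._columns)

    def select_columns_by_name(self, names: Iterable[str]) -> "ToyFrame":
        return ToyFrame({name: self._columns[name] for name in names})

    def get_column_by_name(self, name: str) -> list:
        # Simplification: the protocol returns a Column object, not a list.
        return self._columns[name]


def from_dataframe(df, *, colnames: Optional[Iterable[str]] = None) -> Dict[str, list]:
    """Build the toy "native" frame (a dict), optionally from a column subset."""
    if not hasattr(df, "__dataframe__"):
        raise ValueError("`df` does not support __dataframe__")
    xchg = df.__dataframe__()
    if colnames is not None:
        # Let the producer do the subsetting before any data is converted.
        xchg = xchg.select_columns_by_name(list(colnames))
    return {name: xchg.get_column_by_name(name) for name in xchg.column_names()}


source = ToyFrame({"a": [1, 2], "b": [3.0, 4.0], "c": ["x", "y"]})
subset = from_dataframe(source, colnames=["a", "c"])
```

Delegating the selection to the producer means columns that were not requested are never converted, which is the main reason to put the keyword in the constructor rather than subsetting afterwards.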

Other things we may or may not want to support:

My personal feeling is:

Thoughts?

rgommers commented 3 years ago

There was a little bit of hesitation about adding this function to a public API. For the initial version, I'd suggest adding phrasing along these lines:

rgommers commented 2 years ago

This would be nice to revisit, before everyone makes up their own thing in a different namespace in their library. Like this:

>>> import pandas as pd
>>> pd.__version__
'1.5.0rc0'
>>> [name for name in dir(pd.api.interchange) if not name.startswith('_')]
['DataFrame', 'from_dataframe']

>>> pd.api.interchange.from_dataframe?
Signature: pd.core.interchange.from_dataframe.from_dataframe(df, allow_copy=True) -> 'pd.DataFrame'

See https://pandas.pydata.org/docs/dev/reference/api/pandas.api.interchange.from_dataframe.html

jorisvandenbossche commented 2 years ago

Do you want to standardize the signature, or also the namespace / location in the library?

rgommers commented 2 years ago

Good point. I think those are separate questions. Signature is more important I'd say. Namespace is only important once we have a concept of a "dataframe API standard namespace" - so that can be ignored for the purpose of this issue.

rgommers commented 2 years ago

Pandas code and signature:

def from_dataframe(df, allow_copy=True) -> pd.DataFrame:

Vaex code and signature:

def from_dataframe_to_vaex(df: DataFrameObject, allow_copy: bool = True) -> vaex.dataframe.DataFrame:

Modin code for the function and for the method, with their signatures:

def from_dataframe(df):

class PandasDataframe:
    def from_dataframe(cls, df: "ProtocolDataframe") -> "PandasDataframe":

cuDF code and signature:


def from_dataframe(df, allow_copy=False):

I found the explanation for allow_copy deviations in some older meeting notes:

_@maartenbreddels: if allow_copy or allow_memory_copy, then clearer to me. I am more in favor of allow_copy being False and thus being safe (performance-wise, and that I don't accidentally crash my computer)._

@jorisvandenbossche: an example would be string columns in pandas. Currently, pandas cannot support Arrow string columns, where there are two buffers. In the future, pandas will use Arrow, but right now it uses NumPy's object dtype. So atm, pandas would require a copy, and with copies disallowed it would always raise an exception.
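
To make that trade-off concrete, here is a hedged sketch of the behaviour described in those notes: a consumer that cannot use a column's buffers directly either copies (when `allow_copy=True`) or raises. `ToyColumn`, its `needs_copy` flag, and the choice of `RuntimeError` are hypothetical illustration choices, not any library's actual implementation:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ToyColumn:
    values: list
    # Hypothetical flag: True when the consumer cannot use the buffers as-is
    # (e.g. an Arrow-style string column for an object-dtype consumer).
    needs_copy: bool = False


def from_dataframe(columns: Dict[str, ToyColumn], allow_copy: bool = True) -> Dict[str, list]:
    result = {}
    for name, col in columns.items():
        if not col.needs_copy:
            result[name] = col.values  # zero-copy: share the same list object
        elif allow_copy:
            result[name] = list(col.values)  # explicit copy of the data
        else:
            raise RuntimeError(f"column {name!r} requires a copy, but allow_copy=False")
    return result


frame = {
    "ints": ToyColumn([1, 2, 3]),
    "strs": ToyColumn(["a", "b", "c"], needs_copy=True),
}

copied = from_dataframe(frame, allow_copy=True)

try:
    from_dataframe(frame, allow_copy=False)
    raised = False
except RuntimeError:
    raised = True
```

With `allow_copy=False` as the default, the string column above fails loudly instead of silently doubling memory use; with `allow_copy=True` the call always succeeds, at the cost of a possible hidden copy.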

Based on the above, I think we can explicitly state that allow_copy can have any default, and that libraries must add an allow_copy keyword.
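
One way a conformance check for that rule might look, as a sketch only: `has_allow_copy_keyword` is a hypothetical helper name, and the three stub functions merely mimic the pandas, cuDF and Modin signatures quoted above rather than importing those libraries:

```python
import inspect


def has_allow_copy_keyword(func) -> bool:
    """Return True if ``func`` accepts ``allow_copy`` by keyword with some default."""
    param = inspect.signature(func).parameters.get("allow_copy")
    if param is None:
        return False
    by_keyword = param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    # Any default value is acceptable; only its presence is checked.
    return by_keyword and param.default is not inspect.Parameter.empty


# Stubs that mimic the signatures listed in this thread.
def pandas_style(df, allow_copy=True): ...
def cudf_style(df, allow_copy=False): ...
def modin_style(df): ...


results = {
    func.__name__: has_allow_copy_keyword(func)
    for func in (pandas_style, cudf_style, modin_style)
}
```

Under this check the pandas-style and cuDF-style signatures pass despite their different defaults, while the Modin-style signature without the keyword does not.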

rgommers commented 2 years ago

The summary of a discussion on this yesterday was: