Open rgommers opened 2 years ago
Other libraries that were suggested as candidates to look into: Xarray, cuDF (utilities), PyJanitor (cleaning functionality, not the pandas_flavor domain-specific parts), https://github.com/sfu-db/dataprep

PyJanitor (cleaning functionality, not the pandas_flavor code). Not repeating `DataFrame`, `Series` and `.columns`; those are used a lot.
- `utils.py`: `.iloc`, `RangeIndex`, `MultiIndex`, `.empty`, `Index`
- `functions/add_columns.py`: `.copy`, `.add_column`
- `functions/case_when.py`: `.assign`, `.mask`, `.index`, `Index`, `.nlevels`, `.ndim`, `.size`, `__len__`
- `functions/clean_names.py`: `.rename`, `.__dict__`
- `functions/coalesce.py`: `.filter`, `.bfill`, `.ffill`, `.assign`
- `functions/complete.py`: `.copy`, `.merge`, `.groupby`, `.apply`, `.droplevel`, `.loc`, `Index`, `MultiIndex`
- `functions/conditional_join.py`: `.loc`, `.index`, `.empty`, `.copy`, `RangeIndex`, `MultiIndex`, `index`, `append`, `.to_numpy`, `.dtypes`, `.items`, `.join`
- `functions/convert_date.py`: `to_datetime`, `.astype`, `.apply`
- `functions/count_cumulative_unique.py`: `.drop_duplicates`, `.assign`, `.cumsum`, `.index`, `.reindex`, `.ffill`, `.astype`
- `functions/currency_column_to_numeric.py`: `to_numeric`, `.loc`, `.assign`, `.apply`
There's a ton more - it uses a fairly large part of the pandas API surface. Even in `utils`, a lot of the code is in functions that then get tacked onto `pd.DataFrame` with `@pandas_flavor.register_dataframe_method`. It does not seem like a great target for initial support via a developer-focused API. Detailed usage data is available at https://github.com/data-apis/python-record-api/blob/master/data/api/pyjanitor.json
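For context, the registration pattern in question can be sketched with pandas' own documented accessor mechanism, which `pandas_flavor` builds on. The accessor name and method below are made up for illustration:

```python
import pandas as pd

# Sketch of the registration pattern using pandas' documented accessor
# mechanism (pd.api.extensions.register_dataframe_accessor), which
# pandas_flavor wraps. The "cleaning" accessor and its method are
# hypothetical examples, not pyjanitor code.
@pd.api.extensions.register_dataframe_accessor("cleaning")
class CleaningAccessor:
    def __init__(self, df):
        self._df = df

    def strip_column_whitespace(self):
        # hypothetical method: trim whitespace from column labels
        return self._df.rename(columns=lambda c: c.strip())

df = pd.DataFrame({"A ": [1, 2]})
print(df.cleaning.strip_column_whitespace().columns.tolist())  # ['A']
```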
Detailed usage data is also available at https://github.com/data-apis/python-record-api/blob/master/data/api/xarray.json; that data and a cursory search through the Xarray code base for "import pandas" show that it uses an even larger API surface. A decent amount of that usage is in tests, which is not actually relevant. This is one of the downsides of the automated analysis tooling: if one traces pandas API usage by running the Xarray test suite, it's hard to tell whether the public pandas API usage comes from the test files or the files under test. Pandas is still used in a lot of places though.

Note that `Index` is most commonly used, followed by `Series` and `DataFrame`; the listing below leaves them out of the results for some files.
- `testing.py`: `Index`
- `conventions.py`: `MultiIndex`, `isnull`, `.any`, `__not__`
- `convert.py`: `isnull`
- `coding/times.py`: `Timestamp`, `to_timedelta`, `to_datetime`, `__version__`, `notnull`, `isnull`, `DatetimeIndex`
- `coding/frequencies.py`: `Series`, `DatetimeIndex`, `TimedeltaIndex`, `infer_freq`
- `coding/cftimeindex.py`: `Index`, `TimedeltaIndex`
- `coding/variables.py`: `isnull`
- `core/common.py`: `Index`, `Grouper`
- `core/nputils.py`: `isnull`
- `core/merge.py`: `Series`, `DataFrame`, `Panel`, `Index`
- `core/dataarray.py`: `Series`, `DataFrame`, `MultiIndex`, `Timedelta`, `isnull`
- `core/concat.py`: `unique`
- `core/resample_cftime.py`: `Series`, `.duplicated`
- `core/pdcompat.py`: `Panel`
- `core/accessor_dt.py`: `.dt`
- `core/duck_array_ops.py`: `Timedelta`, `to_timedelta`, `.astype`
- `core/utils.py`: `.factorize`, `MultiIndex`, `isnull`
- `core/variable.py`: `Timestamp`, `MultiIndex.names`, `MultiIndex.set_names`
- `core/indexing.py`: `MultiIndex` + methods `.nlevels`/`.get_loc`/`.get_loc_level`, `CategoricalIndex`, `PeriodIndex`, `NaT`, `Timestamp`
- `core/indexes.py`: `MultiIndex` + method `from_arrays`, `CategoricalIndex` + method `remove_unused_categories`
- `core/dataset.py`: `MultiIndex`, `Categorical` + `.codes`/`.categories`
- `core/groupby.py`: `factorize`, `DateOffset` + `.loffset`, `DatetimeIndex`, `cut`, `MultiIndex`
- `core/alignment.py`: `Index` + `.union`, `.intersection`
- `core/missing.py`: `isnull`, `MultiIndex`, `Timedelta`, `DatetimeIndex`
- `core/coordinates.py`: `MultiIndex.from_product`
- `core/formatting.py`: `isnull`, `Timestamp`, `Timedelta`, `.astype`
- `plot/dataset_plot.py`: `Interval`
- `plot/plot.py`: `notnull`
There is a ton of `isinstance` usage (e.g. with the various index objects), because Xarray supports both its own container/index classes and pandas ones. Usage seems to be quite different from typical/idiomatic pandas usage, because Xarray has pretty specific needs.
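A minimal sketch of that `isinstance`-on-index-types pattern (the helper and its return strings are hypothetical, just to illustrate the dispatch; note that `MultiIndex` must be checked before `Index`, since it subclasses it):

```python
import pandas as pd

# Hypothetical helper illustrating dispatch on pandas index types,
# similar in spirit to how Xarray special-cases pd.MultiIndex.
def describe_index(idx):
    if isinstance(idx, pd.MultiIndex):
        # check the subclass first, or every MultiIndex matches Index below
        return f"multi ({idx.nlevels} levels)"
    if isinstance(idx, pd.Index):
        return "flat"
    raise TypeError(type(idx))

mi = pd.MultiIndex.from_product([["a", "b"], [1, 2]])
print(describe_index(mi))             # multi (2 levels)
print(describe_index(pd.Index([1])))  # flat
```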
https://github.com/sfu-db/dataprep doesn't seem suitable for analysis - it contains 212 files with pandas imports, a lot of them quite niche (example: a separate file for Albanian VAT number cleaning/validation).
Hey, very cool initiative — it would be great to be more agnostic to dataframe libraries.
I wanted to flag that seaborn is in the midst of a very extensive internal refactor, which means that the survey of pandas usage in the library is likely to be out of date after future releases.
But there's an upside: it's a perfect time to be revisiting how the pandas API is used in seaborn and to proactively think about working with a more general dataframe interface. I could see the ongoing work evolving in parallel with this project (hopefully in a way that's mutually beneficial).
Let me know if I can be helpful here!
Scikit-learn mostly treats a DataFrame as a "2D ndarray with column names". Only `OrdinalEncoder` and `OneHotEncoder` treat the data frame as "a collection of 1D arrays".
When scikit-learn's models start returning DataFrames, it will depend on the fact that there is a zero-copy round-trip from numpy: https://github.com/pandas-dev/pandas/issues/27211. In detail:
Scikit-learn requires that 2d ndarray -> DataFrame -> 2d ndarray not make any copies so no additional memory is allocated.
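Whether that round-trip actually avoids copies can be checked directly with `np.shares_memory`; a small sketch (the result depends on pandas version, dtype homogeneity, and copy-on-write settings, which is exactly the fragility under discussion):

```python
import numpy as np
import pandas as pd

# The round-trip scikit-learn relies on: 2D ndarray -> DataFrame -> 2D ndarray.
# Whether it is zero-copy is version- and dtype-dependent;
# np.shares_memory makes the behaviour observable on a given setup.
arr = np.arange(6, dtype=np.float64).reshape(3, 2)
df = pd.DataFrame(arr, columns=["a", "b"])
back = df.to_numpy()

print(np.shares_memory(arr, back))  # True only if no copy was made
print(np.array_equal(arr, back))    # the values always round-trip
```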
Interesting, thanks for sharing @thomasjpfan.
When scikit-learn's models start returning DataFrames, it will depend on the fact that there is a zero-copy round-trip from numpy: pandas-dev/pandas#27211.
The answers from the Pandas devs there are along the lines of what I'd expect: this isn't necessarily guaranteed in the future. That's more a "labeled array" use case, which is Xarray-like. Did anything change after that 2019 discussion @thomasjpfan, or is it more a "fingers crossed that Pandas doesn't change this"?
Scikit-learn requires that 2d ndarray -> DataFrame -> 2d ndarray not make any copies so no additional memory is allocated
I think pragmatically there is likely to always be a way for Pandas to do this; scikit-learn is probably important enough that it can even have its own method for this if needed. Conceptually it's not a nice fit for a standardized dataframe behavior, though; it only works for a subset of supported dtypes, and it's going to need support for a constructor which accepts 2-D arrays to begin with.
is it more a "fingers crossed that Pandas doesn't change this"?
It's fingers crossed. I've seen a proposal for a 2D extension array, but I think there is a lot more momentum for 1d extension arrays & a columnar store.
I want to add: there are certain models, such as `StandardScaler`, that could treat the dataframe as a collection of 1D arrays but are not implemented that way yet. Other models, such as `PCA`, will always need to concatenate the 1D arrays into a 2D array to work.
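The distinction can be illustrated with plain NumPy (hypothetical data, not scikit-learn code): standardization is separable per column, while PCA couples the columns through the covariance matrix:

```python
import numpy as np

# StandardScaler-style standardization works column by column, so a
# "collection of 1D arrays" view suffices; PCA needs the full 2D array
# because the covariance matrix mixes columns.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Column-wise: each column standardized independently of the others.
scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA-style: covariance couples columns, so they can't be processed alone.
cov = np.cov(X, rowvar=False)
print(scaled.mean(axis=0))  # ~[0. 0.]
print(cov.shape)            # (2, 2)
```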
PyJanitor (cleaning functionality, not the pandas_flavor domain-specific parts)
Looks like they only really use `rename` here, which could easily be standardised. The trickier part is this decorator, which also uses `pandas_flavor`:
`pyjanitor` adds an extra `clean_names` method to the pandas DataFrame. How would they make use of the Standard - would they add such a method to all DataFrame objects that have some implementation of the standard? Would the Standard need to require some decorator that can be used to register custom methods? Would it actually be possible for pyjanitor to then register `clean_names` as a method for all libraries, without having to list them all explicitly? Asking because I don't know - although it strikes me as unlikely.
It looks to me like there are two separate things in PyJanitor: (1) the actual cleaning functions (e.g. the `rename`-based `clean_names`), and (2) the machinery that registers those functions as DataFrame methods. (2) looks motivated only by UX reasons (I could well be wrong here, not being an active user) - dataframe users tend to like methods over functions. It seems unhealthy to me, because one library monkeypatching another library is a big no-no in library design. Any `df.new_meth(...)` could have been `new_func(df, ...)` instead, I think.
It's actually an interesting question whether (2) should be allowed through a registration mechanism, or whether it should be discouraged. I'd lean towards the latter, but then again I'm coming from a domain where a functional programming style is preferred over an object-oriented one. If dataframe library authors prefer the former, then a well-defined extension mechanism seems useful. Even for PyJanitor + Pandas only.
OK true, their methods do work as functions too:

```python
In [2]: from janitor.functions.clean_names import clean_names

In [3]: df = pd.DataFrame({'A ': [1, 2, 3]})

In [4]: df
Out[4]:
   A
0  1
1  2
2  3

In [5]: clean_names(df)
Out[5]:
   a_
0   1
1   2
2   3
```
So, perhaps that's the part which the standard can target. It might be worthwhile to try taking a handful of functions from them, say:

- `clean_names`
- `drop_constant_columns`
- `min_max_scale`

Then try implementing the Standard for each DataFrame library, seeing if it's sufficient, and whether this would let pyjanitor "just work" on all of them if it was rewritten to use the standard API.
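As a rough sketch of what such a rewrite could target, here is a `clean_names` that only needs "list the column names" and "rename columns" (pandas-only; note that even this tiny surface differs between libraries today, e.g. pandas wants `df.rename(columns=mapping)` while polars wants `df.rename(mapping)`, which is exactly what a standard API would smooth over):

```python
import pandas as pd

# Hypothetical sketch of a clean_names built on a minimal API surface:
# read df.columns, build a mapping, rename. Not pyjanitor's actual code.
def clean_names(df):
    mapping = {c: c.strip().lower().replace(" ", "_") for c in df.columns}
    return df.rename(columns=mapping)

df = pd.DataFrame({"A ": [1], "Total Count": [2]})
print(clean_names(df).columns.tolist())  # ['a', 'total_count']
```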
If dataframe library authors prefer the former, then a well-defined extension mechanism seems useful. Even for PyJanitor + Pandas only.
FWIW, for pandas itself this already exists (https://pandas.pydata.org/docs/dev/development/extending.html#registering-custom-accessors), and this is also what pyjanitor / pandas_flavor use under the hood (`pandas_flavor` adds some convenience layer on top of it).
Whether this would also be useful for a DataFrame standard is of course a different question. I think if our goal is to provide a developer-oriented standard API, this is much less needed.
Other tools which have been mentioned as potential targets:
This one would be a good candidate, namely because they already support both pandas and polars: https://github.com/Kanaries/pygwalker
Well this is encouraging:
Now, all pandas-specific logic is isolated to specific modules, where support for additional non-pandas-compliant schema specifications and their associated backends can be implemented either as 1st-party-maintained libraries (see issues for supporting https://github.com/unionai-oss/pandera/issues/1064 and https://github.com/unionai-oss/pandera/issues/1105) or 3rd party libraries.
altair have added support for polars by using the interchange protocol: https://github.com/altair-viz/altair. pyarrow is required as a dependency for this to work, though - with the standard, they could potentially support polars (and many others) without requiring extra deps? One to look into.
EDIT: I don't think altair is a good candidate, see #133
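For reference, the interchange-protocol path mentioned above can be sketched with pandas' documented `pd.api.interchange.from_dataframe` (available since pandas 1.5); no pyarrow is needed for the pandas-to-pandas case:

```python
import pandas as pd

# Sketch of the interchange protocol: any object exposing __dataframe__
# can be consumed by pd.api.interchange.from_dataframe. Here the producer
# happens to be pandas itself, just to show the round-trip.
df = pd.DataFrame({"x": [1, 2]})
interchange_obj = df.__dataframe__()
roundtrip = pd.api.interchange.from_dataframe(interchange_obj)
print(roundtrip["x"].tolist())  # [1, 2]
```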
Dropping Dask for now, as they've said this wouldn't solve an actual pain-point of theirs
Anyway, https://github.com/feature-engine/feature_engine looks like a good candidate, and exactly the kind of library where this might be useful!
Here's a really good one - they literally have

```python
if isinstance(self.dataframe, pl.DataFrame):
    # polars-specific logic
elif isinstance(self.dataframe, pd.DataFrame):
    # pandas-specific logic
else:
    raise
```

So yeah, really solid candidate here.
another one, where they've already said that their objective is to support multiple dataframe backends: https://github.com/skrub-data/skrub
others:
hi all! `pandera` author here 👋, just wanted to drop a note here to say we're going to start investing resources in pandera-polars support: https://github.com/unionai-oss/pandera/issues/1064. Not sure how far along this project is, but would love to get some tips on how to design the polars validation backend as described in this mini-roadmap: https://github.com/unionai-oss/pandera/issues/1064#issuecomment-1584655803.

Was planning on forging ahead with polars-specific implementations for various things that pandera does during the validation pipeline (see anywhere there's a `check_obj` variable in the pandas backend as an example). If there's anything we should keep in mind as we build it out, please add comments to that issue ^^ - we'd really appreciate it!
In other issues we find some detailed analyses of how the pandas API is used today, e.g. gh-3 (on Kaggle notebooks) and https://github.com/data-apis/python-record-api/tree/master/data/api (for a set of well-known packages). That data is either not relevant for a developer-focused API, though, or is so detailed that it's hard to get a good feel for what's important. So I thought it'd be useful to revisit the topic. I used https://libraries.io/pypi/pandas and looked at some of the top repos that declare a dependency on `pandas`. Top 10 listed:
Seaborn

Perhaps the most interesting pandas usage. It's a hard dependency, is used a fair amount and for more than just data access; however, it all still seems fairly standard and common, so it may be a reasonable target to make work with multiple libraries. Uses a lot of `isinstance` checks (on `pd.DataFrame`, `pd.Series`).

- `seaborn/_core.py`: `Series`, `to_numeric`
- `seaborn/matrix.py`: `DataFrame`, `isnull`, `.index.equals`, `.column.equals`
- `seaborn/utils.py`: `DataFrame`, `Categorical`, `notnull`
- `seaborn/regression.py`: only `pd.notnull`
- `seaborn/distributions.py`: `.values`, `.copy`, `.iloc`, `.loc`, `.reset_index`, `.index`, `set_index`, `MultiIndex.from_arrays`, `Index`, `Series`, `concat`, `merge`
- `seaborn/relational.py`: `DataFrame`, `merge`, `.rename`
- `seaborn/categorical.py`: `DataFrame`, `iteritems`, `Series`, `notnull`, `option_context`, `isnull`, `groupby`, `get_group`
- `seaborn/_statistics.py`: only `Series`
Folium

Just a single non-test usage, in pd.py:

PyJanitor

Interesting/unusual common pattern, which extends `pd.DataFrame` through pandas_flavor with either accessors or methods. E.g. from [janitor/biology.py](https://github.com/pyjanitor-devs/pyjanitor/blob/a6832d47d2cc86b0aef101bfbdf03404bba01f3e/janitor/biology.py):

Statsmodels

A huge amount of usage, using a large API surface in a messy way - not easy to do anything with or draw conclusions from.

NetworkX

Mostly just conversions to support pandas dataframes as input/output values. E.g., from convert.py and convert_matrix.py:

And using the `.drop` method in group.py:

Perspective

A multi-language (streaming) viz and analytics library. The Python version uses pandas in `core/pd.py`. It uses a small but nontrivial amount of the API, including `MultiIndex`, `CategoricalDtype`, and time series functionality.

Scikit-learn

TODO: the usage of pandas in scikit-learn is very much in flux, and more support for "dataframe in, dataframe out" is being added. So it did not seem to make much sense to just look at the code; rather, it makes sense to have a chat with the people doing the work there.

Matplotlib

Added because it comes up a lot. Matplotlib uses just a "dictionary of array-likes" approach, with no direct dependence on pandas. So it will work today with other dataframe libraries as well, as long as their columns can convert to a numpy array.
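That "dictionary of array-likes" contract can be sketched without matplotlib itself; the `resolve` helper below is hypothetical, mimicking what the `data` kwarg lookup requires of its argument:

```python
import numpy as np

# Sketch of the contract matplotlib's data kwarg relies on:
# ax.plot("x", "y", data=obj) only needs obj[key] to yield something
# convertible to a NumPy array. A plain dict and most dataframes
# (via df["col"]) both satisfy it.
def resolve(data, key):
    return np.asarray(data[key])

data = {"x": [0, 1, 2], "y": [0, 1, 4]}
print(resolve(data, "y"))  # [0 1 4]
```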