Open rgommers opened 2 years ago
Other libraries that were suggested as candidates to look into: Xarray, cuDF (utilities), PyJanitor (cleaning functionality, not the pandas_flavor domain-specific parts), https://github.com/sfu-db/dataprep

PyJanitor (cleaning functionality, not the pandas_flavor code). Not repeating `DataFrame`, `Series` and `.columns`; those are used a lot.
- `utils.py`: `.iloc`, `RangeIndex`, `MultiIndex`, `.empty`, `Index`
- `functions/add_columns.py`: `.copy`, `.add_column`
- `functions/case_when.py`: `.assign`, `.mask`, `.index`, `Index`, `.nlevels`, `.ndim`, `.size`, `__len__`
- `functions/clean_names.py`: `.rename`, `.__dict__`
- `functions/coalesce.py`: `.filter`, `.bfill`, `.ffill`, `.assign`
- `functions/complete.py`: `.copy`, `.merge`, `.groupby`, `.apply`, `.droplevel`, `.loc`, `Index`, `MultiIndex`
- `functions/conditional_join.py`: `.loc`, `.index`, `.empty`, `.copy`, `RangeIndex`, `MultiIndex`, `index`, `append`, `.to_numpy`, `.dtypes`, `.items`, `.join`
- `functions/convert_date.py`: `to_datetime`, `.astype`, `.apply`
- `functions/count_cumulative_unique.py`: `.drop_duplicates`, `.assign`, `.cumsum`, `.index`, `.reindex`, `.ffill`, `.astype`
- `functions/currency_column_to_numeric.py`: `to_numeric`, `.loc`, `.assign`, `.apply`
There's a ton more - it uses a fairly large part of the pandas API surface. Even in `utils`, a lot of the code is in functions that then get tacked onto `pd.DataFrame` with `@pandas_flavor.register_dataframe_method`. It does not seem like a great target for initial support via a developer-focused API. Detailed usage data is available at https://github.com/data-apis/python-record-api/blob/master/data/api/pyjanitor.json
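For context, the registration pattern in question can be sketched with pandas' own documented accessor mechanism, which `pandas_flavor` builds on. The accessor name and method below are made up for illustration:

```python
import pandas as pd

# Sketch of the registration pattern using pandas' documented accessor
# mechanism (pd.api.extensions.register_dataframe_accessor), which
# pandas_flavor wraps. The "cleaning" accessor and its method are
# hypothetical examples, not pyjanitor code.
@pd.api.extensions.register_dataframe_accessor("cleaning")
class CleaningAccessor:
    def __init__(self, df):
        self._df = df

    def strip_column_whitespace(self):
        # hypothetical method: trim whitespace from column labels
        return self._df.rename(columns=lambda c: c.strip())

df = pd.DataFrame({"A ": [1, 2]})
print(df.cleaning.strip_column_whitespace().columns.tolist())  # ['A']
```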
Detailed usage data is also available at https://github.com/data-apis/python-record-api/blob/master/data/api/xarray.json; that data and a cursory search through the Xarray code base for "import pandas" show that it uses an even larger API surface. A decent amount of that usage is in tests, which is not actually relevant. This is one of the downsides of the automated analysis tooling: if one traces pandas API usage by running the Xarray test suite, it's hard to tell whether the public pandas API usage comes from the test files or the files under test. Pandas is still used in a lot of places though.

Note that `Index` is most commonly used, followed by `Series` and `DataFrame`; the listing below leaves them out of the results for some files.
- `testing.py`: `Index`
- `conventions.py`: `MultiIndex`, `isnull`, `.any`, `__not__`
- `convert.py`: `isnull`
- `coding/times.py`: `Timestamp`, `to_timedelta`, `to_datetime`, `__version__`, `notnull`, `isnull`, `DatetimeIndex`
- `coding/frequencies.py`: `Series`, `DatetimeIndex`, `TimedeltaIndex`, `infer_freq`
- `coding/cftimeindex.py`: `Index`, `TimedeltaIndex`
- `coding/variables.py`: `isnull`
- `core/common.py`: `Index`, `Grouper`
- `core/nputils.py`: `isnull`
- `core/merge.py`: `Series`, `DataFrame`, `Panel`, `Index`
- `core/dataarray.py`: `Series`, `DataFrame`, `MultiIndex`, `Timedelta`, `isnull`
- `core/concat.py`: `unique`
- `core/resample_cftime.py`: `Series`, `.duplicated`
- `core/pdcompat.py`: `Panel`
- `core/accessor_dt.py`: `.dt`
- `core/duck_array_ops.py`: `Timedelta`, `to_timedelta`, `.astype`
- `core/utils.py`: `.factorize`, `MultiIndex`, `isnull`
- `core/variable.py`: `Timestamp`, `MultiIndex.names`, `MultiIndex.set_names`
- `core/indexing.py`: `MultiIndex` + methods `.nlevels`/`.get_loc`/`.get_loc_level`, `CategoricalIndex`, `PeriodIndex`, `NaT`, `Timestamp`
- `core/indexes.py`: `MultiIndex` + method `from_arrays`, `CategoricalIndex` + method `remove_unused_categories`
- `core/dataset.py`: `MultiIndex`, `Categorical` + `.codes`/`.categories`
- `core/groupby.py`: `factorize`, `DateOffset` + `.loffset`, `DatetimeIndex`, `cut`, `MultiIndex`
- `core/alignment.py`: `Index` + `.union`, `.intersection`
- `core/missing.py`: `isnull`, `MultiIndex`, `Timedelta`, `DatetimeIndex`
- `core/coordinates.py`: `MultiIndex.from_product`
- `core/formatting.py`: `isnull`, `Timestamp`, `Timedelta`, `.astype`
- `plot/dataset_plot.py`: `Interval`
- `plot/plot.py`: `notnull`
There is a ton of `isinstance` usage (e.g. with the various index objects), because Xarray supports both its own container/index classes and pandas ones. Usage seems to be quite different from typical/idiomatic pandas usage, because Xarray has pretty specific needs.
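A minimal sketch of that `isinstance`-on-index-types pattern (the helper and its return strings are hypothetical, just to illustrate the dispatch; note that `MultiIndex` must be checked before `Index`, since it subclasses it):

```python
import pandas as pd

# Hypothetical helper illustrating dispatch on pandas index types,
# similar in spirit to how Xarray special-cases pd.MultiIndex.
def describe_index(idx):
    if isinstance(idx, pd.MultiIndex):
        # check the subclass first, or every MultiIndex matches Index below
        return f"multi ({idx.nlevels} levels)"
    if isinstance(idx, pd.Index):
        return "flat"
    raise TypeError(type(idx))

mi = pd.MultiIndex.from_product([["a", "b"], [1, 2]])
print(describe_index(mi))             # multi (2 levels)
print(describe_index(pd.Index([1])))  # flat
```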
https://github.com/sfu-db/dataprep doesn't seem suitable for analysis - it contains 212 files with pandas imports, a lot of them quite niche (example: a separate file for Albanian VAT number cleaning/validation).
Hey, very cool initiative — it would be great to be more agnostic to dataframe libraries.
I wanted to flag that seaborn is in the midst of a very extensive internal refactor, which means that the survey of pandas usage in the library is likely to be out of date after future releases.
But there's an upside: it's a perfect time to be revisiting how the pandas API is used in seaborn and to proactively think about working with a more general dataframe interface. I could see the ongoing work evolving in parallel with this project (hopefully in a way that's mutually beneficial).
Let me know if I can be helpful here!
Scikit-learn mostly treats a DataFrame as a "2D ndarray with column names". Only `OrdinalEncoder` and `OneHotEncoder` treat the data frame as "a collection of 1D arrays".
When scikit-learn's models start returning DataFrames, it will depend on the fact that there is a zero-copy round-trip from numpy: https://github.com/pandas-dev/pandas/issues/27211. In detail:
Scikit-learn requires that 2d ndarray -> DataFrame -> 2d ndarray not make any copies so no additional memory is allocated.
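Whether that round-trip actually avoids copies can be checked directly with `np.shares_memory`; a small sketch (the result depends on pandas version, dtype homogeneity, and copy-on-write settings, which is exactly the fragility under discussion):

```python
import numpy as np
import pandas as pd

# The round-trip scikit-learn relies on: 2D ndarray -> DataFrame -> 2D ndarray.
# Whether it is zero-copy is version- and dtype-dependent;
# np.shares_memory makes the behaviour observable on a given setup.
arr = np.arange(6, dtype=np.float64).reshape(3, 2)
df = pd.DataFrame(arr, columns=["a", "b"])
back = df.to_numpy()

print(np.shares_memory(arr, back))  # True only if no copy was made
print(np.array_equal(arr, back))    # the values always round-trip
```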
Interesting, thanks for sharing @thomasjpfan.
When scikit-learn's models start returning DataFrames, it will depend on the fact that there is a zero-copy round-trip from numpy: pandas-dev/pandas#27211.
The answers from the Pandas devs there are along the lines of what I'd expect: this isn't necessarily guaranteed in the future. That's more a "labeled array" use case, which is Xarray-like. Did anything change after that 2019 discussion @thomasjpfan, or is it more a "fingers crossed that Pandas doesn't change this"?
Scikit-learn requires that 2d ndarray -> DataFrame -> 2d ndarray not make any copies so no additional memory is allocated
I think pragmatically there is likely to always be a way for Pandas to do this; scikit-learn is probably important enough that it can even have its own method for this if needed. Conceptually it's not a nice fit for a standardized dataframe behavior, though; it only works for a subset of supported dtypes, and it's going to need support for a constructor which accepts 2-D arrays to begin with.
is it more a "fingers crossed that Pandas doesn't change this"?
It's fingers crossed. I've seen a proposal for a 2D extension array, but I think there is a lot more momentum for 1d extension arrays & a columnar store.
I want to add: there are certain models, such as `StandardScaler`, that could treat the dataframe as a collection of 1D arrays but are not implemented that way yet. Other models, such as `PCA`, will always need to concatenate the 1D arrays into a 2D array to work.
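The distinction can be illustrated with plain NumPy (hypothetical data, not scikit-learn code): standardization is separable per column, while PCA couples the columns through the covariance matrix:

```python
import numpy as np

# StandardScaler-style standardization works column by column, so a
# "collection of 1D arrays" view suffices; PCA needs the full 2D array
# because the covariance matrix mixes columns.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Column-wise: each column standardized independently of the others.
scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA-style: covariance couples columns, so they can't be processed alone.
cov = np.cov(X, rowvar=False)
print(scaled.mean(axis=0))  # ~[0. 0.]
print(cov.shape)            # (2, 2)
```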
PyJanitor (cleaning functionality, not the pandas_flavor domain-specific parts)
Looks like they only really use `rename` here, which could easily be standardised. The trickier part is this decorator, which also uses `pandas_flavor`:
`pyjanitor` adds an extra `clean_names` method to the pandas DataFrame. How would they make use of the Standard - would they add such a method to all DataFrame objects that have some implementation of the standard? Would the Standard need to require some decorator that can be used to register custom methods? Would it actually be possible for pyjanitor to then register `clean_names` as a method for all libraries, without having to list them all explicitly? Asking because I don't know - although it strikes me as unlikely.
It looks to me like there are two separate things in PyJanitor: (1) the actual cleaning functions (e.g. the `rename`-based `clean_names`), and (2) the machinery that registers those functions as DataFrame methods. (2) looks motivated only by UX reasons (I could well be wrong here, not being an active user) - dataframe users tend to like methods over functions. It seems unhealthy to me, because one library monkeypatching another library is a big no-no in library design. Any `df.new_meth(...)` could have been `new_func(df, ...)` instead, I think.
It's actually an interesting question whether (2) should be allowed through a registration mechanism, or whether it should be discouraged. I'd lean towards the latter, but then again I'm coming from a domain where a functional programming style is preferred over an object-oriented one. If dataframe library authors prefer the former, then a well-defined extension mechanism seems useful. Even for PyJanitor + Pandas only.
OK true, their methods do work as functions too:

```python
In [2]: from janitor.functions.clean_names import clean_names

In [3]: df = pd.DataFrame({'A ': [1, 2, 3]})

In [4]: df
Out[4]:
   A
0  1
1  2
2  3

In [5]: clean_names(df)
Out[5]:
   a_
0   1
1   2
2   3
```
So, perhaps that's the part which the standard can target. It might be worthwhile to try taking a handful of functions from them, say:

- `clean_names`
- `drop_constant_columns`
- `min_max_scale`

Then try implementing the Standard for each DataFrame library, seeing if it's sufficient, and whether this would let pyjanitor "just work" on all of them if it was rewritten to use the standard API.
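As a rough sketch of what such a rewrite could target, here is a `clean_names` that only needs "list the column names" and "rename columns" (pandas-only; note that even this tiny surface differs between libraries today, e.g. pandas wants `df.rename(columns=mapping)` while polars wants `df.rename(mapping)`, which is exactly what a standard API would smooth over):

```python
import pandas as pd

# Hypothetical sketch of a clean_names built on a minimal API surface:
# read df.columns, build a mapping, rename. Not pyjanitor's actual code.
def clean_names(df):
    mapping = {c: c.strip().lower().replace(" ", "_") for c in df.columns}
    return df.rename(columns=mapping)

df = pd.DataFrame({"A ": [1], "Total Count": [2]})
print(clean_names(df).columns.tolist())  # ['a', 'total_count']
```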
If dataframe library authors prefer the former, then a well-defined extension mechanism seems useful. Even for PyJanitor + Pandas only.
FWIW, for pandas itself this already exists (https://pandas.pydata.org/docs/dev/development/extending.html#registering-custom-accessors), and this is also what pyjanitor / pandas_flavor use under the hood (`pandas_flavor` adds some convenience layer on top of it).
Whether this would also be useful for a DataFrame standard is of course a different question. I think if our goal is to provide a developer-oriented standard API, this is much less needed.
Other tools which have been mentioned as potential targets:
This one would be a good candidate, namely because they already support both pandas and polars: https://github.com/Kanaries/pygwalker
Well this is encouraging:
Now, all pandas-specific logic is isolated to specific modules, where support for additional non-pandas-compliant schema specifications and their associated backends can be implemented either as 1st-party-maintained libraries (see issues for supporting https://github.com/unionai-oss/pandera/issues/1064 and https://github.com/unionai-oss/pandera/issues/1105) or 3rd party libraries.
altair have added support for polars by using the interchange protocol: https://github.com/altair-viz/altair. pyarrow is required as a dependency for this to work, though - with the standard, they could potentially support polars (and many others) without requiring extra deps? One to look into.
EDIT: I don't think altair is a good candidate, see #133
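For reference, the interchange-protocol path mentioned above can be sketched with pandas' documented `pd.api.interchange.from_dataframe` (available since pandas 1.5); no pyarrow is needed for the pandas-to-pandas case:

```python
import pandas as pd

# Sketch of the interchange protocol: any object exposing __dataframe__
# can be consumed by pd.api.interchange.from_dataframe. Here the producer
# happens to be pandas itself, just to show the round-trip.
df = pd.DataFrame({"x": [1, 2]})
interchange_obj = df.__dataframe__()
roundtrip = pd.api.interchange.from_dataframe(interchange_obj)
print(roundtrip["x"].tolist())  # [1, 2]
```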
Dropping Dask for now, as they've said this wouldn't solve an actual pain-point of theirs
Anyway, https://github.com/feature-engine/feature_engine looks like a good candidate, and exactly the kind of library where this might be useful!
Here's a really good one - they literally have

```python
if isinstance(self.dataframe, pl.DataFrame):
    # polars-specific logic
elif isinstance(self.dataframe, pd.DataFrame):
    # pandas-specific logic
else:
    raise
```

So yeah, really solid candidate here.
another one, where they've already said that their objective is to support multiple dataframe backends: https://github.com/skrub-data/skrub
others:
hi all! `pandera` author here 👋, just wanted to drop a note here to say we're going to start investing resources in pandera-polars support: https://github.com/unionai-oss/pandera/issues/1064. Not sure how far along this project is, but would love to get some tips on how to design the polars validation backend as described in this mini-roadmap: https://github.com/unionai-oss/pandera/issues/1064#issuecomment-1584655803.

Was planning on forging ahead with polars-specific implementations for various things that pandera does during the validation pipeline (see anywhere there's a `check_obj` variable in the pandas backend as an example). If there's anything we should keep in mind as we build it out, please add comments to that issue ^^ - we'd really appreciate it!
In other issues we find some detailed analyses of how the pandas API is used today, e.g. gh-3 (on Kaggle notebooks) and https://github.com/data-apis/python-record-api/tree/master/data/api (for a set of well-known packages). That data is either not relevant for a developer-focused API, though, or is so detailed that it's hard to get a good feel for what's important. So I thought it'd be useful to revisit the topic. I used https://libraries.io/pypi/pandas and looked at some of the top repos that declare a dependency on `pandas`. Top 10 listed:
Seaborn

Perhaps the most interesting pandas usage. It's a hard dependency, is used a fair amount and for more than just data access; however, it all still seems fairly standard and common, so it may be a reasonable target to make work with multiple libraries. Uses a lot of `isinstance` checks (on `pd.DataFrame`, `pd.Series`).

- `seaborn/_core.py`: `Series`, `to_numeric`
- `seaborn/matrix.py`: `DataFrame`, `isnull`, `.index.equals`, `.column.equals`
- `seaborn/utils.py`: `DataFrame`, `Categorical`, `notnull`
- `seaborn/regression.py`: only `pd.notnull`
- `seaborn/distributions.py`: `.values`, `.copy`, `.iloc`, `.loc`, `.reset_index`, `.index`, `set_index`, `MultiIndex.from_arrays`, `Index`, `Series`, `concat`, `merge`
- `seaborn/relational.py`: `DataFrame`, `merge`, `.rename`
- `seaborn/categorical.py`: `DataFrame`, `iteritems`, `Series`, `notnull`, `option_context`, `isnull`, `groupby`, `get_group`
- `seaborn/_statistics.py`: only `Series`
Folium

Just a single non-test usage, in pd.py:

PyJanitor

Interesting/unusual common pattern, which extends `pd.DataFrame` through pandas_flavor with either accessors or methods. E.g. from [janitor/biology.py](https://github.com/pyjanitor-devs/pyjanitor/blob/a6832d47d2cc86b0aef101bfbdf03404bba01f3e/janitor/biology.py):

Statsmodels

A huge amount of usage, using a large API surface in a messy way - not easy to do anything with or draw conclusions from.

NetworkX

Mostly just conversions to support pandas dataframes as input/output values. E.g., from convert.py and convert_matrix.py:

And using the `.drop` method in group.py:

Perspective

A multi-language (streaming) viz and analytics library. The Python version uses pandas in `core/pd.py`. It uses a small but nontrivial amount of the API, including `MultiIndex`, `CategoricalDtype`, and time series functionality.

Scikit-learn

TODO: the usage of pandas in scikit-learn is very much in flux, and more support for "dataframe in, dataframe out" is being added. So it did not seem to make much sense to just look at the code; rather, it makes sense to have a chat with the people doing the work there.

Matplotlib

Added because it comes up a lot. Matplotlib uses just a "dictionary of array-likes" approach, with no direct dependence on pandas. So it will work today with other dataframe libraries as well, as long as their columns can convert to a numpy array.
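That "dictionary of array-likes" contract can be sketched without matplotlib itself; the `resolve` helper below is hypothetical, mimicking what the `data` kwarg lookup requires of its argument:

```python
import numpy as np

# Sketch of the contract matplotlib's data kwarg relies on:
# ax.plot("x", "y", data=obj) only needs obj[key] to yield something
# convertible to a NumPy array. A plain dict and most dataframes
# (via df["col"]) both satisfy it.
def resolve(data, key):
    return np.asarray(data[key])

data = {"x": [0, 1, 2], "y": [0, 1, 4]}
print(resolve(data, "y"))  # [0 1 4]
```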