data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
https://data-apis.org/dataframe-api/draft/index.html
MIT License

Relevant dataframe libraries #8

Open rgommers opened 4 years ago

rgommers commented 4 years ago

This issue is meant to collect libraries that we should be aware of and perhaps take into account (data on how their API looks, impact of choices on those libraries, etc.).

See https://github.com/pydata-apis/array-api/issues/3 for relevant array libraries.

TomAugspurger commented 4 years ago

Added mars and staticframe.

jack-pappas commented 4 years ago

Added dexplo and datatable.

dexplo is an interesting one because it's a minimalist design and already adheres to some of the API requirements we've discussed, such as requiring column labels to be strings and unique within a given DataFrame.
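To make the column-label requirement concrete, here is a minimal sketch of such a check. The function name `validate_column_labels` is hypothetical and is not taken from dexplo or from any draft of the standard; it only illustrates the "strings, and unique within a DataFrame" rule mentioned above.

```python
from collections import Counter
from typing import Iterable, List


def validate_column_labels(labels: Iterable[object]) -> List[str]:
    """Return the labels as a list if all are strings and none repeats.

    Hypothetical helper: illustrates the rule, not any library's actual code.
    """
    labels = list(labels)

    # Rule 1: every column label must be a string.
    non_strings = [label for label in labels if not isinstance(label, str)]
    if non_strings:
        raise TypeError(f"column labels must be strings, got: {non_strings!r}")

    # Rule 2: labels must be unique within a single DataFrame.
    duplicates = [label for label, count in Counter(labels).items() if count > 1]
    if duplicates:
        raise ValueError(f"column labels must be unique, duplicated: {duplicates!r}")

    return labels
```

Under these assumptions, `validate_column_labels(["a", "b"])` passes, while `["a", "a"]` raises `ValueError` and `["a", 1]` raises `TypeError`.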

datatable is aiming to be a Python implementation of the R data.table library.

datapythonista commented 4 years ago

@rgommers, I think the libraries whose implemented methods would be worth comparing are:

rgommers commented 4 years ago

@datapythonista thanks. Could you add some rationale? Why are Mars, dexplo and Eland interesting and some of the other listed libraries not? I think they're all quite small, and at least dexplo and eland seem to be very young with almost no usage and few contributors. So I'd think the main focus should be on the first six libraries in your list?

datapythonista commented 4 years ago

That's a good point. What I had in mind was to have a comparison of what developers of libraries that copy the pandas API implemented. So, I excluded the ones that don't aim to have a pandas-like API, and didn't consider their popularity.

Not sure if the outcome will tell us more about how important the developers considered a feature to be, or how easy it was to implement. But since I expect all the libraries in the list to use the same naming as pandas, I think the comparison should be easy to generate.

For Eland, since it's backed by Elastic, there are some things that I would expect to be missing. If we consider that the dataframe API could be used for Ibis-like projects, backed by databases, then there could be some valuable information there.

But in any case, it's not a problem at all to leave the last three out. I see value in having them if it's not too much effort to extract their APIs, but the others alone are surely good enough.
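One way the comparison described above could be generated is by collecting public attribute names with `dir()` and checking them against a reference class. This is only a sketch: `public_api`, `coverage`, and the three toy classes are hypothetical stand-ins so the example runs without third-party packages, and matching by name alone says nothing about whether signatures or behaviour agree.

```python
from typing import Dict, Iterable, Set


def public_api(cls: type) -> Set[str]:
    """Return the names of public attributes (methods, properties) of a class."""
    return {name for name in dir(cls) if not name.startswith("_")}


def coverage(reference: type, candidates: Iterable[type]) -> Dict[str, Dict[str, bool]]:
    """Map each public name on `reference` to whether each candidate defines it."""
    ref_names = sorted(public_api(reference))
    apis = {c.__name__: public_api(c) for c in candidates}
    return {name: {lib: name in api for lib, api in apis.items()} for name in ref_names}


# Hypothetical stand-ins for a reference DataFrame class and two libraries
# that copy its API; real classes would be passed in the same way.
class ReferenceFrame:
    def head(self): ...
    def merge(self): ...
    def pivot(self): ...


class LibA:
    def head(self): ...
    def merge(self): ...


class LibB:
    def head(self): ...
    def to_backend(self): ...  # extra method, not part of the reference API


table = coverage(ReferenceFrame, [LibA, LibB])
for name, row in table.items():
    print(name, row)
```

With the relevant packages installed, the same function could presumably be called with `pandas.DataFrame` as the reference and the other libraries' DataFrame classes as candidates. Libraries that forward attribute access dynamically may not expose everything through `dir()`, so the result would be a starting point rather than a definitive feature matrix.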

SimonHeybrock commented 2 years ago

scipp, which is conceptually most similar to xarray (with some extra features).

kkraus14 commented 2 years ago

Polars: https://github.com/pola-rs/polars