dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

[dagster-pandas] Evaluate feather as default pandas df serdes strategy #3372

Open catherinewu opened 3 years ago

catherinewu commented 3 years ago

Use Case

The current default pandas DataFrame serdes strategy is pickle. However, pickle handles large DataFrames poorly (say, 10M+ rows and 30+ columns of datetimes, strings, floats, etc.). We should have smooth out-of-the-box support for DataFrames of that size.

df.to_feather / pd.read_feather are built to handle large pandas DataFrames, and have better default datetime handling than df.to_csv / pd.read_csv. However, feather is a binary format, which makes it harder for users to visually inspect intermediates.
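To make the datetime point concrete, here is a minimal sketch that round-trips a small DataFrame through CSV, pickle, and feather and records the dtypes that come back. The helper name `roundtrip_dtypes` and the sample columns are illustrative only, not part of dagster-pandas. Feather support in pandas depends on the optional pyarrow package, so the sketch records `None` for feather when pyarrow is not installed.

```python
import os
import tempfile

import pandas as pd


def roundtrip_dtypes(df: pd.DataFrame) -> dict:
    """Write df in each format, read it back, and report the resulting dtypes.

    Hypothetical helper for illustration; not part of dagster-pandas.
    """
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        # CSV is text: with default arguments, read_csv does not restore datetimes.
        csv_path = os.path.join(tmp, "df.csv")
        df.to_csv(csv_path, index=False)
        results["csv"] = pd.read_csv(csv_path).dtypes.astype(str).to_dict()

        # Pickle (the current default) preserves dtypes.
        pickle_path = os.path.join(tmp, "df.pkl")
        df.to_pickle(pickle_path)
        results["pickle"] = pd.read_pickle(pickle_path).dtypes.astype(str).to_dict()

        # Feather stores typed columns; it needs the optional pyarrow dependency.
        try:
            feather_path = os.path.join(tmp, "df.feather")
            df.to_feather(feather_path)
            results["feather"] = (
                pd.read_feather(feather_path).dtypes.astype(str).to_dict()
            )
        except ImportError:
            results["feather"] = None
    return results


df = pd.DataFrame(
    {
        "ts": pd.to_datetime(["2020-01-01", "2020-06-15"]),
        "name": ["a", "b"],
        "value": [1.5, 2.5],
    }
)
dtypes = roundtrip_dtypes(df)
for fmt, cols in dtypes.items():
    print(fmt, cols)
```

With default arguments, the `ts` column read back from CSV is no longer a datetime column (`pd.read_csv` needs `parse_dates` to restore it), while pickle and feather keep it as `datetime64`. This sketch says nothing about speed or memory at 10M+ rows, which would need a separate benchmark.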

We should consider using feather as our default pandas serdes strategy.

Ideas of Implementation

Additional Info


Message from the maintainers:

Excited about this feature? Give it a :thumbsup:. We factor engagement into prioritization.

sryza commented 3 years ago

Something to consider here is that, on the path we're currently taking with object managers, there will be no such thing as a "default serdes strategy for pandas". That is, the object manager is responsible for deciding how serialization happens, not the dagster type.

One path here could be to just provide an out-of-the-box object store that uses feather.
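As a rough illustration of that idea, the sketch below uses plain Python and pandas only. It does not use Dagster's object manager interface; the class `DataFrameStore`, its methods `set_object` / `get_object`, and the factories `feather_store` / `pickle_store` are all hypothetical names. The point it shows is that the store, not the DataFrame type, picks the on-disk format.

```python
import os
import tempfile

import pandas as pd


class DataFrameStore:
    """Hypothetical store that owns the serialization format for DataFrames.

    Not Dagster's object manager API; a plain-Python sketch of the idea.
    """

    def __init__(self, base_dir, write, read, suffix):
        self.base_dir = base_dir
        self.write = write    # callable(df, path): persist the DataFrame
        self.read = read      # callable(path) -> DataFrame
        self.suffix = suffix  # file extension for this format

    def _path(self, key):
        return os.path.join(self.base_dir, key + self.suffix)

    def set_object(self, key, df):
        """Persist df under key and return the path that was written."""
        path = self._path(key)
        self.write(df, path)
        return path

    def get_object(self, key):
        """Load the DataFrame previously stored under key."""
        return self.read(self._path(key))


def feather_store(base_dir):
    # Feather-backed store; requires the optional pyarrow dependency.
    return DataFrameStore(
        base_dir,
        write=lambda df, path: df.to_feather(path),
        read=pd.read_feather,
        suffix=".feather",
    )


def pickle_store(base_dir):
    # Pickle-backed store, mirroring the current default strategy.
    return DataFrameStore(
        base_dir,
        write=lambda df, path: df.to_pickle(path),
        read=pd.read_pickle,
        suffix=".pkl",
    )


frame = pd.DataFrame({"ts": pd.to_datetime(["2020-01-01"]), "value": [1.5]})
```

Swapping `pickle_store(tmp_dir)` for `feather_store(tmp_dir)` changes the format without touching the code that produces or consumes the DataFrame, which is the separation described above.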