[dagster-pandas] Evaluate feather as default pandas df serdes strategy

dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.

Apache License 2.0

Use Case

The current default serdes strategy for pandas DataFrames is pickle. However, pickle handles large DataFrames poorly (say, 10M+ rows and 30+ columns of datetimes, strings, floats, etc.). We should offer smooth out-of-the-box support for DataFrames of that size.

df.to_feather and pd.read_feather are built to handle large pandas DataFrames, and they handle datetimes better by default than df.to_csv and pd.read_csv. However, Feather is a binary format, which makes it harder for users to visually inspect intermediates.

We should consider using feather as our default pandas serdes strategy.

Ideas of Implementation

Additional Info

Message from the maintainers:

Excited about this feature? Give it a :thumbsup:. We factor engagement into prioritization.
