Polars Backend over Pandas

kailukowiak commented 1 year ago

Is your feature request related to a problem? Please describe.

Pandas can be slow and memory intensive. When dealing with large files I need a lot more memory in my EC2 instance than I would if I were using Polars.

Also, though this is a matter of personal preference, the Polars API can be much cleaner.

Describe the solution you'd like

It would be really nice if I could use a faster and more memory-efficient DataFrame API to ingest and export data.

Describe alternatives you've considered

I often convert Pandas DFs to Polars ones, and then process the data before writing it back out. This works fine on small data sets, but on large ones it would be nice never to have to allocate all the memory needed for Pandas.
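The round trip described above might look roughly like the sketch below. It is only an illustration: the function name, the `value` column, and the `day` partition column are hypothetical, and the table is assumed to be a Parquet dataset partitioned by `day`. It also shows where the pandas allocation still happens, which is the cost the request is trying to avoid.

```python
def rewrite_day_partition(table_path: str, day: str) -> None:
    """Load one day partition via awswrangler, process it in Polars, write it back.

    Sketch only: `value` and `day` are hypothetical column names.
    """
    # Imported lazily so this sketch can be loaded without the heavy dependencies.
    import awswrangler as wr
    import polars as pl

    # awswrangler returns a pandas DataFrame, so the full pandas copy is allocated here.
    pandas_df = wr.s3.read_parquet(
        path=table_path,
        dataset=True,
        partition_filter=lambda partition: partition["day"] == day,
    )

    # Convert to Polars for the processing step.
    polars_df = pl.from_pandas(pandas_df)
    processed = polars_df.filter(pl.col("value").is_not_null())

    # Convert back, because wr.s3.to_parquet expects a pandas DataFrame.
    wr.s3.to_parquet(
        df=processed.to_pandas(),
        path=table_path,
        dataset=True,
        mode="overwrite_partitions",
        partition_cols=["day"],
    )
```

A call would look like `rewrite_day_partition("s3://my-bucket/my-table/", "2023-01-01")`, where the bucket and table names are placeholders.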

Comments

I know this is a large ask and currently Polars isn't that popular, but I think this would be a huge performance increase if implemented and would make my ETL much prettier (subjectively) too.

jaidisido commented 1 year ago

Hi @kailukowiak, this would indeed be quite a big ask and a huge shift for a library named AWS SDK for pandas :)

Have you had a look at our SDK for pandas at scale work?

This is currently available as a release candidate but we hope to release it in the coming weeks.

One major shortcoming of Polars that I have raised with the maintainers is that it's limited to a single node. This is why we have preferred to invest in Modin and Ray to support distributed computing.

kailukowiak commented 1 year ago

Hi @jaidisido.

Yes, I definitely understand that it would be a lot of work, and my heart did sink when I saw the repo had been renamed (I believe) from aws-datawrangler to aws-sdk-pandas 🤣.

I've used wrangler and AWS Batch/pcluster before and ran into API call throttling issues, but it's possible that won't be an issue any more because I was using the now-defunct Governed Tables. I presume updating Athena/Lake Formation would be a cheaper and less limited API call.

My main concern with the distributed approach using pandas is mode="overwrite_partitions" when calling wr.s3.to_parquet. I find the best way to update data in our lake is to load an entire day partition into a DataFrame, make any changes, and then overwrite the partition. However, some days have over 100 GB of data, which has necessitated using a ~750 GB EC2 instance. Would distributed wr.s3.to_csv(..., mode="overwrite_partitions", ...) calls result in each Ray thread/process overwriting the partition completely?

jaidisido commented 1 year ago

When using Modin/Ray the data is spread across the cluster, whereas with pandas/Polars all the data must live on the same node. So instead of using a massive EC2 instance like in your case, you can create a cluster of smaller machines and the library handles distributing the data across them. s3.to_csv is one of the supported methods.

kailukowiak commented 1 year ago

Yes, my concern is that, because of the distributed nature, mode="overwrite_partitions" in wr would discard every node's output except that of the last node to complete. However, I presume you've handled this, so I don't need to worry.

I'll close the issue now.

Thanks.
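The worry raised in this comment can be illustrated with a toy in-memory model. This is not how awswrangler or Ray implement the write, and the thread does not say how the library coordinates workers; the function names and the dictionary standing in for S3 are hypothetical. The sketch only contrasts two possible semantics: every worker clearing the partition before it writes, versus the partition being cleared once before all workers write.

```python
from typing import Dict, List

# Hypothetical stand-in for S3: partition prefix -> object names stored under it.
Store = Dict[str, List[str]]


def overwrite_per_worker(store: Store, partition: str, worker_files: List[str]) -> None:
    """Each worker clears the partition and then writes its own file.

    Only the file from the last worker to finish survives.
    """
    for name in worker_files:
        store[partition] = [name]


def clear_once_then_write(store: Store, partition: str, worker_files: List[str]) -> None:
    """The partition is cleared a single time, then every worker adds its file."""
    store[partition] = []
    for name in worker_files:
        store[partition].append(name)


files = ["worker-0.parquet", "worker-1.parquet", "worker-2.parquet"]

lossy: Store = {"day=2023-01-01": ["old.parquet"]}
overwrite_per_worker(lossy, "day=2023-01-01", files)

safe: Store = {"day=2023-01-01": ["old.parquet"]}
clear_once_then_write(safe, "day=2023-01-01", files)
```

Under the first semantics the partition ends up holding a single worker's file; under the second it holds all three new files and none of the old data.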

johnros commented 11 months ago

Given the popularity that Polars has been gaining throughout 2023, what are the odds of revisiting the decision to invest in Modin/Ray? Polars does seem to be the future of parallel data frames within a single machine.

brunocous commented 8 months ago

Given all the hype around Polars recently, and other packages like scikit-learn now supporting Polars DataFrames, it would make sense to re-evaluate this.

MacHu-GWU commented 1 month ago

Start working on this aws_sdk_polars

kailukowiak commented 1 month ago

> Start working on this aws_sdk_polars

Sweet. I love it. Thanks.

aws / aws-sdk-pandas

Polars Backend over Pandas #1951