aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0
3.91k stars 699 forks

Possible to write spark dataframes to glue tables in similar fashion as awswrangler.s3.to_parquet #1743

Open aabid0193 opened 1 year ago

aabid0193 commented 1 year ago

If it isn't possible already, it would be nice if we could use Spark dataframes to write to Glue tables using something similar to wrangler's to_parquet method. It works great for pandas and can set the mode to overwrite partitions, so I was wondering if we can do the same with Spark dataframes.

emerson131 commented 1 year ago

> wrangler's to_parquet method. It works great for pandas and has the ability to set the mode to overwrite partitions and was wondering if we can do this with spark dataframes.

If you are using Spark, I would imagine that simply converting your Spark dataframe to a pandas one would get you there if you want to use wrangler.

sparkDF.toPandas()
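
For a dataframe that fits in driver memory, that route might look roughly like the sketch below. The helper name, S3 path, database, and table are placeholders and not anything from this thread; `wr.s3.to_parquet` with `dataset=True` and `mode="overwrite_partitions"` is the awswrangler call the original post refers to. The `writer` argument is an illustrative addition so the function can be tried without AWS access.

```python
def spark_to_glue_via_pandas(spark_df, path, database, table,
                             partition_cols=None, writer=None):
    """Collect a Spark DataFrame to the driver and write it with awswrangler.

    Hypothetical helper: only suitable when the data fits in driver memory.
    """
    if writer is None:
        # Imported lazily so this module loads without awswrangler installed.
        import awswrangler as wr
        writer = wr.s3.to_parquet

    pandas_df = spark_df.toPandas()  # pulls every row onto the driver
    return writer(
        df=pandas_df,
        path=path,                    # placeholder, e.g. "s3://my-bucket/my-prefix/"
        dataset=True,                 # needed for partitioning and catalog registration
        mode="overwrite_partitions",  # replace only the partitions present in df
        partition_cols=partition_cols,
        database=database,
        table=table,
    )
```

Calling `spark_to_glue_via_pandas(sparkDF, "s3://my-bucket/my-prefix/", "my_db", "my_table", partition_cols=["dt"])` would write the data and create or update the Glue table in one step, assuming valid AWS credentials.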

aabid0193 commented 1 year ago

Yeah, that is something you can do right now. However, for large datasets that require Spark, this wouldn't be ideal, since toPandas() collects the whole dataframe onto the driver.

aabid0193 commented 1 year ago

Essentially, what I'm wishing for is the ability to register Athena tables based on the PySpark dataframe metadata. I see that this was implemented here: https://github.com/aws/aws-sdk-pandas/issues/29. However, it seems to me that this method is no longer supported in newer versions of wrangler. Additionally, I would like to be able to overwrite partitions.
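
A possible workaround with current releases, sketched under several assumptions: let Spark write the Parquet files itself, then register the table from Spark's own schema. The helper names below are hypothetical. The sketch assumes the Hive-style type strings in `df.dtypes` (for example `bigint`, `string`, `date`) are acceptable to Glue as they are, which holds for common types but may not for every Spark type. It uses Spark's `partitionOverwriteMode` writer option for dynamic partition overwrite and awswrangler's `wr.catalog.create_parquet_table` for registration; the `catalog` argument exists only so the function can be tried without AWS access.

```python
def split_spark_dtypes(dtypes, partition_cols):
    """Split Spark ``df.dtypes`` pairs into (columns_types, partitions_types)."""
    lookup = dict(dtypes)
    missing = [c for c in partition_cols if c not in lookup]
    if missing:
        raise ValueError(f"partition columns not in schema: {missing}")
    # Partition columns live in the S3 key layout, not in the data files.
    columns_types = {n: t for n, t in dtypes if n not in partition_cols}
    partitions_types = {c: lookup[c] for c in partition_cols}
    return columns_types, partitions_types


def write_and_register(spark_df, path, database, table, partition_cols,
                       catalog=None):
    """Write with Spark, then create the Glue table from the Spark schema."""
    if catalog is None:
        # Imported lazily so this module loads without awswrangler installed.
        import awswrangler as wr
        catalog = wr.catalog

    columns_types, partitions_types = split_spark_dtypes(
        spark_df.dtypes, partition_cols
    )
    # Dynamic mode overwrites only the partitions present in spark_df.
    (spark_df.write
        .mode("overwrite")
        .option("partitionOverwriteMode", "dynamic")
        .partitionBy(*partition_cols)
        .parquet(path))
    catalog.create_parquet_table(
        database=database,
        table=table,
        path=path,
        columns_types=columns_types,
        partitions_types=partitions_types,
    )
    return columns_types, partitions_types
```

This only creates the table definition. The partitions themselves would still need to be registered afterwards, for instance with `wr.athena.repair_table` or `wr.catalog.add_parquet_partitions`, and re-creating the table on every run may not be what you want for an existing table.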

github-actions[bot] commented 1 year ago

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.