aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0
3.84k stars, 678 forks

feat: Support different merge conditions in `athena.to_iceberg` function #2861

Closed aldder closed 1 week ago

aldder commented 1 week ago

Feature or Bugfix

Detail

Sometimes, when inserting new data into an Iceberg table, it is necessary to skip rows that already exist (matched on some fields) and load only the new ones, to avoid unwanted overwrites across all columns of the dataset.

This PR adds a check on the type of merge to perform: by default, matching rows are updated; alternatively, you can choose to ignore duplicate entries and insert only the new rows.
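
To make the difference between the two merge types concrete, here is a minimal sketch of how a `MERGE INTO` statement could differ between them. The helper `build_merge_sql` and the table names are hypothetical and are not taken from the library's code; the sketch only assumes that the "ignore" case amounts to dropping the `WHEN MATCHED` clause.

```python
def build_merge_sql(target, source, columns, merge_cols, merge_condition="update"):
    """Build an illustrative MERGE INTO statement (hypothetical helper)."""
    if merge_condition not in ("update", "ignore"):
        raise ValueError(f"Unsupported merge_condition: {merge_condition!r}")

    # Rows match when every merge column is equal in target and source.
    on_clause = " AND ".join(f'target."{c}" = source."{c}"' for c in merge_cols)
    parts = [
        f"MERGE INTO {target} target",
        f"USING {source} source",
        f"ON {on_clause}",
    ]

    # "update" overwrites matched rows; "ignore" leaves them untouched.
    if merge_condition == "update":
        set_clause = ", ".join(f'"{c}" = source."{c}"' for c in columns)
        parts.append(f"WHEN MATCHED THEN UPDATE SET {set_clause}")

    # In both cases, unmatched rows are inserted.
    col_list = ", ".join(f'"{c}"' for c in columns)
    val_list = ", ".join(f'source."{c}"' for c in columns)
    parts.append(f"WHEN NOT MATCHED THEN INSERT ({col_list}) VALUES ({val_list})")
    return "\n".join(parts)


# Assumed table names, for illustration only.
print(build_merge_sql("db.movies", "db.movies_tmp",
                      ["title", "year", "gross"], ["title", "year"],
                      merge_condition="ignore"))
```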

Use case:

df1:

| title   |   year |    gross |
|:--------|-------:|---------:|
| Dune    |   1984 | 35000000 |
| Fargo   |   1996 | 60000000 |

df2:

| title   |   year |     gross |
|:--------|-------:|----------:|
| Dune    |   2021 | 400000000 |
| Fargo   |   1996 |  60000001 |

case 1: UPDATE (default)

`wr.athena.to_iceberg(..., merge_cols=["title", "year"])`

out:
| title   |   year |     gross |
|:--------|-------:|----------:|
| Dune    |   1984 |  35000000 |
| Fargo   |   1996 |  60000001 |
| Dune    |   2021 | 400000000 |

case 2: IGNORE

`wr.athena.to_iceberg(..., merge_cols=["title", "year"], merge_condition="ignore")`

out:
| title   |   year |     gross |
|:--------|-------:|----------:|
| Dune    |   1984 |  35000000 |
| Fargo   |   1996 |  60000000 |
| Dune    |   2021 | 400000000 |
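
The two cases above can be reproduced locally without AWS. The following is a plain-Python sketch of the merge semantics only; `merge_rows` is a hypothetical helper written for illustration and is not part of awswrangler.

```python
def merge_rows(existing, incoming, merge_cols, merge_condition="update"):
    """Simulate merging `incoming` rows into `existing` rows (lists of dicts)."""
    if merge_condition not in ("update", "ignore"):
        raise ValueError(f"Unsupported merge_condition: {merge_condition!r}")

    # Index existing rows by their merge key; dicts keep insertion order.
    merged = {tuple(row[c] for c in merge_cols): row for row in existing}
    for row in incoming:
        key = tuple(row[c] for c in merge_cols)
        if key in merged and merge_condition == "ignore":
            continue  # keep the row already in the table
        merged[key] = row  # update a matched row or insert a new one
    return list(merged.values())


df1 = [
    {"title": "Dune", "year": 1984, "gross": 35000000},
    {"title": "Fargo", "year": 1996, "gross": 60000000},
]
df2 = [
    {"title": "Dune", "year": 2021, "gross": 400000000},
    {"title": "Fargo", "year": 1996, "gross": 60000001},
]

updated = merge_rows(df1, df2, ["title", "year"])
ignored = merge_rows(df1, df2, ["title", "year"], merge_condition="ignore")
print(updated[1]["gross"])  # 60000001: Fargo (1996) was overwritten
print(ignored[1]["gross"])  # 60000000: Fargo (1996) was left untouched
```

In both cases the new key (Dune, 2021) is inserted; only the treatment of the matched key (Fargo, 1996) differs.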

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

malachi-constant commented 1 week ago

AWS CodeBuild CI Report

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
