apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

DynamoDBToS3Operator using native export functionality. #40737

Closed Kuhlmann-Itagyba-bah closed 1 week ago

Kuhlmann-Itagyba-bah commented 1 month ago

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==8.20.0

Apache Airflow version

2.9.1

Operating System

Debian GNU/Linux 12

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

What happened

DynamoDBToS3Operator does not offer the option of using DynamoDB's native export functionality; instead, it scans the entire table, which requires the Airflow executor role to have read access to the data itself, and that is potentially dangerous. It also does not support incremental export, which is available through boto3.

What you think should happen instead

The operator should support the AWS native export functionality, allowing the Airflow role to perform the data transfer without needing read access to the data itself.
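For illustration, a minimal sketch of what the executor role's IAM policy could look like under the native-export approach (the table and bucket ARNs are placeholders, and the statement list is not exhaustive): the role needs only the export permission on the table plus write access to the destination bucket, not `dynamodb:Scan`.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "dynamodb:ExportTableToPointInTime",
      "Resource": "arn:aws:dynamodb:us-east-1:111122223333:table/my_table"
    },
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-export-bucket/*"
    }
  ]
}
```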

How to reproduce

Run any DynamoDBToS3Operator task with a role that lacks the "dynamodb:Scan" permission on the given table.

Anything else

No response

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 1 month ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

jayceslesar commented 1 month ago

Is this ask possible? The only relevant method listed in the boto3 DynamoDB client is export_table_to_point_in_time, which the existing codebase already supports: https://github.com/apache/airflow/blob/6b9214508ae8ff4d6d39e9ecda5138b5ba717ceb/airflow/providers/amazon/aws/transfers/dynamodb_to_s3.py#L156.

Doesn't seem that a native export is possible without PITR enabled.

The only things the existing Airflow implementation/passthrough seems to be missing are two (optional) arguments:

  1. ExportType
  2. IncrementalExportSpecification
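To make the two missing arguments concrete, here is a hedged sketch of the request that the underlying boto3 `export_table_to_point_in_time` call would receive if the operator passed them through. The function name, ARNs, and prefix are hypothetical; only the request keys follow the boto3 API.

```python
from datetime import datetime, timezone


def build_incremental_export_request(
    table_arn: str,
    bucket: str,
    export_from: datetime,
    export_to: datetime,
    prefix: str = "exports/",
) -> dict:
    """Build kwargs for boto3's DynamoDB export_table_to_point_in_time
    call, including the two arguments the operator does not yet expose."""
    return {
        "TableArn": table_arn,
        "S3Bucket": bucket,
        "S3Prefix": prefix,
        "ExportFormat": "DYNAMODB_JSON",
        # The two missing (optional) arguments:
        "ExportType": "INCREMENTAL",
        "IncrementalExportSpecification": {
            "ExportFromTime": export_from,
            "ExportToTime": export_to,
            "ExportViewType": "NEW_AND_OLD_IMAGES",
        },
    }


# Hypothetical usage (requires boto3, and PITR enabled on the table):
#   boto3.client("dynamodb").export_table_to_point_in_time(**request)
request = build_incremental_export_request(
    table_arn="arn:aws:dynamodb:us-east-1:111122223333:table/my_table",
    bucket="my-export-bucket",
    export_from=datetime(2024, 7, 1, tzinfo=timezone.utc),
    export_to=datetime(2024, 7, 2, tzinfo=timezone.utc),
)
```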
gyli commented 1 month ago

I agree with @jayceslesar that ExportTableToPointInTime is the only API provided in boto to export DynamoDB data to S3, based on AWS's blog introducing this native export.

@Kuhlmann-Itagyba-bah You may try DynamoDBToS3Operator with the argument export_time specified, so that it uses the _export_table_to_point_in_time method for the export.
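For reference, a minimal sketch of this suggestion. The table name, bucket, and timestamp are placeholders; the keyword names follow the DynamoDBToS3Operator signature in the amazon provider, and in a DAG these kwargs would be passed to the operator.

```python
from datetime import datetime, timezone

# Hypothetical parameter values; when export_time is set, the operator
# takes the native PITR export path instead of scanning the table.
operator_kwargs = {
    "task_id": "dynamodb_to_s3_pitr_export",
    "dynamodb_table_name": "my_table",       # assumed table name
    "s3_bucket_name": "my-export-bucket",    # assumed bucket
    "export_time": datetime(2024, 7, 1, tzinfo=timezone.utc),
}

# In a DAG (requires apache-airflow-providers-amazon):
#   from airflow.providers.amazon.aws.transfers.dynamodb_to_s3 import (
#       DynamoDBToS3Operator,
#   )
#   DynamoDBToS3Operator(**operator_kwargs)
```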

Ghoul-SSZ commented 2 weeks ago

Hello. I also ran into this issue this week and have made a PR about it 👆 . Please take a look if you have some time to spare. 🙏 Since it's my first time making a PR here, there may be things or edge cases I haven't thought through.
In that case, sorry in advance 🙇

Any help or suggestions are more than welcome. 😄