astronomer / astro-sdk

Astro SDK allows rapid and clean development of {Extract, Load, Transform} workflows using Python and SQL, powered by Apache Airflow.
https://astro-sdk-python.rtfd.io/
Apache License 2.0
347 stars 43 forks source link

Explore possibility to make better experience with redshift operator #911

Open pankajastro opened 2 years ago

pankajastro commented 2 years ago

Please describe the feature you'd like to see Loading to redshift from s3 seems more complicated than traditional Amazon S3toRedshift operator and also run into the header issue with it. https://www.notion.so/astronomerio/ML-Batch-Inference-with-Astro-SDK-Issues-Summary-0703c2fe6abf485da514b45e5db9651f

Relevant link : https://astronomer.slack.com/archives/C02B8SPT93K/p1663942705854959

Describe the solution you'd like A clear and concise description of what you want to happen, if possible with code examples.

Are there any alternatives to this feature? Is there another way we could solve this problem or enable this use-case?

Additional context Add any other context about the feature request here.

Acceptance Criteria

kaxil commented 2 years ago

@pankajkoti Any update on this? Can you spend some time tomorrow or on Monday to see if you can take care of this, please?

pankajkoti commented 2 years ago

Hi @kaxil , yes I had taken a look at it. The Amazon S3toRedshift operator takes 2 connections https://github.com/apache/airflow/blob/main/airflow/providers/amazon/aws/transfers/s3_to_redshift.py#L79, one is for Redshift Hook and another for S3 Hook. In our base class implementation, we only have 1 connection i.e. for Redshift. The S3Hook has get_credentials method which gives IAM based key-pair and token and then we do not need IAM role to be explicitly passed. I was thinking how should we accept another connection ID (for S3Hook) here: https://github.com/astronomer/astro-sdk/blob/main/python-sdk/src/astro/databases/aws/redshift.py#L73 so that we can use something like this: https://github.com/apache/airflow/blob/main/airflow/providers/amazon/aws/transfers/s3_to_redshift.py#L131.

Our base class implementation just has conn_id. Should I add another param s3_conn_id here https://github.com/astronomer/astro-sdk/blob/main/python-sdk/src/astro/databases/aws/redshift.py#L73 ?

pankajkoti commented 1 year ago

Can we move this to next milestone please, looks difficult to find time this week for the story cc: @pankajastro @phanikumv @tatiana

pankajkoti commented 1 year ago

It looks difficult for me to work on it at this point in time. So I'm un-assigning myself in case someone is willing to pick it up.