airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Add support for using an IAM instance profile, instead of provisioned AWS API keys. #5282

Open joekhoobyar opened 3 years ago

joekhoobyar commented 3 years ago

Tell us about the problem you're trying to solve

We are trying to deploy airbyte, in AWS, without provisioning any additional AWS API keys.

Describe the solution you’d like

I would like Airbyte to support using an IAM instance profile, like most software does.

The AWS SDK already knows how to use the IAM instance profile automatically - you simply don't pass any credentials to it, and it searches for credentials in several well-known locations.

Describe the alternative you’ve considered or used

We don't have an alternative, as provisioning API keys for service accounts is against our existing security posture.

Are you willing to submit a PR?

I am willing to submit a PR, if somebody can point me to the relevant places in the code that create AWS API clients.

sherifnada commented 3 years ago

@joekhoobyar is this for a particular connector? all of them?

joekhoobyar commented 3 years ago

@sherifnada - it should be for the whole system, IMO: the connectors, as well as the base airbyte/* images.

FYI - currently this is blocking our deployment of Airbyte for one of our customers, due to their security posture.

sherifnada commented 3 years ago

@joekhoobyar I think I understand the ask in the case of connectors. Can you help me understand it in the "whole system" case -- are you saying when hitting Airbyte via the API, you don't want to use any Airbyte-generated API keys but rather rely on IAM permissions to control authn/authz within Airbyte?

joekhoobyar commented 3 years ago

No, @sherifnada - actually, I think something has gotten lost in translation here.

What I'm asking for is much simpler than that.

Using the AWS SDK, this is quite simple: simply do not set those environment variables, and the SDK will take care of obtaining the API keys for you from the instance profile. This is much more secure, since there are no longer any keys to be rotated, leaked, etc.

joekhoobyar commented 3 years ago

For example, here is the documentation for the Java SDK. The other SDKs handle it the same way.
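
To make it concrete, here is a minimal sketch (assuming the AWS SDK for Java v2; the class name and region are just for the example). You build the client without passing any credentials, and the SDK walks its default provider chain, which ends at the EC2 instance profile:

    // Minimal sketch, assuming AWS SDK for Java v2.
    // No credentials are passed to the builder, so the SDK resolves them via its
    // default chain: system properties, env vars, the shared credentials file,
    // container credentials, and finally the EC2 instance profile.
    import software.amazon.awssdk.regions.Region;
    import software.amazon.awssdk.services.s3.S3Client;

    public class InstanceProfileExample {
        public static void main(String[] args) {
            try (S3Client s3 = S3Client.builder()
                    .region(Region.US_EAST_1) // example region
                    .build()) {
                // On an EC2 instance with an instance profile attached, this call
                // authenticates without any provisioned API keys.
                s3.listBuckets().buckets().forEach(b -> System.out.println(b.name()));
            }
        }
    }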

nathan5280 commented 3 years ago

@sherifnada we're trying to get our heads around the logging when deployed to AWS.

In the .env file:

# Cloud log backups. Don't use this unless you know what you're doing. Mainly for Airbyte devs.
# If you just want to capture Docker logs, you probably want to use something like this instead:
# https://docs.docker.com/config/containers/logging/configure/
S3_LOG_BUCKET=
S3_LOG_BUCKET_REGION=
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
S3_MINIO_ENDPOINT=
S3_PATH_STYLE_ACCESS=

Which sounds like the logging should be handled through Docker.

In the docker-compose.yaml file, we see these environment variables for the server and the scheduler:

      - S3_LOG_BUCKET=${S3_LOG_BUCKET}
      - S3_LOG_BUCKET_REGION=${S3_LOG_BUCKET_REGION}
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}

These are used in airbyte-config/.../S3Logs.java to create the client to write logs to an S3 bucket.

I think there are two questions:

  1. If the logs are handled by Docker, does the S3 logging even need to be configured and an S3Client created?
  2. If the S3 logging does need to be configured, can we code it so that, when AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY aren't in the environment, the S3Client falls back to its credential search mechanisms to find them, as @joekhoobyar mentions above? (A sketch of this is below.)
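
For question 2, here is a hypothetical sketch of what that fallback could look like (assuming AWS SDK for Java v2; the class and method names are made up and this is not Airbyte's actual S3Logs.java code):

    // Hypothetical sketch for question 2: keep today's behavior when explicit keys
    // are set, otherwise fall back to the SDK's default provider chain, which
    // includes the EC2 instance profile.
    import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
    import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
    import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
    import software.amazon.awssdk.regions.Region;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.S3ClientBuilder;

    public class LogClientFactory {
        static S3Client buildLogClient(String region) {
            String key = System.getenv("AWS_ACCESS_KEY_ID");
            String secret = System.getenv("AWS_SECRET_ACCESS_KEY");
            S3ClientBuilder builder = S3Client.builder().region(Region.of(region));
            if (key != null && !key.isEmpty() && secret != null && !secret.isEmpty()) {
                // Keys provided explicitly: use them, as today.
                builder.credentialsProvider(
                        StaticCredentialsProvider.create(AwsBasicCredentials.create(key, secret)));
            } else {
                // No keys in the environment: let the SDK search its default chain.
                builder.credentialsProvider(DefaultCredentialsProvider.create());
            }
            return builder.build();
        }
    }
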
joekhoobyar commented 3 years ago

All logging errors go away if I add the following blank environment variables:

    S3_LOG_BUCKET = ""
    S3_LOG_BUCKET_REGION = ""
    S3_MINIO_ENDPOINT = ""
    S3_PATH_STYLE_ACCESS = ""
    GCP_STORAGE_BUCKET = ""

lewisdrummond1 commented 2 years ago

Can someone confirm whether S3 logging works from an EC2 instance assuming an IAM instance profile? Can we avoid having to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY? @davinchia for awareness.

vbhamidipati commented 2 years ago

I am wondering if this can be enhanced to use the DefaultCredentialsProvider. This would allow the logging to be more generic and work everywhere, irrespective of how the AWS credentials are configured. Question: can I submit changes as a PR to this core functionality, or does this need to be handled by the Airbyte team?
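
For concreteness, a minimal sketch of what I mean (assuming AWS SDK for Java v2; the class name and region are just for the example):

    // Minimal sketch, assuming AWS SDK for Java v2. DefaultCredentialsProvider
    // searches system properties, env vars, the shared credentials file, container
    // credentials, and the EC2 instance profile, so the same code works everywhere.
    import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
    import software.amazon.awssdk.regions.Region;
    import software.amazon.awssdk.services.s3.S3Client;

    public class GenericCredentialsExample {
        public static void main(String[] args) {
            try (S3Client s3 = S3Client.builder()
                    .region(Region.US_WEST_2) // example region
                    .credentialsProvider(DefaultCredentialsProvider.create())
                    .build()) {
                System.out.println("Buckets visible: " + s3.listBuckets().buckets().size());
            }
        }
    }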

davinchia commented 2 years ago

We definitely welcome contributions! This is handled by another open source project: https://github.com/bluedenim/log4j-s3-search. If you contribute to that project, I'm happy to pull in the latest version!