airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

[STORAGE-S3] Add Support for Custom S3 Endpoint #44174

Open Sebastien73 opened 3 months ago

Sebastien73 commented 3 months ago

Helm Chart Version

0.429.0

What step the error happened?

Upgrading the Platform or Helm Chart

Relevant information

Issue: Add Support for Custom S3 Endpoint in log4j2-s3.xml

Description

Currently, the log4j2-s3.xml configuration file in the Airbyte platform is hardcoded to work with Amazon S3 as the default storage option. However, when deploying Airbyte via the Helm chart, it's not possible to override the default AWS endpoint (amazonaws.com) to use S3-compatible storage services from other providers (e.g., Scaleway).

For instance, it is not currently possible to set an endpoint like https://s3.fr-par.scw.cloud for Scaleway's S3-compatible storage service.

Proposal

To enhance compatibility with other S3-compatible storage services, I propose adding support for a custom S3 endpoint in the log4j2-s3.xml file. Below is a suggested change:

<Properties>
    <Property name="ci-mode">${sys:ciMode:-false}</Property>
    <!--
    date format is datadog friendly so that it can try to detect multilines logs
    https://github.com/DataDog/datadog-agent/blob/a27c16c05da0cf7b09d5a5075ca568fdae1b4ee0/pkg/logs/internal/decoder/auto_multiline_handler.go#L208

    trace_id and span_id are added to the logs so that datadog can correlate logs and traces https://docs.datadoghq.com/tracing/other_telemetry/connect_logs_and_traces/
    https://app.datadoghq.com/logs/pipelines?search=env%3Aci
     -->
    <Property name="pattern-with-trace-id">%d{yyyy-MM-dd HH:mm:ss,SSS}{GMT+0} [dd.trace_id=%X{dd.trace_id} dd.span_id=%X{dd.span_id}] %p %C{1.}(%M):%L %replace{%m}{apikey=[\w\-]*}{apikey=*****}%n</Property>
    <!-- Mask the string apikey=<string> to apikey=***** to prevent secrets leaking. -->
    <Property name="default-pattern">%d{yyyy-MM-dd HH:mm:ss}{GMT+0} %highlight{%p} %C{1.}(%M):%L - %replace{%m}{apikey=[\w\-]*}{apikey=*****}%n</Property>
    <!--Logs the timestamp and log_source/application name in the beginning of the line if it exists with a > separator, and then always the rest of the line.-->
    <Property name="simple-pattern">%d{yyyy-MM-dd HH:mm:ss}{GMT+0}%replace{ %X{log_source}}{^ -}{} > %replace{%m}{apikey=[\w\-]*}{apikey=*****}%n</Property>

    <!-- Always log INFO by default. -->
    <Property name="log-level">${sys:LOG_LEVEL:-${env:LOG_LEVEL:-INFO}}</Property>

    <Property name="route-ttl">${env:LOG_IDLE_ROUTE_TTL:-15}</Property>

    <!-- Note that logging to S3 will leverage the DefaultAWSCredentialsProviderChain for auth. -->
    <Property name="s3-bucket">${sys:STORAGE_BUCKET_LOG:-${env:STORAGE_BUCKET_LOG:-}}</Property>
    <Property name="s3-region">${sys:AWS_DEFAULT_REGION:-${env:AWS_DEFAULT_REGION:-}}</Property>
    <!-- Add the following property to allow custom S3 endpoint -->
    <Property name="s3-endpoint">${sys:AWS_ENDPOINT_URL:-${env:AWS_ENDPOINT_URL:-}}</Property>
</Properties>
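
With this property in place, the endpoint still has to reach the JVM as the AWS_ENDPOINT_URL system property or environment variable. Until the chart wires it through, one stopgap could be to inject it via the chart's per-component extraEnv lists (a sketch, assuming extraEnv is exposed for the log-writing components in your chart version; the variable name matches the property above):

worker:
  extraEnv:
    - name: AWS_ENDPOINT_URL
      value: "https://s3.fr-par.scw.cloud"
server:
  extraEnv:
    - name: AWS_ENDPOINT_URL
      value: "https://s3.fr-par.scw.cloud"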

In addition to modifying the log4j2-s3.xml file, a corresponding change would be necessary in the values.yaml file within the Helm chart to support custom S3 endpoints. The change should be made in the global.storage section, as shown below:

storage:
  # -- The storage backend type. Supports s3, gcs, minio (default)
  type: s3 # change to your preferred storage type
  # -- Secret name where storage provider credentials are stored
  #storageSecretName: "airbyte-config-secrets"

  # S3
  bucket: ## S3 bucket names that you've created. We recommend storing the following all in one bucket.
    log: airbyte-bucket
    state: airbyte-bucket
    workloadOutput: airbyte-bucket
  s3:
    region: "" ## e.g. us-east-1
    authenticationType: credentials ## Use "credentials" or "instanceProfile"

    # -- Add the following to support a custom S3 endpoint:
    endpoint: <CUSTOM_ENDPOINT_URL>

  # GCS
  #bucket: ## GCS bucket names that you've created. We recommend storing the following all in one bucket.
  #  log: airbyte-bucket
  #  state: airbyte-bucket
  #  workloadOutput: airbyte-bucket
  #gcs:
  #  projectId: <project-id>
  #  credentialsJson: /secrets/gcs-log-creds/gcp.json
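
For illustration, with the proposed key in place, a Scaleway deployment could then be configured as follows (the endpoint key is the hypothetical addition from this proposal; bucket names and region are examples):

global:
  storage:
    type: s3
    bucket:
      log: airbyte-bucket
      state: airbyte-bucket
      workloadOutput: airbyte-bucket
    s3:
      region: fr-par
      authenticationType: credentials
      endpoint: https://s3.fr-par.scw.cloud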

Expected Outcome

These changes would allow users to specify an alternative S3-compatible endpoint when configuring Airbyte with non-AWS S3 storage providers, improving Airbyte's flexibility and enabling deployment in a broader range of cloud storage environments.

Relevant log output

No response

aqeelat commented 2 months ago

try setting

global:
  storage:
    minio:
      endpoint: "your custom s3 endpoint"
Sebastien73 commented 2 months ago

> try setting
>
> global:
>   storage:
>     minio:
>       endpoint: "your custom s3 endpoint"

I have tested this, but with "minio" as the storage type I can't change the default value of the region. My S3 uses the "fr-par" region.

With the "minio" storage type I can change the endpoint but not the region, and with the "s3" storage type I can change the region but not the endpoint.

djpirra commented 2 months ago

So the storage type should be set to s3, with a minio entry under storage?

If the storage type is set to minio, it will create a MinIO instance.

Sebastien73 commented 2 months ago

Hello,

When I use the s3 storage type in my values.yaml, I can't set a custom endpoint; the value stays at the default AWS endpoint. That's why I tested the "minio" storage type, which does allow a custom endpoint, but there the region parameter stays at the AWS default.

I would like to be able to customize both the "region" and "endpoint" parameters so that I can use object storage from the Scaleway provider.

First test, with the s3 storage type:

global:
  storage:
    type: s3
    bucket: ## S3 bucket names that you've created. We recommend storing the following all in one bucket.
      activityPayload: airbyte-s3
      log: airbyte-s3
      state: airbyte-s3
      workloadOutput: airbyte-s3
    s3:
      region: fr-par ## e.g. us-east-1
      authenticationType: credentials ## Use "credentials" or "instanceProfile"
      accessKeyIdSecretKey: AWS_ACCESS_KEY_ID
      secretAccessKeySecretKey: AWS_SECRET_ACCESS_KEY

This is where I need a way to set a custom endpoint, just like the "region" parameter.

Second test, with the minio storage type:

global:
  storage:
    type: minio
    minio:
      endpoint: <custom endpoint>

And here I can't change the value of the "region" parameter.
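
To summarize the gap, this is a sketch of the configuration both tests were reaching for; the endpoint key under s3 is exactly the hypothetical addition this issue proposes:

global:
  storage:
    type: s3
    s3:
      region: fr-par                        # configurable today with type: s3
      endpoint: https://s3.fr-par.scw.cloud # proposed; not supported by the current chart
      authenticationType: credentials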
ssaunier commented 3 weeks ago

Being able to use the S3-compatible services of different IaaS providers (DigitalOcean, Scaleway, etc.), and not just AWS S3, would be a great addition indeed 👍!