fybrik / airbyte-module

A FybrikModule based on Airbyte
Apache License 2.0

Redefine "file" connection structure #64

Closed shlomitk1 closed 1 year ago

shlomitk1 commented 1 year ago

The current connection is defined in asset.yaml as follows:

connection:
  name: file
  file:
    connector: "airbyte/source-file"
    dataset_name: userdata
    format: parquet
    url: "https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata2.parquet"
    provider:
      storage: HTTPS

A number of issues:

  1. The name "file" is misleading. Files can be local or remote, and a connector that supports one type does not necessarily support the other. I therefore suggest changing the name to "remote" or "https"; other suggestions are welcome.
  2. The "connector" property is an implementation detail, not part of the connection. It should be determined in the module charts.
  3. "provider" is redundant since "https" already appears in the url, unless I am missing something here.
  4. "dataset_name" is both redundant and incorrect: the file name already appears in the url, where it is userdata2 rather than userdata.
  5. "format" is not part of the connection; it appears under dataFormat of the asset.

I suggest having something like this:
    connection:
      name: remote
      remote:
        protocol: HTTPS
        host: "github.com"
        folder: "Teradata/kylo/raw/master/samples/sample-data/parquet"
        file: "userdata2.parquet"

    or

    connection:
      name: https
      https:
        url: "https://github.com/Teradata/kylo/raw/master/samples/sample-data/parquet/userdata2.parquet"
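
The two proposed shapes carry the same information, and the fields dropped from the current "file" connection can all be recovered from the url. A minimal Python sketch of that equivalence follows. The function names (`split_remote_url`, `join_remote_fields`, `migrate_file_connection`) are hypothetical and not part of Fybrik or the airbyte-module, and taking the new connection name from the url scheme is an assumption made for illustration.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def split_remote_url(url: str) -> dict:
    """Break a url into the fields of the proposed "remote" connection."""
    parts = urlsplit(url)
    folder, file_name = posixpath.split(parts.path)
    return {
        "protocol": parts.scheme.upper(),
        "host": parts.netloc,
        "folder": folder.strip("/"),
        "file": file_name,
    }


def join_remote_fields(remote: dict) -> str:
    """Rebuild the single url used by the proposed "https" connection."""
    path = posixpath.join("/", remote["folder"], remote["file"])
    return urlunsplit((remote["protocol"].lower(), remote["host"], path, "", ""))


def migrate_file_connection(connection: dict) -> dict:
    """Convert the current "file" connection into the url-only form.

    connector, provider, dataset_name and format are dropped, following
    points 2 to 5 of the issue. The connection name is taken from the url
    scheme (an assumption made for this sketch).
    """
    url = connection["file"]["url"]
    name = urlsplit(url).scheme.lower()
    return {"name": name, name: {"url": url}}


# The current connection from asset.yaml, written as a Python dict.
current = {
    "name": "file",
    "file": {
        "connector": "airbyte/source-file",
        "dataset_name": "userdata",
        "format": "parquet",
        "url": "https://github.com/Teradata/kylo/raw/master/samples/"
               "sample-data/parquet/userdata2.parquet",
        "provider": {"storage": "HTTPS"},
    },
}

migrated = migrate_file_connection(current)
remote = split_remote_url(migrated["https"]["url"])
```

Here `remote` holds the protocol, host, folder and file fields of the first proposal, and `join_remote_fields(remote)` gives back the original url, so either form could be derived from the other.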

The connection should be defined in Fybrik in a custom layer, pkg/storage/layers/connection.yaml.

@cdoron @Mohammad-nassar10 @revit13 @simanadler