airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Reinstate "namespace" support within "stream" object in public API (was somehow dropped in transition from server api) #47140

Open jmaddern-fw opened 1 week ago

jmaddern-fw commented 1 week ago

Topic

Add "Namespace" definition to "Stream"

Relevant information

I have a scenario where I have multiple (around 100) Postgres sources of identical schemas, similar to:

- postgres_db
    - schema_1
        - table_1
        - table_2
    - schema_2
        - table_1
        - table_2
    - etc.

I have transitioned from Octavia to Terraform and can see that, in that shift, support for namespace within a stream was removed from the new public-api, even though it was supported in the old server-api and continues to be supported in the UI. This is a limitation of the public-api.

Additional Details:

The deprecated Configuration API (server-api) has the field "namespace" included in the "stream" object:

https://airbyte-public-api-docs.s3.us-east-2.amazonaws.com/rapidoc-api-docs.html#post-/v1/connections/create

{
  ...
  "syncCatalog": {
    "streams": [
      {
        "stream": {
          "name": "...",
          "jsonSchema": {...},
          ...,
          "namespace": "string"
        },
        "config": {
          ...
        }
      }
    ]
  },
  ...
}

and you are able to send it to the backend. For example, in the browser, I can see that the POST payload sent to http://AIRBYTE_WEBAPP/api/v1/web_backend/connections/create looks like this:

image

At the same time, the public API has no equivalent parameter for that: https://reference.airbyte.com/reference/createconnection

The object "configurations[].streams[]" has only the "name", "syncMode", "cursorField", "primaryKey" and "selectedFields" parameters. I have no idea why the public-api is cut down compared with the server-api. The public API should support the same features as the UI.

nataliekwong commented 1 week ago

Hi there, could you accomplish writing to multiple schemas that are present in the source by using the setting "namespaceFormat"?

If you select source here, we will write to the namespace we detect from the source.

Screenshot 2024-10-21 at 9 53 21 AM

Or are you trying to write to a different namespace for each stream which is not present in the source?

jmaddern-fw commented 1 week ago

Hi @nataliekwong - we are already using the equivalent of ${SOURCE_NAMESPACE}, but that isn't the problem.

Using this (also above) as an example:

- postgres_db
    - schema_1
        - table_1
        - table_2
    - schema_2
        - table_1
        - table_2
    - etc.

Source: It is possible to have two separate schemas in the source as shown

image

Connection: Assuming all of the above schemas/tables in the example are in the same CDC replication slot/publication:

So the issue is that we cannot programmatically define multiple namespaces.
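To make "programmatically define multiple namespaces" concrete, here is a minimal Python sketch of what generating the stream list could look like. The helper name `build_stream_configs` is hypothetical, and the "namespace" key is the field being requested in this issue, which the public API does not currently accept. The sync mode value is just an example.

```python
from itertools import product


def build_stream_configs(schemas, tables, sync_mode="full_refresh_overwrite"):
    """Return one stream entry per (schema, table) pair.

    NOTE: "namespace" is the field requested in this issue. It is an
    assumption here, not a documented public-api parameter.
    """
    return [
        {"name": table, "namespace": schema, "syncMode": sync_mode}
        for schema, table in product(schemas, tables)
    ]


# Example matching the layout above: two schemas with identical tables.
streams = build_stream_configs(["schema_1", "schema_2"], ["table_1", "table_2"])
payload_fragment = {"configurations": {"streams": streams}}
```

With around 100 schemas of identical tables, the same call would produce one entry per schema and table, which is why a per-stream namespace is needed to tell them apart.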

Priye01 commented 6 days ago

@nataliekwong in addition to what is explained above: when using the public-api to create connections, we want to be able to send something like this in our payload:

"configurations": {
    "streams": [
        {
            "name": "table_1",
            "syncMode": "full_refresh_overwrite",
            "namespace": "schema_1"
        },
        {
            "name": "table_1",
            "syncMode": "full_refresh_overwrite",
            "namespace": "schema_2"
        }
    ]
},

Ideally this should work as it does with the server-api, but with the public-api it returns an error:

duplicate stream found in configuration for table_1.

This means the namespace no longer makes a stream unique, as it previously did. PS: I have 1000+ schemas to read from, all with identical tables.
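The behaviour can be illustrated with a small Python sketch. It assumes, based only on the error message, that the public API identifies streams by "name" alone. The actual validation code is not shown in this thread, and `duplicate_keys` is a hypothetical helper.

```python
from collections import Counter


def duplicate_keys(streams, key_fields):
    """Return the identity keys that occur more than once in `streams`."""
    counts = Counter(tuple(s.get(f) for f in key_fields) for s in streams)
    return [key for key, n in counts.items() if n > 1]


streams = [
    {"name": "table_1", "syncMode": "full_refresh_overwrite", "namespace": "schema_1"},
    {"name": "table_1", "syncMode": "full_refresh_overwrite", "namespace": "schema_2"},
]

# Keyed by name only (assumed current behaviour): table_1 collides.
by_name = duplicate_keys(streams, ["name"])

# Keyed by (namespace, name) (requested behaviour): no collision.
by_namespace_and_name = duplicate_keys(streams, ["namespace", "name"])
```

Here `by_name` contains the single key for table_1, while `by_namespace_and_name` is empty, which is the uniqueness the server-api provided.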