airbytehq / terraform-provider-airbyte

Terraform Provider for Airbyte API
https://api.airbyte.com/
MIT License
42 stars 19 forks source link

Ability to update object with sync mode `incremental_deduped_history` #70

Open Santhin opened 8 months ago

Santhin commented 8 months ago

When using sync mode incremental_deduped_history, connection updates are not possible. Recreating the connection is required to make changes.

Please do NOT include a primary key configuration for this stream.

Relates to:

exactlyaaron commented 8 months ago

I think this may be related to one I opened as well: https://github.com/airbytehq/terraform-provider-airbyte/issues/31

necsord commented 8 months ago

Hi @ThomasRooney, @terencecho, would you be able to help with this one? Provider is quite unusable in current state.

Looking at the provider implementation, features of speakeasyapi.dev it looks as if it wasn't possible to define conditional parameter for request to Airbyte API, meaning Airbyte API would need to be adjusted to accept or ignore primary key parameter instead of throwing error when source already defined it.

Anyone can confirm? Would that kind of change in Airbyte API be even possible?

ThomasRooney commented 7 months ago

@necsord I can confirm your understanding. Currently Speakeasy's Provider generator assumes an API described in an OpenAPI spec is mostly idempotent. What this means in this case is that for all resource updates, we create an API request with the intersection of the full set of information that can be derived from resource state and is available in an UPDATE request (i.e. the operation with x-speakeasy-entity-operation: Connection#update in the OpenAPI spec). We do this because doing an UPDATE request lacking a property, usually is meaningful: often deleting that property from the remote resource.

I think what needs to happen here is either:

  1. The ConnectionPatchRequest in the OpenAPI spec is updated to no longer accept StreamConfiguration, but some form of StreamConfigurationPatch, with the primaryKey attribute dropped. If that happens, the provider generator would recognize that the lack of the primaryKey in the Connection Update request implies that a primaryKey is a CREATE-only attribute, and the connection would be recreated every time it changes (rather than attempted to patch a connection with one).
  2. Alternatively, the StreamConfigurationPatch could be delineated into multiple oneOf types that are distinct, to describe the different instantiations of stream configurations and avoid this problem whilst retaining behaviour: i.e. I don't know if the PK is needed when incremental_append is migrated to incremental_deduped_history in the ConnectionPatchRequest, or even if that's possible without recreating the connection.
  3. The API is updated to be idempotent.

If [1], or [2] is done with changes made to https://raw.githubusercontent.com/airbytehq/terraform-provider-airbyte/main/airbyte.yaml, make would regenerate the terraform provider with this behaviour adjusted.

I'm personally not sure what the best approach is. Speakeasy does its best to infer a terraform provider that matches the API as described by the OpenAPI spec: and where there is behaviour not really described by the OpenAPI specification (e.g. sensitive fields, hoisting, reconciliation of API requests/responses to a resource, runtime validations), to augment the OpenAPI spec with annotations that describe the desired behaviour. However we're not experts on the underlying API. Will defer to @terencecho to help guide us forward.

necsord commented 7 months ago

Thanks @ThomasRooney for your input. Re-creating connection is not an option as it would re-trigger whole sync from scratch which takes hours to complete. I'm not sure if it's possible to implement option 2 as depending on source configuration there should be a different StreamConfigurationPatch which I recon is impossible to implement in yaml file.

Looking forward to @terencecho input šŸ™šŸ»

We're also considering creating separate incremental resources configuration with dedicated StreamIncrementalConfigurationPatch-like object but that's a quick patch not a full fledged solution.

ThomasRooney commented 7 months ago

I'm not sure if it's possible to implement option 2 as depending on source configuration there should be a different StreamConfigurationPatch which I recon is impossible to implement in yaml file

The canonical way of describing these is a discriminated oneOf with object children -- where each source configuration type would live in an independent JSON Schema, and the StreamConfiguration would be defined as oneOf [StreamIncrementalAppendConfiguration, StreamIncrementalDedupedHistory, ...] and StreamConfigurationPatch would be defined as oneOf [StreamIncrementalAppendConfigurationPatch, StreamIncrementalDedupedHistoryPatch, ...]. The only extra thing we'd need to make this work is a discriminator being defined on StreamConfiguration and StreamConfigurationPatch to allow the generator to reconcile StreamIncrementalAppendConfiguration with StreamIncrementalAppendConfigurationPatch.

It's a little more complicated than what exists now in the API, but I think that would work to describe the API better

necsord commented 6 months ago

@ThomasRooney Thanks for your input, unfortunately I did not have time to check it up sooner.

Paths to the fields that will be used as primary key. This field is REQUIRED if destination_sync_mode is *_dedup unless it is already supplied by the source schema.

Source: PATCH /connections/{connectionId} API documentation

I'm not seeing a good discriminator field as I would need information about source, whether it supplies the primary_key. That would mean the only sane way is to update the API or re-do/update the terraform provider without SpeakEasy API.

Change with API would need to be raised separately within airbytehq/airbyte issues?

nurikk-sa commented 6 months ago

Same problem here, even with 0.4.1 provider version

nurikk-sa commented 6 months ago

For now can be workaround using lifecycle block

  lifecycle {
    # Workaround for https://github.com/airbytehq/terraform-provider-airbyte/issues/83
    # https://github.com/airbytehq/terraform-provider-airbyte/issues/70
    ignore_changes = [configurations.streams]
  }
necsord commented 6 months ago

@nurikk-sa What would happen when you're modifying, adding new streams to existing connection?

nurikk-sa commented 6 months ago

@nurikk-sa What would happen when you're modifying, adding new streams to existing connection?

no drift will happen due to order change, at least it will give some time until this issue is fixed

rshorser commented 6 months ago

@nurikk-sa Brilliant workaround for the ordering changes. Adding/removing streams to an existing connection (issue #70) would still require a delete/recreate of the resource right?

fabianboerner commented 6 months ago

For now can be workaround using lifecycle block

  lifecycle {
    # Workaround for https://github.com/airbytehq/terraform-provider-airbyte/issues/83
    # https://github.com/airbytehq/terraform-provider-airbyte/issues/70
    ignore_changes = [configurations.streams]
  }

Did not work for me im still getting this error.

*edit

i just put primary to [] and it worked

vincent-el commented 5 months ago

thank you @fabianboerner that was the only fix that worked for me!

to clarify, adjust the configurations like so, to make updates on the streams possible


resource "airbyte_connection" "hubspot_to_snowflake" {
  ...
  configurations = {
    streams = [
      {
        name = "contacts"
        sync_mode    = "incremental_deduped_history"
        primary_key = []
      },
      {
        name = "companies"
        sync_mode    = "incremental_deduped_history"
        primary_key = []
      },
    ]
  }
}

the downside is that every terraform plan will always show a change to the primary_key: image

fabianboerner commented 5 months ago

thank you @fabianboerner that was the only fix that worked for me!

to clarify, adjust the configurations like so, to make updates on the streams possible


resource "airbyte_connection" "hubspot_to_snowflake" {
  ...
  configurations = {
    streams = [
      {
        name = "contacts"
        sync_mode    = "incremental_deduped_history"
        primary_key = []
      },
      {
        name = "companies"
        sync_mode    = "incremental_deduped_history"
        primary_key = []
      },
    ]
  }
}

Yes but if you want to setup a connection where you canā€™t define all streams because they maybe have variable collections in mongo db you will have the same issue again. The only solution I had here was to set the connection to be recreated always but gui user will lose their setup they maybe had done later.

It makes the whole thing sometimes tedious.

Also if you didnā€™t have had setup the streams via terraform it will not ignore configuration[ā€œstreamsā€] and the default value is always full overwrite for stream update type wich works within airbyte but not against the api that throws an error.

one thing I could life with was the changes terraform would show that are not applied.

hongbo-miao commented 2 months ago

@szemek helped fix at https://github.com/airbytehq/terraform-provider-airbyte/pull/111 Hope Airbyte team can help merge it soon! ā˜ŗļø