airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Cache RegistryEntries for OSS and Cloud connector versions #26878

Closed bnchrch closed 1 year ago

bnchrch commented 1 year ago

Problem

For any OLD connector version, how do we get the registry entry that was in the catalog but no longer is?

Solution

Let's update our registry generator to cache connector entries.

For example, if you are looking for the last registry entry for destination-bigquery:1.2.19 on Cloud, you would go to

https://connectors.airbyte.com/files/metadata/airbyte/destination-bigquery/1.2.19/cloud.json
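As a sketch of how callers could address such a per-version cache, a small path helper; the bucket layout is taken from the URL above, but the function name and validation are hypothetical:

```python
def registry_entry_url(
    connector: str,
    version: str,
    registry: str,
    base: str = "https://connectors.airbyte.com/files/metadata/airbyte",
) -> str:
    """Build the URL of the cached registry entry for one connector version.

    `registry` is "oss" or "cloud"; each version gets its own frozen JSON
    file, so old entries survive after the connector moves to a new version.
    """
    if registry not in ("oss", "cloud"):
        raise ValueError(f"unknown registry: {registry}")
    return f"{base}/{connector}/{version}/{registry}.json"
```

For example, `registry_entry_url("destination-bigquery", "1.2.19", "cloud")` reproduces the URL above.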


OLD CONVERSATION BELOW


The Current System

  1. Metadata files only allow for registry-specific overrides.
  2. The connector registries can only have one connector version at a time (e.g. OSS can have source-s3:1.2.3 while Cloud has source-s3:2.0.0).
  3. The platform can only have one default version of an Actor Definition at a time. (This directly relates to the version in the registry.)
  4. The platform now allows individual Actors to reference Actor Definition Versions other than the default, e.g. source-s3:eds-prerelease-123.
  5. We want to extend Actor Definition Versions so that we can allow users to remain on old versions for a set window of time.
  6. Populating Actor Definition Versions is, at this moment, an awkward combination of current metadata, the version-specific spec cache, and whatever data was ingested for that version if it ever originally appeared in the Registry.
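For readers unfamiliar with the model, a rough sketch of the per-version record that point 6 struggles to populate. Every field name here is illustrative, inferred from properties mentioned later in this thread (dockerTag, releaseStage, spec), not the platform's real schema:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActorDefinitionVersion:
    """Illustrative sketch of an ADV record; the real platform model may differ."""

    actor_definition_id: str
    docker_repository: str   # e.g. "airbyte/source-s3"
    docker_tag: str          # e.g. "1.2.3" or "eds-prerelease-123"
    spec: dict               # connector spec, today pulled from the spec cache bucket
    release_stage: Optional[str] = None  # e.g. "generally_available"
```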

Questions for the team

The 6th point is the awkward turtle and we need to answer the question(s):

  1. If we move a connector on [oss|cloud] from v1 to v2 with a 6-month deprecation window, how do we ensure we freeze v1's metadata-related settings in time?
  2. If we add a connector + version to [oss|cloud] that the platform has never seen, how do we populate its metadata-related values?

Related Conversations

Normalization Overrides: https://airbytehq-team.slack.com/archives/C03VDJ4FMJB/p1685552842556829
Rollbacks: https://airbytehq-team.slack.com/archives/C03VDJ4FMJB/p1685567791734259

erohmensing commented 1 year ago

This goes for any place in which we don't already have an actor definition - the /update endpoints where OSS users can choose any version they like also apply here

erohmensing commented 1 year ago

If we move a connector on [oss|cloud] from v1 to v2 with a 6-month deprecation window, how do we ensure we freeze v1's metadata-related settings in time?

If everything lives on the ADV and not the actor definition, the breaking changes phase 0 should handle this correctly. What is currently not handling this correctly is the fact that we upsert the ADVs. There are 2 issues here

If we add a connector + version to [oss|cloud] that the platform has never seen, how do we populate its metadata related values?

If we do this via override or /update endpoint, I think we need to request this info from the metadata service.

evantahler commented 1 year ago

Grooming:

ADV Properties in Question:


Options:

  1. All the above are also needed in LD feature flags
    • "The LD flags are 'registry #3'" - which implies that these flags can contain all the ADV properties needed. "dockerTag" is just one of many bits of info that an ADV needs.
    • This means the LD feature flags are going to get more complex.
    • Old flag value -> "v1.2.3", new flag value -> {"dockerTag": "v1.2.3", "releaseStage": "generally_available"} - now in LD you provide JSON
  2. Pre-releases publish metadata file, and platform consumes these per-version files and builds ADV entry
    • Then, Dagster bundles pre-releases into a pre-release registry file (tagged by commit or the docker pre-release tag - oss_registry-GIT_BRANCH)
    • The reason to use the git branch is because you might change more than one connector
    • The registry is what you now pick in LD, not the tag of the docker image.
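Option 1 above would turn each LD flag value from a bare docker tag into a JSON object. A hypothetical normalizer that accepts both shapes during such a migration might look like this (the flag shapes come from the bullet above; the function name is an assumption):

```python
import json


def parse_override(flag_value: str) -> dict:
    """Normalize an LD override flag to a dict, accepting both shapes.

    Old shape: a bare docker tag string, e.g. 'v1.2.3'.
    New shape (option 1): a JSON object carrying ADV properties, e.g.
    '{"dockerTag": "v1.2.3", "releaseStage": "generally_available"}'.
    """
    try:
        parsed = json.loads(flag_value)
    except json.JSONDecodeError:
        # Legacy plain-tag flag: not valid JSON, treat it as the tag itself.
        return {"dockerTag": flag_value}
    if isinstance(parsed, dict):
        return parsed
    # A JSON-quoted string still only carries the tag.
    return {"dockerTag": str(parsed)}
```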

TODO:

--> SOMEONE WRITE A SMALL TECH SPEC PICKING WHAT WE SHOULD DO

evantahler commented 1 year ago

https://github.com/airbytehq/airbyte/issues/27077 for not updating ADV part

pedroslopez commented 1 year ago

In my mind, the "entry point" for a connector version from our registry should be its metadata - and the docker image just happens to be how we run that connector. So, to have a version override that says "v1.2.3" means go grab everything for this version (not just a "docker image override").

In the platform today when processing overrides we go to the spec cache bucket to grab the spec for the given version. I think this should be replaced with looking at our registry and pulling all the version-specific information from our versioned metadata.

If this were an API, I would imagine something like /connectors/source-faker/versions/1.2.3 and getting that info. With the current model of serving versioned metadata in files, this is sort of already available in YAML form at https://connectors.airbyte.com/files/metadata/airbyte/source-faker/2.1.0/metadata.yaml - though what's a bit weird is that this isn't the exact content that would have been returned in the registry served at https://connectors.airbyte.com/files/registries/v0/cloud_registry.json. There's some processing that happens: we fill in the spec, and I'm not sure what else.
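The two file locations contrasted in this comment can be captured with small URL builders; a sketch using only the paths quoted above (the function names are assumptions):

```python
BASE = "https://connectors.airbyte.com/files"


def versioned_metadata_url(connector: str, version: str) -> str:
    """Per-version metadata file, as served today (YAML form)."""
    return f"{BASE}/metadata/airbyte/{connector}/{version}/metadata.yaml"


def registry_url(registry: str) -> str:
    """The full registry file, which only ever lists the *current* versions."""
    return f"{BASE}/registries/v0/{registry}_registry.json"
```

The mismatch described above is that the first URL's content is not exactly what the second would have contained for that version, because of the extra processing (spec filling, etc.).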

In any case, I don't think a full registry (i.e. a list) is what we need, or any seeding that we have to do platform-side. I think what we need is more a way to request a "registry entry" for a specific version on demand.

The logic on the platform should follow the same as what happens for spec today, just filling in more information. As a refresher, when an override is found in LaunchDarkly, we:

  1. Check to see if it's already in the db. If it is, we use that.
  2. If it's not, we:
    i. Pull the spec from the spec cache bucket.
    ii. Construct a new ADV with the above spec and the requested image tag, and fill in the rest of the fields from the default version.
    iii. Persist the new version and use that.

So, we should replace step 2i with something that provides the rest of the fields (the metadata) and the rest should just work.
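The resolution flow above, with step 2i swapped for a registry-entry fetch, can be sketched as follows. `InMemoryDb` and `fetch_registry_entry` are stand-ins for the real platform persistence and metadata service; nothing here is the actual platform code:

```python
class InMemoryDb:
    """Stand-in for the platform's ADV persistence."""

    def __init__(self, default_version: dict):
        self._default = default_version
        self._versions: dict = {}

    def get(self, definition_id: str, docker_tag: str):
        return self._versions.get((definition_id, docker_tag))

    def default_version(self, definition_id: str) -> dict:
        return self._default

    def put(self, definition_id: str, docker_tag: str, adv: dict) -> None:
        self._versions[(definition_id, docker_tag)] = adv


def resolve_version(db, fetch_registry_entry, definition_id: str, docker_tag: str) -> dict:
    """Override resolution with step 2i replaced by a registry-entry fetch."""
    # 1. Already in the db? Use that.
    existing = db.get(definition_id, docker_tag)
    if existing is not None:
        return existing
    # 2i (replaced): pull the full per-version registry entry,
    # not just the spec from the spec cache bucket.
    entry = fetch_registry_entry(definition_id, docker_tag)
    # 2ii: build the new ADV, filling remaining fields from the default version.
    default = db.default_version(definition_id)
    adv = {**default, **entry, "dockerTag": docker_tag}
    # 2iii: persist the new version and use it.
    db.put(definition_id, docker_tag, adv)
    return adv
```

A second resolution of the same tag hits the db and never calls the metadata service, which is the caching behavior step 1 describes.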

erohmensing commented 1 year ago

I have a similar worldview to the above! It would make the platform rely on our service a lot more, but I think that's probably an inevitability?

We should do the same if we get an OSS /update to a version that doesn't already exist in the DB (e.g. pre-releases). The user would have to be online to grab the pre-release metadata, but they'd have to be online anyway to get the docker image. If they can hit the API when they pull the docker image, we could find a way to make that work.

bnchrch commented 1 year ago

Grooming notes:

evantahler commented 1 year ago

Grooming: