Tell us about the problem you're trying to solve
We currently auto-generate the low-code component JSON schema from the dataclass definitions using a helper library called dataclasses-json. This has some nice benefits, like always being up to date. However, the main drawback is that it couples the internal implementation to the public schema interface exposed to developers. It also produces a very bloated and confusing schema file that is not suitable for human interpretation.
Describe the solution you’d like
As the first step in refactoring our manifest parsing implementation, we need to handwrite a YAML schema that represents all of the declarative components that can be included in a connector manifest. This schema should be functionally equivalent to the schema that is currently being generated. We should then be able to swap the auto-generated schema for the handwritten schema when performing manifest validation. Subsequent tickets will handle turning the handwritten schema into Pydantic models.
Implementation Details
We should handwrite the schema based on the components and relationships within the declarative framework. This issue only deals with writing the language. To make sure nothing gets missed, we should adhere to this checklist of components that need to be defined. For each definition, additionalProperties should be set to false to ensure extraneous fields are not added.
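Both rules above (every checklist component has a definition, and every definition sets additionalProperties to false) could be checked mechanically once the schema is loaded into a dictionary. The following is a minimal sketch, not the CDK's implementation. It assumes the handwritten schema keeps its components under a top-level `definitions` key. The function name `lint_schema`, its `exempt` parameter, and the sample field names are hypothetical and only illustrate the idea.

```python
from typing import Any, Dict, Iterable, List


def lint_schema(
    schema: Dict[str, Any],
    checklist: Iterable[str],
    exempt: Iterable[str] = (),
) -> List[str]:
    """Return human-readable problems found in a handwritten component schema.

    Assumption: components live under a top-level "definitions" key.
    `exempt` names definitions that are allowed to stay open
    (for example, custom component definitions).
    """
    definitions = schema.get("definitions", {})
    exempt = set(exempt)
    problems = []

    # Every component on the checklist must have a definition.
    for name in checklist:
        if name not in definitions:
            problems.append(f"missing definition: {name}")

    # Every non-exempt definition must explicitly reject extraneous fields.
    for name, definition in definitions.items():
        if name in exempt:
            continue
        if definition.get("additionalProperties") is not False:
            problems.append(f"{name}: additionalProperties is not false")

    return problems


# Illustrative fragment, written as the dictionary a YAML loader would produce.
# The field names are placeholders, not the real component definitions.
sample_schema = {
    "definitions": {
        "ApiKeyAuthenticator": {
            "type": "object",
            "required": ["type", "api_token"],
            "additionalProperties": False,
            "properties": {"type": {}, "api_token": {}},
        },
        "DefaultPaginator": {
            "type": "object",
            "properties": {"type": {}, "page_size": {}},
        },
    }
}

problems = lint_schema(
    sample_schema,
    checklist=["ApiKeyAuthenticator", "DefaultPaginator", "DeclarativeStream"],
)
for problem in problems:
    print(problem)
```

Running a check like this in CI would catch a definition that was forgotten or left open before the handwritten schema replaces the generated one.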
The resulting schema should be stored in the airbyte-cdk package. When validating a manifest, the schema should be read in as a dictionary and used for the validation. We can then remove the auto schema generation code from the factory.
We also need to continue supporting custom component definitions, which can be accomplished by adding additional definitions (e.g. CustomAuthenticator) with additionalProperties set to true.
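To illustrate how a closed definition and an open custom definition behave differently during validation, here is a small sketch that works on the schema as a plain dictionary. It is a deliberately simplified stand-in for a full JSON Schema validator: it checks only `required` and `additionalProperties`. It also assumes that each component names its definition in a `type` field. The schema contents, the `validate_component` helper, and the `retries` and `class_name` fields are all hypothetical.

```python
from typing import Any, Dict, List

# Illustrative schema, written as the dictionary a YAML loader would produce.
SCHEMA = {
    "definitions": {
        "ApiKeyAuthenticator": {
            "type": "object",
            "required": ["type", "api_token"],
            "additionalProperties": False,
            "properties": {"type": {}, "api_token": {}},
        },
        # Custom components stay open so developers can pass their own fields.
        "CustomAuthenticator": {
            "type": "object",
            "required": ["type", "class_name"],
            "additionalProperties": True,
            "properties": {"type": {}, "class_name": {}},
        },
    }
}


def validate_component(component: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Check one component against the definition named by its "type" field.

    Simplified on purpose: only "required" and "additionalProperties" are
    enforced. A real validator library covers the rest of JSON Schema.
    """
    name = component.get("type")
    definition = schema["definitions"].get(name)
    if definition is None:
        return [f"unknown component type: {name!r}"]

    errors = [
        f"{name}: missing required field '{field}'"
        for field in definition.get("required", [])
        if field not in component
    ]

    # Extra fields are rejected only when the definition is explicitly closed.
    if definition.get("additionalProperties") is False:
        allowed = set(definition.get("properties", {}))
        for field in sorted(set(component) - allowed):
            errors.append(f"{name}: unexpected field '{field}'")
    return errors


strict = {"type": "ApiKeyAuthenticator", "api_token": "secret", "retries": 3}
custom = {"type": "CustomAuthenticator", "class_name": "source_example.MyAuth", "retries": 3}

strict_errors = validate_component(strict, SCHEMA)  # the extra field is rejected
custom_errors = validate_component(custom, SCHEMA)  # the extra field is accepted
print(strict_errors)
print(custom_errors)
```

The same extra field fails validation on the closed definition and passes on the custom one, which is the behavior the two additionalProperties settings are meant to produce.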
Low Code Schema Refactor Phase 1
Acceptance Criteria