DataWorkz-NL / KubeETL

ETL controller for Kubernetes
Apache License 2.0

Define a Source/Sink API #1

Closed by Blokje5 3 years ago

Blokje5 commented 4 years ago

KubeETL should make it easy for Data Engineers and Data Scientists to create ETL pipelines. This requires connection configuration, and as ETL projects scale, source/sink configuration often becomes a mess.

By providing an API Kind for Sources/Sinks (or Connectors?) we can add the following to the project:

Eventually we can also add more complex functionality, such as regularly scheduled Data Quality checks on sources.

A basic Source/Sink should at least contain the following information:

For now there is no need for a controller, although that could change in the future. We just use the API object as a way to store information.

ThijsKoot commented 4 years ago

Perhaps we should distinguish Connectors from Sources/Sinks, with a Connector being the service/server/whatever that hosts multiple Sources/Sinks. This would tidy up the coupling to Authentication: Authentication is attached to a Connector, which eliminates the need to define Authentication info for each Source/Sink.

This setup could also reduce complexity, as one can opt to use just a Connector with Authentication, without specifying a Source/Sink, in cases where Source/Sink concepts are either not applicable or not implemented.
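To make the idea concrete, here is a rough sketch of the relationship in plain Python. It is only an illustration of the data model, not the actual API types: the class names, the fields (`url`, `secret_name`, `path`), and the `resolve_authentication` helper are all made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Authentication:
    # Hypothetical field: name of a Kubernetes Secret holding the credentials.
    secret_name: str


@dataclass(frozen=True)
class Connector:
    # The service/server that hosts zero or more Sources/Sinks.
    name: str
    url: str  # hypothetical field
    authentication: Optional[Authentication] = None


@dataclass(frozen=True)
class SourceSink:
    # A Source or Sink only references its Connector by name;
    # it carries no Authentication info of its own.
    name: str
    connector: str
    path: str  # hypothetical field, e.g. a table or a file path


def resolve_authentication(
    source_sink: SourceSink, connectors: Dict[str, Connector]
) -> Optional[Authentication]:
    """Return the Authentication a Source/Sink inherits from its Connector."""
    return connectors[source_sink.connector].authentication


# One Connector, defined once, shared by two Sources/Sinks.
warehouse = Connector(
    name="warehouse",
    url="postgres://db.example:5432",
    authentication=Authentication(secret_name="warehouse-credentials"),
)
connectors = {warehouse.name: warehouse}

orders = SourceSink(name="orders", connector="warehouse", path="public.orders")
customers = SourceSink(name="customers", connector="warehouse", path="public.customers")

# Both resolve to the same Authentication without repeating it.
assert resolve_authentication(orders, connectors) == resolve_authentication(customers, connectors)
```

A Connector can also be used on its own here, since nothing requires a `SourceSink` to exist for it.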

This would create the following Kinds: