This PR provides a way to keep the same structure across all ETL flows/processes by introducing a new class called ETLProcess. ETLProcess is a base class to inherit from when defining ETL processes. ETL processes need a standardized structure and shared common logic, and the base class provides both. In addition, two base classes are provided for a parameter object (ETLParameters) and a call arguments object (ETLCallArgs).
Background: from experience, a good way to build ETL processes is to structure them as follows:
Orchestrator. One file containing the orchestrators for a given job: a batch orchestrator and/or an incremental orchestrator.
Parameters. One file containing all parameterization around the job, e.g., name of the sink table, secret names, columns to merge on, etc.
Transformer. Optionally, one or more files containing business-specific logic to transform the data.
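The three parts above can be sketched in a single snippet. This is only an illustration: the names CustomerJobParameters, normalize_names, and run_batch, as well as the field values, are hypothetical and not part of this PR, and plain Python lists stand in for the real data frames.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


# parameters file (hypothetical): all parameterization for one job in one place
@dataclass(frozen=True)
class CustomerJobParameters:
    sink_table: str = "analytics.customers"          # assumed sink table name
    secret_name: str = "customer-source-key"         # assumed secret name
    merge_columns: Tuple[str, ...] = ("customer_id",)  # assumed merge keys


# transformer file (hypothetical): business-specific logic only
def normalize_names(rows: List[Dict]) -> List[Dict]:
    """Trim whitespace and title-case the 'name' field of every row."""
    return [{**row, "name": row["name"].strip().title()} for row in rows]


# orchestrator file (hypothetical): wires parameters and transformer together
def run_batch(rows: List[Dict], params: CustomerJobParameters = CustomerJobParameters()) -> Dict:
    """Batch orchestrator: transform all rows and describe the intended write."""
    transformed = normalize_names(rows)
    return {
        "sink_table": params.sink_table,
        "merge_on": list(params.merge_columns),
        "rows": transformed,
    }
```

Keeping the parameters in their own object means the orchestrator never hard-codes table or secret names, which is what makes them replaceable in tests.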
Outcome: when inheriting from the ETLProcess base class, the following applies:
The underlying ETL process must conform to a certain structure, that is, it must implement certain methods.
Call arguments (job command line arguments from Databricks) are parsed by one uniform standard (the job parameter helper utility, also provided with this PR).
Further logic that should apply across all ETL processes can be added to the ETLProcess base class. This also makes refactoring around the ETL process easier.
When testing, it is possible to provide test-specific ETLParameters and ETLCallArgs objects. This gives the ability to "mock" different parameters and/or call arguments.
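A minimal sketch of how these points could fit together is shown below. It is not the implementation in this PR: the method names extract, transform, load, and run, the sink_table field, and the --run-date argument are assumptions made for illustration, and argparse stands in for the job parameter helper utility.

```python
import argparse
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ETLParameters:
    """Static job settings (the field shown here is an assumption)."""
    sink_table: str = "sink_table"


class ETLCallArgs:
    """Parses job command line arguments in one uniform way (assumed shape)."""

    def __init__(self, argv: Optional[List[str]] = None) -> None:
        parser = argparse.ArgumentParser()
        parser.add_argument("--run-date", default="1970-01-01")  # hypothetical argument
        # parse_known_args tolerates extra arguments the job runner may append
        known, _unknown = parser.parse_known_args(argv)
        self.run_date = known.run_date


class ETLProcess(ABC):
    """Base class: subclasses must implement the abstract methods below."""

    def __init__(self, params: ETLParameters, call_args: ETLCallArgs) -> None:
        self.params = params
        self.call_args = call_args

    @abstractmethod
    def extract(self): ...

    @abstractmethod
    def transform(self, data): ...

    @abstractmethod
    def load(self, data): ...

    def run(self):
        # Logic shared by all ETL processes (logging, timing, ...) would go here
        return self.load(self.transform(self.extract()))


class UppercaseProcess(ETLProcess):
    """Toy process used to show the contract; not a real job."""

    def extract(self):
        return ["a", "b"]

    def transform(self, data):
        return [value.upper() for value in data]

    def load(self, data):
        return {"table": self.params.sink_table,
                "run_date": self.call_args.run_date,
                "rows": data}
```

In a test, the process can be built with test-specific objects, for example UppercaseProcess(ETLParameters(sink_table="test_table"), ETLCallArgs(["--run-date", "2024-01-01"])).run(). A subclass that omits one of the abstract methods cannot be instantiated, which is how the structure is enforced in this sketch.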
Feel free to reach out for a more in-depth explanation and discussion.
Martin Bøge