FlowETL's current approach for adding QA checks to a DAG is to scan through all files in the template_searchpath, and create QACheckOperator tasks for all *.sql templates that contain "qa_checks" somewhere in the (relative) filepath. This is convenient for dynamically picking up QA checks, but has some downsides:
- The approach of searching through all templates is a little fragile. In particular, if the same QA check file appears relative to multiple searchpaths (e.g. if both /path/to/dag_folder and /path/to/dag_folder/subdir are in the template searchpath), that QA check will be picked up twice.
- The requirement for "qa_checks" to appear in the relative path can lead to some non-intuitive behaviour, e.g. if I define a QA check in /my/additional/qa_checks/my_check.sql and then set additional_qa_check_paths=/my/additional/qa_checks/, my QA check will not be picked up.
- Because the default QA checks are not in the default template searchpath, we implicitly add <flowetl_module_install_path>/qa_checks to the template searchpath. Since the template searchpath setting applies to the entire DAG, there's potential for this to have unintended side-effects on template fetching in other tasks within the DAG.
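The first two downsides can be reproduced with a small stand-alone sketch. This is not FlowETL's actual implementation: `discover_qa_checks` and `demo` are hypothetical names, and the sketch only assumes the behaviour described above (scan every searchpath for *.sql files and keep those whose relative path contains "qa_checks").

```python
import tempfile
from pathlib import Path


def discover_qa_checks(searchpaths):
    """Hypothetical stand-in for the current discovery rule.

    Returns (searchpath, relative template path) pairs for every *.sql file
    whose path relative to a searchpath contains "qa_checks".
    """
    found = []
    for searchpath in searchpaths:
        root = Path(searchpath)
        for sql_file in sorted(root.rglob("*.sql")):
            relative = sql_file.relative_to(root)
            if "qa_checks" in str(relative):
                found.append((str(root), relative.as_posix()))
    return found


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        dag_folder = Path(tmp) / "dag_folder"
        check = dag_folder / "subdir" / "qa_checks" / "my_check.sql"
        check.parent.mkdir(parents=True)
        check.write_text("SELECT 1")
        # The same file is reachable from two searchpaths, so it is found twice.
        duplicated = discover_qa_checks([dag_folder, dag_folder / "subdir"])
        # The searchpath is the qa_checks directory itself, so the relative
        # path is just "my_check.sql" and the check is not found at all.
        missed = discover_qa_checks([check.parent])
    return [relative for _, relative in duplicated], missed
```

Calling `demo()` returns two relative paths for the single check file in the first case, and an empty list in the second.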
Aside from these potential pitfalls, there are other features that might be nice to have which are not possible with the current approach. E.g.:
- Excluding some QA checks (e.g. if some of the default QA checks are not applicable because they apply to a field that isn't in the data we're ingesting, or if we want to run some of the checks at one stage in a DAG and the rest at a later stage)
- Overriding one of the default QA checks with a modified check in some circumstances
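One possible shape for the exclude and override features is sketched below. Everything here is an assumption for illustration: `select_qa_checks`, its parameters, and the check names and paths in the usage example are hypothetical and not part of FlowETL.

```python
def select_qa_checks(default_checks, additional_checks=None, exclude=()):
    """Combine default and user-supplied checks (hypothetical interface).

    Both mappings go from check name to template path. A user-supplied check
    with the same name as a default check overrides it, and any name listed
    in `exclude` is dropped from the result.
    """
    selected = dict(default_checks)
    selected.update(additional_checks or {})
    for name in exclude:
        selected.pop(name, None)
    return selected


# Usage with made-up check names and paths:
defaults = {
    "check_a": "defaults/check_a.sql",
    "check_b": "defaults/check_b.sql",
}
custom = {"check_b": "custom/check_b.sql"}  # overrides the default check_b
selected = select_qa_checks(defaults, custom, exclude=["check_a"])
```

Keying checks by name rather than by file path is what makes both exclusion and overriding straightforward in this sketch. Running different subsets at different stages of a DAG would then amount to calling the function with different `exclude` lists.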
It would be good to re-design the QA check discovery mechanism to avoid these shortcomings and allow more explicit control over check discovery when it's useful, without losing all of the convenience of the dynamic QA check discovery and injection of default QA checks.
Ideally the solution here would no longer rely on us knowing the location of the DAG folder during DAG creation, so that we can remove the hack introduced in https://github.com/Flowminder/FlowKit/pull/6496.
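As a rough illustration of what a less fragile discovery step might look like, the sketch below takes an explicit list of check directories, treats every *.sql file inside them as a QA check regardless of whether "qa_checks" appears in the path, and de-duplicates by resolved absolute path. It does not touch the DAG's template searchpath. The names `collect_qa_checks` and `demo_collect` are hypothetical, and this is only one of several possible designs.

```python
import tempfile
from pathlib import Path


def collect_qa_checks(check_dirs):
    """Collect every *.sql file under the given directories exactly once."""
    unique = set()
    for check_dir in check_dirs:
        for sql_file in Path(check_dir).rglob("*.sql"):
            # Resolving to an absolute path means a file reachable from
            # overlapping directories is only counted once.
            unique.add(sql_file.resolve())
    return sorted(unique)


def demo_collect():
    with tempfile.TemporaryDirectory() as tmp:
        dag_folder = Path(tmp) / "dag_folder"
        nested = dag_folder / "subdir" / "my_check.sql"
        nested.parent.mkdir(parents=True)
        nested.write_text("SELECT 1")
        extra = Path(tmp) / "additional" / "other_check.sql"
        extra.parent.mkdir()
        extra.write_text("SELECT 1")
        # dag_folder and dag_folder/subdir overlap, and neither directory
        # name contains "qa_checks".
        checks = collect_qa_checks(
            [dag_folder, dag_folder / "subdir", extra.parent]
        )
        return [path.name for path in checks]
```

Here `demo_collect()` yields each of the two check files once, even though one of them sits under two of the listed directories.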