snopoke closed this issue 5 years ago
Love both the new CEP format and this idea. Seems like a great improvement, especially if we can make the form fact components more incremental
This is a great idea!
Do both the app status synclog fact and the app status form fact update the same rows, or are they going to create separate rows?
@sravfeyn they update the same rows but in separate queries: https://github.com/dimagi/commcare-hq/blob/9dde804ff259bab84268ebedb3bd6497764640e0/corehq/warehouse/transforms/sql/app_status_fact.sql
I started looking into whether or not we even need the warehouse. I think that it made sense when we were planning to have all reports pull from it, but I'm not sure it's worth the overhead if it's only going to be the app status report.
I'm interested to hear your thoughts on https://github.com/dimagi/commcare-hq/pull/25594 as a potential way to remove the extra work/infrastructure needed by the warehouse. I think that the pros outweigh the cons, but I'm not sure if I'm missing something important.
I added an ADR for removing the warehouse in https://github.com/dimagi/commcare-hq/pull/25652.
Should this issue be closed out?
Abstract
Split up the App Status DAG in the data warehouse into smaller, more atomic units.
Motivation
Each DAG in the warehouse consists of multiple steps, and each step must complete successfully for the entire DAG to be marked as a success. The more steps a DAG has, the greater the chance of failure. If a DAG fails, each task is retried from the beginning, meaning that any incremental successes are lost.
The App Status DAG is particularly susceptible to this problem since its steps are very long and a lot of work gets lost when the process fails.
Specification
The current aggregation workflow for the App Status data is as follows:
Failure at any stage requires completely re-loading all the staging data and re-executing all of the previous ETL steps, regardless of whether or not they succeeded.
To solve this problem, the workflow should be split into four loosely coupled workflows:
Form and Synclog workflows
These are very straightforward workflows that follow the simple pattern of loading data into staging tables and then writing that data into the fact table, with some transformation and linking along the way.
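The staging-then-fact pattern above can be sketched as follows. This is a minimal illustration using an in-memory SQLite database; the table and column names (form_staging, user_dim, form_fact) are hypothetical stand-ins, not the actual warehouse schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE form_staging (form_id TEXT, user_id TEXT, received_on TEXT);
    CREATE TABLE user_dim (user_dim_id INTEGER PRIMARY KEY, user_id TEXT);
    CREATE TABLE form_fact (form_id TEXT PRIMARY KEY, user_dim_id INTEGER, received_on TEXT);
""")

# Step 1: load raw data into the staging table.
conn.executemany(
    "INSERT INTO form_staging VALUES (?, ?, ?)",
    [("f1", "u1", "2020-01-01"), ("f2", "u2", "2020-01-02")],
)
conn.execute("INSERT INTO user_dim (user_id) VALUES ('u1'), ('u2')")

# Step 2: write the staged rows into the fact table, joining to the
# user dimension as a stand-in for the "transformation and linking".
conn.execute("""
    INSERT OR REPLACE INTO form_fact (form_id, user_dim_id, received_on)
    SELECT s.form_id, d.user_dim_id, s.received_on
    FROM form_staging s JOIN user_dim d ON d.user_id = s.user_id
""")

rows = conn.execute(
    "SELECT form_id, user_dim_id FROM form_fact ORDER BY form_id"
).fetchall()
print(rows)  # -> [('f1', 1), ('f2', 2)]
```

Because the fact-table write is a single idempotent upsert from staging, re-running the step after a failure is safe.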
App status workflows
The App Status workflow differs only in that the data to populate the staging tables is taken from other fact tables instead of from CommCare HQ models.
The requirements for making this work are as follows:
All data required to generate each of the app status workflows must be available in the respective fact table.
A mechanism is required that will allow the app status workflows to only process new data (data that has not been processed since the last successful run). This appears to be the main reason that data is taken from the form and synclog staging tables since those tables only contain data from the most recent batch.
To satisfy the second requirement we can use the Batch records created for the form and synclog processes. The fact tables already contain an indexed batch ID column that is updated whenever a row changes. We can use this to filter the fact tables to only the data that has changed since the last successful run of the App Status workflow.
The pseudocode below illustrates this concept:
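A hedged sketch of the idea, using an in-memory SQLite database: select only the fact rows whose batch_id is newer than the last successful app-status batch, and bound received_on so the planner can prune partitions. The table layout and the last_successful_batch bookkeeping are illustrative assumptions, not the real warehouse schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE form_fact (form_id TEXT, batch_id INTEGER, received_on TEXT);
    CREATE INDEX ix_form_fact_batch ON form_fact (batch_id);
""")
conn.executemany(
    "INSERT INTO form_fact VALUES (?, ?, ?)",
    [
        ("f1", 1, "2020-01-01"),  # already processed by a previous app-status run
        ("f2", 2, "2020-01-05"),  # changed since the last successful run
        ("f3", 2, "2020-01-06"),
    ],
)

# Recorded when the App Status workflow last succeeded (hypothetical bookkeeping).
last_successful_batch = 1
# received_on window matching the new batch, so partitioned tables
# outside this range can be skipped by the query planner.
batch_start, batch_end = "2020-01-04", "2020-01-07"

new_rows = conn.execute(
    """
    SELECT form_id FROM form_fact
    WHERE batch_id > ?
      AND received_on >= ? AND received_on < ?
    ORDER BY received_on
    """,
    (last_successful_batch, batch_start, batch_end),
).fetchall()
print([r[0] for r in new_rows])  # -> ['f2', 'f3']
```

If the App Status workflow fails, only this incremental slice needs to be re-queried; the form and synclog fact tables are untouched.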
In addition to filtering on batch_id, we will also need to add a filter on received_on, which is used to partition the form_fact table. This will allow the query planner to skip partitions that fall outside the queried range.
Impact on users
This is an internal change that should not impact end users except that it should make the warehouse ETL process more reliable.
Impact on hosting
Since these changes span both CommCare HQ and Airflow, a multi-stage rollout will be necessary. However, since only Dimagi-controlled environments run the data warehouse, the rollout can be fully managed by Dimagi, so no extended support windows will be required.
Backwards compatibility
The multi-phase rollout will ensure backwards compatibility until all environments have been upgraded. The rollout process is described below:
Release Timeline
No specific timeline.
Open questions and issues
None