dimagi / commcare-hq

CommCareHQ is the server backend for CommCare, the world's largest platform for designing, managing, and deploying robust, offline-first, mobile applications to frontline workers worldwide
https://www.dimagi.com/open-source/
BSD 3-Clause "New" or "Revised" License
498 stars 217 forks source link

[CEP] Split data warehouse App Status DAG into smaller units #25468

Closed snopoke closed 5 years ago

snopoke commented 5 years ago

Abstract

Split up the App Status DAG in the data warehouse into smaller, more atomic units.

Motivation

Each DAG in the warehouse consists of a multiple steps. Each step must complete successfully in order for the entire DAG to be marked as a success. The more steps there are in a DAG the more chance there is of failure. If a DAG fails, each task is retried from the beginning meaning that any incremental successes were lost.

The App Status DAG is particularly susceptible to this problem since the steps are very long and there is a lot of work that get's lost when the process fails.

Specification

The current aggregation workflow for the App Status data is as follows:

app status full

Failure at any stage required completely re-loading all the staging data and re-executing any of the previous ETL steps, regardless of whether they were successful or not.

To solve the problem the workflow should be split up in to 4 loosely coupled workflows:

Form and Synclog workflow

These are very straight forward worflows that follow the simple pattern of loading data into staging tables and then writing that data into the fact table with some transformation and linking.

form synclog

App status workflows

The App Status workflow differs only in that the data to populate the staging tables is taken from other fact tables instead of from CommCare HQ models.

app status spearated

The requirements for making this work are as follows:

  1. All data required to generate each of the app status workflows must be available in the respective fact table.

    • This is already the case
  2. A mechanism is required that will allow the app status workflows to only process new data (data that has not been processed since the last successful run). This appears to be the main reason that data is taken from the form and synclog staging tables since those tables only contain data from the most recent batch.

To solve the second requirement we can use the Batch records created for the form and synclog processes. The fact tables already contain an indexed column with the batch ID which is updated whenever a row is changed. We can use this to filter the fact tables for only data that has been changed since the last successful run of the App Status workflow.

The pseudo code below illustrates this concept:

app_status_form_batch_from_date = A
app_status_form_batch_to_date = B

form_fact_batches = get_successful_batch_ids('form_fact', gt=A, lte=B)

# filter fact data based on batch IDs
populate_app_status_form_staging(form_fact_batches)

In addition to filtering on batch_id we will also need to add a filter on received_on which is used to partition the form_fact table. This will allow the query planner to ignore partition tables out of range.

app_status_form_batch_from_date = A
app_status_form_batch_to_date = B

form_fact_batches = get_successful_batches('form_fact', gt=A, lte=B)

earliest_received_on = form_fact_batches[0].start_datetime

form_fact_batch_ids = [batch.id for batch in form_fact_batches]
# filter fact data based on batch IDs
populate_app_status_form_staging(form_fact_batch_ids, received_on__gte=earliest_received_on)

Impact on users

This is an internal change that should not impact end users except that it should make the warehouse ETL process more reliable.

Impact on hosting

Since these changes need to be made across CommCare HQ as well as Airflow it will be necessary to do a multi-stage rollout. However since only Dimagi controlled environments are running the data warehouse this can be fully controlled by Dimagi so there will be no extended support windows required.

Backwards compatibility

The multi-phase rollout will ensure backwards compatibility until all environments have been upgraded. The rollout process is described below:

  1. Create and deploy new ETL classes for the updated workflow.
  2. Update airflow to use the new workflow and ETL classes.
  3. Remove old ETL classes.

Release Timeline

No specific timeline.

Open questions and issues

None

calellowitz commented 5 years ago

Love both the new CEP format and this idea. Seems like a great improvement, especially if we can make the form fact components more incremental

sravfeyn commented 5 years ago

This is great idea!

Both app status synclog fact and app status form fact are same rows, right? Or are they going to create separate rows?

snopoke commented 5 years ago

@sravfeyn they update the same rows but in separate queries: https://github.com/dimagi/commcare-hq/blob/9dde804ff259bab84268ebedb3bd6497764640e0/corehq/warehouse/transforms/sql/app_status_fact.sql

emord commented 5 years ago

I started looking into whether or not we even need the warehouse. I think that it made sense when we were planning to have all reports pull from it, but I'm not sure its worth the overhead if its only going to be the app status report.

I'm interested to hear your thoughts on https://github.com/dimagi/commcare-hq/pull/25594 as a potential way to remove the extra work/infrastructure needed by the warehouse. I think that the pros outweigh the cons, but not sure if I'm missing something important

emord commented 5 years ago

I added an ADR for removing the warehouse in https://github.com/dimagi/commcare-hq/pull/25652.

Should this issue be closed out?