dimagi / commcare-hq

CommCareHQ is the server backend for CommCare, the world's largest platform for designing, managing, and deploying robust, offline-first mobile applications to frontline workers worldwide.
https://www.dimagi.com/open-source/
BSD 3-Clause "New" or "Revised" License

[CEP] Migrate RepeaterRecord to SQL #28314

Open kaapstorm opened 4 years ago

kaapstorm commented 4 years ago

Abstract

Using SQL for repeat records offers significant performance improvements for Data Forwarding.

Motivation

More projects are using Data Forwarding, and the volume of payloads being sent is increasing. This increases the negative impact of current inefficiencies.

HQ is becoming more vulnerable to the effects of downtime in remote systems. As our integration offerings around FHIR, COVID-19 and vaccinations expand, these risks compound.

Specification (updated 2020-12-10)

Current state

There are two execution paths to forwarding a repeat record payload:

  1. A Django signal (e.g. on form submission, or app change) triggers a repeat record to be registered, and RepeatRecord.attempt_forward_now() is called.

  2. The check_repeaters() scheduled task iterates repeat record instances, and calls RepeatRecord.attempt_forward_now().

A series of calls, spanning tasks and the RepeatRecord and Repeater models, send the payload and handle the response or failure.
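
To make the two paths concrete, here is a minimal sketch of the current flow. The signatures and helper names (`process_repeat_record`, `iter_due_repeat_records`) are illustrative assumptions, not HQ's exact code:

```python
# A minimal sketch of the two execution paths described above.
from celery import shared_task

@shared_task
def process_repeat_record(repeat_record_id):
    record = RepeatRecord.get(repeat_record_id)  # Couch document lookup
    record.fire()  # send the payload; handle the response or failure

def attempt_forward_now(record):
    # Path 1: called from a Django signal handler (form submission,
    # app change) right after the repeat record is registered.
    process_repeat_record.delay(record.record_id)

@shared_task
def check_repeaters():
    # Path 2: scheduled task; iterates every repeat record due for a
    # retry, with no knowledge of the state of its Repeater.
    for record in iter_due_repeat_records():  # assumed helper
        attempt_forward_now(record)
```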

Problems with current state

There are three low-hanging problems with this approach; the first two relate to Couch, and the third to hindsight:

  1. RepeatRecord instances are iterated independently of their Repeater: we iterate the repeat records of paused repeaters. In the past we managed to improve performance noticeably just by delaying the next forwarding attempt for the repeat records of paused repeaters by a week instead of a day. With SQL relationships, that workaround would be unnecessary: we could simply filter out the repeat records of paused repeaters. (And when a repeater is unpaused, we could forward its repeat records immediately, aligning behaviour with user expectations.) See the sketch after this list.

  2. RepeatRecord instances are forwarded independently of their peers: e.g. if an attempt to forward a new form submission fails, its repeat record is scheduled to be retried an hour later (then three hours, nine hours, etc.). If a second form is submitted a moment after the first, we try to forward it immediately, unaware that an attempt to the same remote service failed a moment ago. It will probably fail too. And if it succeeds, that payload will have been forwarded out of sequence. Again, SQL relationships make it trivial to associate repeat records with each other:

    1. We can batch forwarding attempts at consistent intervals.

    2. And we can always send payloads in the sequence in which they were registered.

  3. If we refactor RepeatRecord.fire() from a method to a function, and pull in some of the code that is spread across Repeater and RepeatRecord, we can simplify the code for sending a payload quite a bit.
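
To make points 1 and 2 concrete, here is a sketch of the queries SQL relationships would allow, assuming hypothetical Django models `SQLRepeater` (with an `is_paused` flag and a `repeat_records` related name) and `SQLRepeatRecord` (with `repeater`, `registered_at` and `next_check` fields):

```python
# Hypothetical Django models and field names, for illustration only.
from django.utils import timezone

def iter_unpaused_repeater_ids():
    # Problem 1: skip the repeat records of paused repeaters entirely.
    return (SQLRepeater.objects
            .filter(is_paused=False,
                    repeat_records__next_check__lte=timezone.now())
            .distinct()
            .values_list('id', flat=True))

def due_records_in_sequence(repeater_id, batch_size=1000):
    # Problem 2: forward a repeater's records together, in the order
    # they were registered, one batch per interval.
    return (SQLRepeatRecord.objects
            .filter(repeater_id=repeater_id,
                    next_check__lte=timezone.now())
            .order_by('registered_at')[:batch_size])
```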

Proposed changes

New models and tasks
Old repeat records

Couch code remains in place for existing repeat records. Couch repeat records are soft-migrated to SQL as they become due for retrying, and are forwarded using the new process_repeater() task.

After seven days (MAX_RETRY_WAIT), the only repeat records not yet migrated will be old succeeded or cancelled records. These can be migrated using a management command, ignored, or deleted.
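
A sketch of what the soft-migration step might look like, assuming a `couch_id` back-reference on SQLRepeatRecord (field names illustrative):

```python
def soft_migrate(couch_record):
    # Copy the Couch document to SQL the first time it comes due after
    # the switch; keep the Couch document so rollback can reverse this.
    sql_record, created = SQLRepeatRecord.objects.get_or_create(
        couch_id=couch_record._id,  # assumed back-reference field
        defaults={
            'domain': couch_record.domain,
            'repeater_id': couch_record.repeater_id,
            'payload_id': couch_record.payload_id,
            'registered_at': couch_record.registered_on,
        },
    )
    if created:
        process_repeater.delay(sql_record.repeater_id)  # the new task
    return sql_record
```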

Repeat Records Report

The Repeat Records Report would be subclassed for SQL repeat records. If a domain has SQL repeat records, the SQLRepeatRecordReport would be used.
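
Dispatch could be as simple as the following, with class names assumed for illustration:

```python
def get_repeat_records_report_class(domain):
    # Serve the SQL-backed report as soon as a domain has any SQL
    # repeat records; fall back to the legacy Couch-backed report.
    if SQLRepeatRecord.objects.filter(domain=domain).exists():
        return SQLRepeatRecordReport
    return RepeatRecordReport
```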

Datadog

Couch and SQL workflows would feed the same metrics.
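
A sketch of how both workflows could emit the same metric, using the datadog client's DogStatsd interface (metric and tag names illustrative):

```python
from datadog import statsd

def report_attempt(domain, succeeded):
    # Same metric name for Couch and SQL records, so existing
    # dashboards and monitors keep working across the migration.
    statsd.increment(
        'commcare.repeaters.attempt',
        tags=[f'domain:{domain}', f'success:{succeeded}'],
    )
```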

Rollback command

Rollback would be a two-step process:

  1. Revert the PR that creates SQLRepeatRecord instances and changes the check_repeaters() task.
  2. Use a management command to reverse the migration of Couch repeat records, and create Couch records for new SQL records.
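
Step 2 could be a standard Django management command along these lines (helper names assumed):

```python
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'Reverse the Couch-to-SQL repeat record migration.'

    def handle(self, *args, **options):
        for sql_record in SQLRepeatRecord.objects.iterator():
            if sql_record.couch_id:
                # Soft-migrated record: its Couch document still
                # exists, so just make the Couch copy current again.
                restore_couch_record(sql_record)  # assumed helper
            else:
                # Record created after the switch: create a matching
                # Couch RepeatRecord for it.
                create_couch_record(sql_record)  # assumed helper
```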

Impact on users

The change would not interrupt the delivery of repeat records.

Old repeat records would not appear in the Repeat Records Report until they were migrated to SQL.

Impact on hosting

The process_repeater() task would use the same workers as the process_repeat_record() task does, so there would be no impact on hosting.

Deploy process

Deploy would involve 2 PRs:

  1. Models, tasks, the SQLRepeatRecordReport, and the rollback management command. This PR could be deployed early, as it would not change functionality.
  2. A "switch" PR that would start creating SQLRepeatRecord instances instead of Couch RepeatRecord instances, and soft-migrate Couch repeat records.

Backwards compatibility

When HQ switches to the new SQL workflow, support for Couch repeat records would remain in place.

Release timeline

The code changes should not conflict with ongoing FHIR work, but would definitely benefit that work if they were in place before the FHIR work is released. Ideally these changes should be deployed before that work, which is due to complete its first deliverables in Q1 2021.

Open questions and issues

(I will create a follow-up CEP on the topic of Repeat Record TTL.)

snopoke commented 4 years ago

Thanks for putting this together Norman. A few questions / clarifications:

kaapstorm commented 4 years ago

Hey @snopoke. Excellent questions!

snopoke commented 4 years ago

> I'm pretty sure we never delete repeat records / attempts. As you know, we delete RequestLogs after 90 days (corehq.motech.repeaters.tasks.clean_logs()). I will check with the AEs, who may use the Repeat Records Report more often than I do, but 90 days seems like a reasonable amount of time to keep succeeded and cancelled repeat records / attempts too. I'd be curious to hear thoughts on whether we should delete repeat records for paused repeaters.

I don't see a lot of value in keeping repeat records and their attempts forever. Their main purpose is to provide visibility into any issue and into what data is being sent. I think applying a similar window to repeat record attempts would be a good change. This would also answer the question of what to do with records from cancelled or paused repeaters. If a repeater is paused, I think we should continue to create repeat records, with the understanding that the same TTL window would apply to them. This would give a project 90 days (or whatever the window is) to fix the repeater before they start 'losing' data.

Cancelled repeaters should not create new records. (I'm not quite sure what the difference is between paused and cancelled.)

kaapstorm commented 4 years ago

The feedback from AEs is that we should delete repeat records, but that we should keep repeat records that have not been delivered successfully for longer than 90 days (because sometimes it has taken longer than 90 days for a project to notice that their data has not been forwarded). They suggested 180 days.

Here is how I understand the suggestion:

| RepeatRecord state | TTL |
| --- | --- |
| pending (due to be resent) | registered_on + 180 days |
| failed (not yet due to be resent) | registered_on + 180 days |
| cancelled (failed too many times) | registered_on + 180 days |
| succeeded | last_checked + 90 days (to match RequestLog TTL) |
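
Expressed as code, the rule might look like this (state names follow the table above; field names illustrative):

```python
from datetime import timedelta

def expires_on(record):
    if record.state == 'succeeded':
        # Match the RequestLog TTL.
        return record.last_checked + timedelta(days=90)
    # pending, failed and cancelled records all get 180 days.
    return record.registered_on + timedelta(days=180)
```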

Repeaters only have two states: paused and active/unpaused. RepeatRecords are created for them in both states, but payloads are only forwarded when the Repeater is unpaused.

We considered creating a cancelled or stopped state for repeaters, in which RepeatRecords would not be created for them, but now that Repeaters use ConnectionSettings to store their credentials, a user could achieve almost the same result by having a ConnectionSettings instance without a Repeater instance that uses it. If we no longer iterate the RepeatRecords of paused Repeaters, and RepeatRecords get deleted after a while, that would also reduce the benefit of creating a new state. So at the moment I'm leaning towards leaving Repeater states as they are.

snopoke commented 4 years ago

> because sometimes it has taken longer than 90 days for a project to notice that their data has not been forwarded

This statement seems like something we should aim to fix rather than just giving more leeway. Having said that, I think the 180-day proposal is still better than the current situation, but if we document this anywhere we should say 90 days, to give us room to drop from 180 in the future.

kaapstorm commented 4 years ago

Totally agree. I think the "Addresses to send notifications" field should be required.

kaapstorm commented 3 years ago

@snopoke I've updated the description to focus on repeat record models, where I think our easiest wins will be. I think conversations about Repeat Record TTL are useful and important, but it might be better to spawn a separate CEP/issue/conversation for that. I'm hoping we can make progress on part of this if this CEP narrows its focus a bit.