dimagi / commcare-hq

CommCareHQ is the server backend for CommCare, the world's largest platform for designing, managing, and deploying robust, offline-first mobile applications to frontline workers worldwide.
https://www.dimagi.com/open-source/
BSD 3-Clause "New" or "Revised" License

[CEP] Migrate RepeaterRecord to SQL #28314

Open kaapstorm opened 4 years ago

kaapstorm commented 4 years ago

Abstract

Using SQL for repeat records offers significant performance improvements for Data Forwarding.

Motivation

More projects are using Data Forwarding, and the volume of payloads being sent is increasing. This increases the negative impact of current inefficiencies.

HQ is becoming more vulnerable to the effects of downtime in remote systems. As our integration offerings around FHIR, COVID-19 and vaccinations expand, these risks compound.

Specification (updated 2020-12-10)

Current state

There are two execution paths to forwarding a repeat record payload:

  1. A Django signal (e.g. on form submission, or app change) triggers a repeat record to be registered, and RepeatRecord.attempt_forward_now() is called.

  2. The check_repeaters() scheduled task iterates repeat record instances, and calls RepeatRecord.attempt_forward_now().

A series of calls, spanning tasks and the RepeatRecord and Repeater models, send the payload and handle the response or failure.
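
To make the two paths concrete, here is a minimal sketch of the current flow. The signatures and helper names (`process_repeat_record`, `iter_due_repeat_records`) are illustrative assumptions, not HQ's exact code:

```python
# A minimal sketch of the two execution paths described above.
from celery import shared_task

@shared_task
def process_repeat_record(repeat_record_id):
    record = RepeatRecord.get(repeat_record_id)  # Couch document lookup
    record.fire()  # send the payload; handle the response or failure

def attempt_forward_now(record):
    # Path 1: called from a Django signal handler (form submission,
    # app change) right after the repeat record is registered.
    process_repeat_record.delay(record.record_id)

@shared_task
def check_repeaters():
    # Path 2: scheduled task; iterates every repeat record due for a
    # retry, with no knowledge of the state of its Repeater.
    for record in iter_due_repeat_records():  # assumed helper
        attempt_forward_now(record)
```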

Problems with current state

There are three low-hanging problems with this approach; the first two relate to Couch, and the third to hindsight:

  1. RepeatRecord instances are iterated independently of their Repeater: we iterate the repeat records of paused repeaters. In the past we managed to improve performance noticeably just by delaying the next forwarding attempt for the repeat records of paused repeaters by a week instead of a day. With SQL relationships, that workaround would be unnecessary: we could simply filter out the repeat records of paused repeaters. (And when a repeater is unpaused, we could forward its repeat records immediately, aligning behaviour with user expectations.) See the sketch after this list.

  2. RepeatRecord instances are forwarded independently of their peers: e.g. if an attempt to forward a new form submission fails, its repeat record is scheduled to be retried an hour later (then three hours, nine hours, etc.). If a second form is submitted a moment after the first, we try to forward it immediately, unaware that an attempt to the same remote service failed a moment ago. It will probably fail too. And if it succeeds, that payload will have been forwarded out of sequence. Again, SQL relationships make it trivial to associate repeat records with each other:

    1. We can batch forwarding attempts at consistent intervals.

    2. And we can always send payloads in the sequence in which they were registered.

  3. If we refactor RepeatRecord.fire() from a method to a function, and pull in some of the code that is spread across Repeater and RepeatRecord, we can simplify the code for sending a payload quite a bit.
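
To make points 1 and 2 concrete, here is a sketch of the queries SQL relationships would allow, assuming hypothetical Django models `SQLRepeater` (with an `is_paused` flag and a `repeat_records` related name) and `SQLRepeatRecord` (with `repeater`, `registered_at` and `next_check` fields):

```python
# Hypothetical Django models and field names, for illustration only.
from django.utils import timezone

def iter_unpaused_repeater_ids():
    # Problem 1: skip the repeat records of paused repeaters entirely.
    return (SQLRepeater.objects
            .filter(is_paused=False,
                    repeat_records__next_check__lte=timezone.now())
            .distinct()
            .values_list('id', flat=True))

def due_records_in_sequence(repeater_id, batch_size=1000):
    # Problem 2: forward a repeater's records together, in the order
    # they were registered, one batch per interval.
    return (SQLRepeatRecord.objects
            .filter(repeater_id=repeater_id,
                    next_check__lte=timezone.now())
            .order_by('registered_at')[:batch_size])
```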

Proposed changes

New models and tasks
Old repeat records

Couch code remains in place for existing repeat records. Couch repeat records are soft-migrated to SQL as they become due for retrying, and are forwarded using the new process_repeater() task.

After seven days (MAX_RETRY_WAIT), the only repeat records not yet migrated will be old succeeded or cancelled records. These can be migrated using a management command, ignored, or deleted.
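
A sketch of what the soft-migration step might look like, assuming a `couch_id` back-reference on SQLRepeatRecord (field names illustrative):

```python
def soft_migrate(couch_record):
    # Copy the Couch document to SQL the first time it comes due after
    # the switch; keep the Couch document so rollback can reverse this.
    sql_record, created = SQLRepeatRecord.objects.get_or_create(
        couch_id=couch_record._id,  # assumed back-reference field
        defaults={
            'domain': couch_record.domain,
            'repeater_id': couch_record.repeater_id,
            'payload_id': couch_record.payload_id,
            'registered_at': couch_record.registered_on,
        },
    )
    if created:
        process_repeater.delay(sql_record.repeater_id)  # the new task
    return sql_record
```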

Repeat Records Report

The Repeat Records Report would be subclassed for SQL repeat records. If a domain has SQL repeat records, the SQLRepeatRecordReport would be used.
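
Dispatch could be as simple as the following, with class names assumed for illustration:

```python
def get_repeat_records_report_class(domain):
    # Serve the SQL-backed report as soon as a domain has any SQL
    # repeat records; fall back to the legacy Couch-backed report.
    if SQLRepeatRecord.objects.filter(domain=domain).exists():
        return SQLRepeatRecordReport
    return RepeatRecordReport
```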

Datadog

Couch and SQL workflows would feed the same metrics.
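
A sketch of how both workflows could emit the same metric, using the datadog client's DogStatsd interface (metric and tag names illustrative):

```python
from datadog import statsd

def report_attempt(domain, succeeded):
    # Same metric name for Couch and SQL records, so existing
    # dashboards and monitors keep working across the migration.
    statsd.increment(
        'commcare.repeaters.attempt',
        tags=[f'domain:{domain}', f'success:{succeeded}'],
    )
```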

Rollback command

Rollback would be a two-step process:

  1. Revert the PR that creates SQLRepeatRecord instances and changes the check_repeaters() task.
  2. Use a management command to reverse the migration of Couch repeat records, and create Couch records for new SQL records.
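
Step 2 could be a standard Django management command along these lines (helper names assumed):

```python
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'Reverse the Couch-to-SQL repeat record migration.'

    def handle(self, *args, **options):
        for sql_record in SQLRepeatRecord.objects.iterator():
            if sql_record.couch_id:
                # Soft-migrated record: its Couch document still
                # exists, so just make the Couch copy current again.
                restore_couch_record(sql_record)  # assumed helper
            else:
                # Record created after the switch: create a matching
                # Couch RepeatRecord for it.
                create_couch_record(sql_record)  # assumed helper
```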

Impact on users

The change would not interrupt the delivery of repeat records.

Old repeat records would not appear in the Repeat Records Report until they were migrated to SQL.

Impact on hosting

The process_repeater() task would use the same workers as the process_repeat_record() task does, so there would be no impact on hosting.

Deploy process

Deploy would involve 2 PRs:

  1. Models, tasks, the SQLRepeatRecordReport, and the rollback management command. This PR could be deployed early, as it would not change functionality.
  2. A "switch" PR that would start creating SQLRepeatRecord instances instead of Couch RepeatRecord instances, and soft-migrate Couch repeat records.

Backwards compatibility

When HQ switches to the new SQL workflow, support for Couch repeat records would remain in place.

Release timeline

The code changes should not conflict with ongoing FHIR work, but would definitely benefit that work if they were in place before the FHIR work is released. Ideally these changes should be deployed before that work, which is due to complete its first deliverables in Q1 2021.

Open questions and issues

(I will create a follow-up CEP on the topic of Repeat Record TTL.)

snopoke commented 4 years ago

Thanks for putting this together Norman. A few questions / clarifications:

kaapstorm commented 4 years ago

Hey @snopoke. Excellent questions!

snopoke commented 4 years ago

> I'm pretty sure we never delete repeat records / attempts. As you know, we delete RequestLogs after 90 days (corehq.motech.repeaters.tasks.clean_logs()). I will check with the AEs, who may use the Repeat Records Report more often than I do, but 90 days seems like a reasonable amount of time to keep succeeded and cancelled repeat records / attempts too. I'd be curious to hear thoughts on whether we should delete repeat records for paused repeaters.

I don't see a lot of value in keeping repeat records and their attempts forever. Their main purpose is to provide visibility into any issue and into what data is being sent. I think applying a similar window to repeat record attempts would be a good change. This would also answer the question of what to do with records from cancelled or paused repeaters. If a repeater is paused, I think we should continue to create repeat records, with the understanding that the same TTL window would apply to them. This would give a project 90 days (or whatever the window is) to fix the repeater before they start 'losing' data.

Cancelled repeaters should not create new records. (I'm not quite sure what the difference is between paused and cancelled.)

kaapstorm commented 4 years ago

The feedback from AEs is that we should delete repeat records, but that we should keep repeat records that have not been delivered successfully for longer than 90 days (because sometimes it has taken longer than 90 days for a project to notice that their data has not been forwarded). They suggested 180 days.

Here is how I understand the suggestion:

| RepeatRecord state | TTL |
| --- | --- |
| pending (due to be resent) | registered_on + 180 days |
| failed (not yet due to be resent) | registered_on + 180 days |
| cancelled (failed too many times) | registered_on + 180 days |
| succeeded | last_checked + 90 days (to match RequestLog TTL) |
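
Expressed as code, the rule might look like this (state names follow the table above; field names illustrative):

```python
from datetime import timedelta

def expires_on(record):
    if record.state == 'succeeded':
        # Match the RequestLog TTL.
        return record.last_checked + timedelta(days=90)
    # pending, failed and cancelled records all get 180 days.
    return record.registered_on + timedelta(days=180)
```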

Repeaters only have two states: paused and active/unpaused. RepeatRecords are created for them in both states, but payloads are only forwarded when the Repeater is unpaused.

We considered creating a cancelled or stopped state for repeaters, in which RepeatRecords would not be created for them, but now that Repeaters use ConnectionSettings to store their credentials, a user could achieve almost the same result by having a ConnectionSettings instance without a Repeater instance that uses it. If we no longer iterate the RepeatRecords of paused Repeaters, and RepeatRecords get deleted after a while, that would also reduce the benefit of creating a new state. So at the moment I'm leaning towards leaving Repeater states as they are.

snopoke commented 4 years ago

> because sometimes it has taken longer than 90 days for a project to notice that their data has not been forwarded

This statement seems like something we should aim to fix rather than just giving more leeway. Having said that, I think the 180-day proposal is still better than the current situation, but if we document this anywhere we should say 90 days, to give us room to drop from 180 in the future.

kaapstorm commented 4 years ago

Totally agree. I think the "Addresses to send notifications" field should be required.

kaapstorm commented 3 years ago

@snopoke I've updated the description to focus on repeat record models, where I think our easiest wins will be. I think conversations about Repeat Record TTL are useful and important, but it might be better to spawn a separate CEP/issue/conversation for that. I'm hoping we can make progress on part of this if this CEP narrows its focus a bit.