gchq / sleeper

A cloud-native, serverless, scalable, cheap key-value store
Apache License 2.0

Avoid case where creation of CompactionJobStatus fails so no status updates are returned #251

Open kr565370 opened 1 year ago

kr565370 commented 1 year ago

There's a failure case where the job is put on the SQS queue but the subsequent call to save the job-created status update in DynamoDB fails. If that happens, any future status updates for that job will not be returned in any reports. It would be nice if we could avoid that state somehow.
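To make the failure case concrete, here is a minimal sketch using in-memory stand-ins rather than Sleeper's real classes. `FlakyStatusStore`, `create_job` and `report` are all hypothetical names, and the sketch assumes the report only lists jobs that have a "created" update, which is one plausible reason the later updates go missing.

```python
class FlakyStatusStore:
    """Hypothetical in-memory stand-in for the DynamoDB status store."""

    def __init__(self, fail_on_created=False):
        self.updates = []
        self.fail_on_created = fail_on_created

    def save(self, job_id, update_type):
        if update_type == "created" and self.fail_on_created:
            raise RuntimeError("simulated DynamoDB write failure")
        self.updates.append((job_id, update_type))


def create_job(job_id, queue, store):
    # Ordering described in the issue: queue first, status update second.
    queue.append(job_id)
    store.save(job_id, "created")


def report(store):
    # Assumption: only jobs with a "created" update appear in the report.
    created = [job for job, kind in store.updates if kind == "created"]
    return {
        job: [kind for j, kind in store.updates if j == job]
        for job in created
    }


queue = []
store = FlakyStatusStore(fail_on_created=True)
try:
    create_job("job-1", queue, store)
except RuntimeError:
    pass  # the job creator sees the failure, but the message is already sent

# The job is still on the queue, so a task runs it and saves further updates.
store.save(queue[0], "started")
store.save(queue[0], "finished")

orphaned_report = report(store)  # "job-1" is absent despite two saved updates
```

Under these assumptions the job runs to completion and its later updates are stored, but nothing ties them to a creation record, so the report omits the job entirely.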

patchwork01 commented 1 year ago

There are two sides to this. One is the behaviour of the compaction job creator and tasks when they fail to save a status update. The other is the behaviour of the reporting client when there are jobs in the system whose status updates aren't all in DynamoDB.

I suspect the ideal would be for the compaction job creator and tasks to behave as though creating or running each job happened in a single transaction. We could try to recreate the behaviour of a real transaction that could be rolled back, but that would be quite difficult due to the Two Generals' Problem. We could look at a saga pattern, or make the reporting status updates the real source of truth with an event-sourced model. With the latter option, we'd need to think about the case where the status update store is disabled.
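As a rough illustration of the saga option, the sketch below runs each step with a compensating action and undoes completed steps if a later one fails. Everything here is hypothetical (`run_saga`, `save_created`, `mark_creation_failed`, `send_to_queue`) and uses in-memory lists instead of DynamoDB and SQS. It also assumes the status update is saved before the message is sent, which is a change from the ordering described in the issue.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; undo completed steps on failure."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception:
        # Compensate in reverse order for the steps that did succeed.
        for compensation in reversed(completed):
            compensation()
        return False
    return True


status_updates = []  # stand-in for the status store
queue = []           # stand-in for the job queue


def save_created(job_id):
    status_updates.append((job_id, "created"))


def mark_creation_failed(job_id):
    # Compensation: record that the job never reached the queue.
    status_updates.append((job_id, "creation_failed"))


def send_to_queue(job_id, fail=False):
    if fail:
        raise RuntimeError("simulated SQS failure")
    queue.append(job_id)


def create_job(job_id, queue_fails=False):
    return run_saga([
        (lambda: save_created(job_id), lambda: mark_creation_failed(job_id)),
        (lambda: send_to_queue(job_id, fail=queue_fails), lambda: None),
    ])


ok = create_job("job-1")
failed = create_job("job-2", queue_fails=True)
```

With this ordering a job cannot be on the queue without a creation record. The sketch does not remove the underlying problem: a compensation can itself fail, and a send that times out may still have succeeded, so some inconsistent states remain possible.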

It seems a bit strange to complicate the system for the sake of reporting. One alternative would be to accept that the reporting could sometimes be incomplete, and ensure that the system doesn't fail if the reporting fails. We would need to ensure the reporting client can display states where the reporting data is invalid or incomplete. This would also have the significant disadvantage of making the reporting less accurate.
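A minimal sketch of that alternative, with hypothetical names throughout (`save_update_best_effort`, `build_report`) and plain tuples standing in for stored status updates: the writer logs and carries on when a save fails, and the report groups whatever updates exist by job ID and flags jobs with no "created" update as incomplete instead of dropping them.

```python
import logging
from collections import defaultdict

logger = logging.getLogger("status_reporting_sketch")


def save_update_best_effort(save, job_id, update_type):
    """Try to save a status update; log and continue if the store fails."""
    try:
        save(job_id, update_type)
        return True
    except Exception:
        logger.warning("Failed to save %s update for %s", update_type, job_id)
        return False


def build_report(updates):
    """Group updates by job and flag jobs that have no 'created' update."""
    by_job = defaultdict(list)
    for job_id, update_type in updates:
        by_job[job_id].append(update_type)
    return {
        job_id: {"updates": kinds, "complete": "created" in kinds}
        for job_id, kinds in by_job.items()
    }


saved = []


def flaky_save(job_id, update_type):
    # Simulated store that rejects only the creation update for job-2.
    if job_id == "job-2" and update_type == "created":
        raise RuntimeError("simulated DynamoDB write failure")
    saved.append((job_id, update_type))


for job in ("job-1", "job-2"):
    for kind in ("created", "started", "finished"):
        save_update_best_effort(flaky_save, job, kind)

tolerant_report = build_report(saved)
```

Here `job-2` still appears in the report with its "started" and "finished" updates, marked as incomplete, so the reader can see that data is missing.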