NHSDigital / software-engineering-quality-framework

🏎️ Shared best-practice guidance & tools to support software engineering teams

Expand on serverless architecture deployment strategies #293

Open walteck opened 1 year ago

walteck commented 1 year ago

@stefaniuk raised some interesting points about serverless architecture deployment strategies: https://github.com/NHSDigital/software-engineering-quality-framework/pull/291#discussion_r1137723837 It would be useful to explore these further.

regularfry commented 1 year ago

(Picking this up because it's worth expanding on the comment, on the assumption that these discussions haven't continued somewhere I haven't got access to)

The example is this:

...high-risk deployments are seen more often in the serverless architecture[...] This is due to a combination of [Function->Queue->Function->Persistence (not as a golden source, but distributed with an eventual consistency) ] * N in the workflow implementation. In some cases, it might not be viable or cost-effective to implement a safe rollback functionality that covers all possibilities - data changes in flight and the state of the records in the system. In such circumstances, the only feasible approach could be to make the data workflows idempotent, release more often and smaller changes to de-risk, and "move codebase forward" in case of a bug or an issue.

Let's say we've got a bunch of components at version 42 currently running in production, and we're following AWS guidelines about function aliases. If we release version 43 of all those components as PROD, we know there'll be data from version 42 kicking about in queues when those new components go live. That means version 43 has to be able to handle version 42 data correctly anyway. The easy way to do that is for version 43 of each component to include all the v42 code paths, which it knows to call because of schema versions on the data itself.
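
To make that concrete, here is a minimal sketch of a v43 component that still carries the v42 code path and picks one based on a schema version stamped on each message. Everything specific here is an assumption for illustration: the `schema_version` and `payload` keys, the message shapes, and the function names are hypothetical, not taken from any real system.

```python
import json


def _handle_v42(payload: dict) -> dict:
    # Hypothetical v42 shape: a single "name" field.
    return {"display_name": payload["name"]}


def _handle_v43(payload: dict) -> dict:
    # Hypothetical v43 shape: the name is split into two fields.
    return {"display_name": f'{payload["given_name"]} {payload["family_name"]}'}


# A v43 build ships both code paths so that in-flight v42 data still works.
HANDLERS = {42: _handle_v42, 43: _handle_v43}


def process_message(body: str) -> dict:
    """Dispatch a queued message to the code path matching its schema version."""
    message = json.loads(body)
    version = message.get("schema_version")
    handler = HANDLERS.get(version)
    if handler is None:
        # Fail loudly rather than guess; the queue's retry or dead-letter
        # policy can then decide what happens to the message.
        raise ValueError(f"unsupported schema_version: {version!r}")
    return handler(message["payload"])
```

A v42 message left in the queue (`{"schema_version": 42, "payload": {"name": "..."}}`) and a fresh v43 message both produce the same output shape, so downstream components don't need to know which version produced the data.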

For extra added fun, it's perfectly possible that some elements early in a chain will see PROD as v43, while later ones will see it as v42 - I don't know enough about how atomic or monotonic Lambda alias updates are to rule that out. Assuming eventual consistency there is the safer option. So we may well need to handle the situation where v42 components receive data produced by v43 components. And that means we need rules about backward and forward data schema compatibility within components that look a lot like how you'd handle database migrations for zero downtime.
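
One possible shape for the forward-compatibility half of those rules, borrowed from the expand/contract pattern used for zero-downtime database migrations: a newer version may only add optional fields, and an older component reads the fields it knows, ignores the rest, and passes unknown fields through untouched. The sketch below assumes exactly that rule; the field names and function names are hypothetical.

```python
# Fields the (hypothetical) v42 component understands and requires.
REQUIRED_FIELDS = ("record_id", "status")


def read_known_fields(message: dict) -> dict:
    """Extract only the fields this version understands, ignoring extras."""
    missing = [field for field in REQUIRED_FIELDS if field not in message]
    if missing:
        # A newer producer removed or renamed a required field: that breaks
        # the "additive changes only" rule, so refuse rather than guess.
        raise ValueError(f"missing required fields: {missing}")
    return {field: message[field] for field in REQUIRED_FIELDS}


def pass_downstream(message: dict, updates: dict) -> dict:
    """Apply this component's changes while preserving fields it doesn't know."""
    return {**message, **updates}
```

With this in place, a v42 component that receives a v43 message carrying an extra `priority` field can still do its work on `record_id` and `status`, and the `priority` field survives for any v43 component further down the chain.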

Does all that make sense, or am I off-base? Do we have concrete examples of this to pull from?