2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org

Handle multiple staging hubs per cluster in our CI #4018

Open yuvipanda opened 6 months ago

yuvipanda commented 6 months ago

@yuvipanda this also deviates from an (unwritten) policy that we tend to have a 1:1 mapping of clusters and staging hubs (2i2c GCP cluster being the exception but mostly for our demos, not serving communities). We should evolve that policy if multiple staging hubs per cluster is something we want to take on and codify it somewhere.

Originally posted by @sgibson91 in https://github.com/2i2c-org/infrastructure/issues/3984#issuecomment-2074471659

We already have a few places where there are multiple staging hubs per cluster:

  1. ucmerced and ucmerced-staging in the 2i2c shared cluster
  2. r-staging and r-prod, alongside staging and prod, in the utoronto cluster
  3. https://github.com/2i2c-org/infrastructure/issues/3984

I believe the current assumption is 'one staging hub per cluster'. Instead, we should move to a model where a hub can possibly have a staging hub associated with it, rather than a cluster having a staging hub associated with it.
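To make the proposed model concrete, here is a minimal sketch of what a per-hub association could look like. It assumes a hypothetical optional `staging_hub` key on each hub entry in `cluster.yaml` (not part of the current schema), with hubs that don't set it falling back to the cluster-wide `staging` hub if one exists:

```python
# Sketch only: `staging_hub` is a hypothetical key, not current schema.
from pathlib import Path

import yaml


def staging_for(cluster_yaml_path: str) -> dict[str, str | None]:
    """Map each prod hub name to the staging hub that gates it (if any)."""
    config = yaml.safe_load(Path(cluster_yaml_path).read_text())
    hubs = {h["name"]: h for h in config.get("hubs", [])}

    # Hypothetical convention: a hub is a staging hub if another hub points
    # to it via `staging_hub`, or if it is literally named "staging".
    staging_names = {h["staging_hub"] for h in hubs.values() if h.get("staging_hub")}
    if "staging" in hubs:
        staging_names.add("staging")

    default = "staging" if "staging" in hubs else None
    return {
        name: hub.get("staging_hub", default)
        for name, hub in hubs.items()
        if name not in staging_names
    }
```

Under this sketch, the utoronto cluster would resolve to `{"prod": "staging", "r-prod": "r-staging"}`, keeping today's cluster-wide behaviour as the default while letting individual hubs opt into their own staging hub.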

consideRatio commented 6 months ago

Instead, we should move to a model where a hub can possibly have a staging hub associated with it, rather than a cluster having a staging hub associated with it.

I've written a few things below to help share what I'm thinking about here, but I'm not ready to commit to an unplanned async follow-up discussion, since that's unplanned work I do very inefficiently async. Happy to chat sync about this though!

Undefined expectations for both engineering and community

This is the key feedback I have: what does it mean to have a staging hub for a community hub in a dedicated or shared cluster, as compared to having just "another hub" in the cluster? I'd like to see this more clearly defined.

Misc complexity thoughts

Continuous deployment complexity

The assumption for the cluster-wide staging hub is that it should block deployment of prod hubs if it fails. What is the expectation for r-staging though? Should an r-staging failure block the r-prod deploy? It doesn't currently. Supporting functionality like this in our current CD setup is very complicated, or impossible without compromising notably on total execution time.
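For reference, a rough sketch of what per-hub gating could mean in CI, under the same hypothetical `staging_hub` mapping as above (this is not how the deployer behaves today; `deploy_hub` is a made-up callable standing in for a real deploy step):

```python
# Sketch only: illustrates gating prod deploys on their linked staging hub.
def deploy_cluster(hub_names, staging_of, deploy_hub):
    """Deploy staging hubs first, then prod hubs whose staging gate succeeded.

    `deploy_hub` is a hypothetical callable returning True on success;
    `staging_of` maps each prod hub to its gating staging hub (or None).
    """
    staging_hubs = {s for s in staging_of.values() if s is not None}

    # Deploy every staging hub first and record whether it succeeded.
    results = {name: deploy_hub(name) for name in hub_names if name in staging_hubs}

    for name in hub_names:
        if name in staging_hubs:
            continue  # already deployed above
        gate = staging_of.get(name)
        if gate is not None and not results.get(gate, False):
            print(f"Skipping {name}: its staging hub {gate} failed")
            continue
        deploy_hub(name)
```

Note that any gating like this forces a staging-before-prod ordering per pair, which is exactly where the tension with total execution time in our current matrix-style CD comes from.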

Cloud cost complexity

My guess about shared billing for ucmerced-staging and ucmerced (GCP) in 2i2c's shared cluster is that the costs get combined into a single hub's cloud costs. This isn't causing unfairness for other communities, but if ucmerced + staging were on AWS, each hub adds pods that ultimately force additional core nodes and add cost, making an almost entirely idle hub an actual expense.

Having staging hubs for individual hubs also adds some administrative cost and complexity.

sgibson91 commented 6 months ago

My main concern about implementing this is around the CI/CD setup as well. Right now, if any staging hub (where there are multiple) fails, then no production hubs will be deployed. So if we go ahead with this, we need to figure out a sensible way to link a staging hub and a prod hub together, and also what that conceptually means.