edx / edx-arch-experiments

A plugin to include applications under development by the architecture team at edx
GNU Affero General Public License v3.0

edx-platform deployment pipeline/monitoring training #366

Open davidjoy opened 1 year ago

davidjoy commented 1 year ago

A/C

We want other squads to be able to help maintain and monitor the edx-platform deployment pipeline in order to reduce our on-call burden and upskill our colleagues.

This probably looks like:

robrap commented 1 year ago

Additional thoughts:

  1. Who determines if the pipeline is fast enough and would invest in speeding it up?
  2. Who determines if the pipeline is stable enough and who would invest in improving flakiness?
  3. How do we determine common issues? Ideally this would be automated, and would not rely on various teams following some manual process for every failure to catalog issues.
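One way the automated cataloging in point 3 might work is to collapse each failure message into a normalized signature and count how often each signature appears. The sketch below is only an illustration under assumptions: the function names `failure_signature` and `common_issues` and the sample messages are hypothetical, and nothing here reflects how the actual pipeline reports failures.

```python
import re
from collections import Counter


def failure_signature(message: str) -> str:
    """Collapse the variable parts of a failure message (hypothetical helper)."""
    sig = message.lower()
    # Replace hex values first so their digits are not caught by the next rule.
    sig = re.sub(r"0x[0-9a-f]+", "<hex>", sig)
    # Replace any remaining run of digits (durations, host numbers, ids).
    sig = re.sub(r"\d+", "<n>", sig)
    # Normalize whitespace so formatting differences do not split groups.
    return re.sub(r"\s+", " ", sig).strip()


def common_issues(messages, top=3):
    """Return the most frequent failure signatures with their counts."""
    return Counter(failure_signature(m) for m in messages).most_common(top)


# Made-up example messages, for illustration only.
sample = [
    "Timeout after 30s on host 12",
    "timeout after 45s  on host 7",
    "Connection reset",
]
report = common_issues(sample)
```

Because the grouping runs over collected messages, no team would need to follow a manual process per failure; the trade-off is that overly aggressive normalization can merge unrelated failures.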

Note that these issues are already bad with a single team and will become worse with split ownership, so it would be best to improve them for everyone before we split.

robrap commented 1 year ago

It's becoming clear to me that our current runbook has some deficiencies:

  1. Should the first step simply be to rerun any failed stage, given how many flaky issues we see?
  2. At what point do we ticket a flaky issue?
  3. How do we gather data on each flaky issue to know in what order they should be addressed? Which are happening the most in recent history?
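As a rough sketch of questions 2 and 3, recorded failures could be counted over a recent window, ranked by frequency, and flagged once they cross a ticketing threshold. Everything here is an assumption for illustration: the `rank_recent_flakes` name, the 14-day window, the threshold of 3, and the issue names are hypothetical and not taken from the runbook.

```python
from collections import Counter
from datetime import datetime, timedelta


def rank_recent_flakes(failures, now, window_days=14, ticket_threshold=3):
    """Rank flaky issues by how often they occurred in the recent window.

    failures: iterable of (issue_name, occurred_at) pairs.
    Returns (issue_name, count, should_ticket) tuples, most frequent first.
    """
    cutoff = now - timedelta(days=window_days)
    # Only count occurrences inside the window, so old flakiness ages out.
    counts = Counter(name for name, when in failures if when >= cutoff)
    return [(name, n, n >= ticket_threshold) for name, n in counts.most_common()]


# Made-up failure records, for illustration only.
now = datetime(2024, 1, 31)
failures = [
    ("db-migration-timeout", datetime(2024, 1, 30)),
    ("db-migration-timeout", datetime(2024, 1, 29)),
    ("db-migration-timeout", datetime(2024, 1, 25)),
    ("asset-build-oom", datetime(2024, 1, 28)),
    ("asset-build-oom", datetime(2023, 12, 1)),  # outside the window
]
ranking = rank_recent_flakes(failures, now)
```

The ranking gives an order in which to address flaky issues, and the threshold is one possible answer to when a flaky issue gets ticketed; both values would need tuning against real failure data.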

robrap commented 1 year ago

Waiting on feedback or a Parking Lot discussion around maintenance.

robrap commented 1 year ago

After discussion in Parking Lot, we wondered:

  1. Do we need to improve stability of the pipeline before even trying this?
  2. Is moving to ArgoCD a prerequisite, so that the pipeline becomes more like others and other teams need less pipeline-specific training?
  3. Is this really worth having split ownership from a 2U perspective?

I'm marking blocked for now, but we may potentially close this.

If we close this ticket, we should:

@jmbowman: Should this remain blocked or should we close it?

robrap commented 1 year ago

We are moving this to the backlog until edx-platform is containerized and we've moved on to ArgoCD; we can reassess then.