bcgov / digital-journeys

PSA Forms System
https://bcgov.github.io/digital-journeys/
Apache License 2.0

Patroni outages - the broken latest image version #1556

Closed iman-jamali-fw closed 9 months ago

iman-jamali-fw commented 9 months ago

As a user of the API, Camunda, and Keycloak relying on the Patroni-managed Postgres database, I want to resolve the issue with the latest Patroni image causing downtime in the components.

Acceptance Criteria:

  1. Find potential ways to prevent the issue from occurring again.
  2. Explore the possibility of locking Patroni to a stable and working version.
  3. Test the identified solution in a non-production environment to ensure compatibility with the existing components.
  4. Document the steps taken and the solution implemented for future reference.

warrenchristian1telus commented 9 months ago

I'm still not sure what caused the image to revert after the manual update last week. The root cause is that the provided Patroni image was changed and the new version is broken. Cluster upgrades triggered a refresh of the image, which carried the broken version into production without any review or approval.

The image in question is: patroni-postgres:12.4-latest

The new (working) image is hosted on Artifactory: artifacts.developer.gov.bc.ca/bcgov-docker-local/patroni-postgres:2.0.1-12.4-latest

The new image is also out of our control and could be changed at any time. To ensure full control over releases and production migrations, we will need to create and maintain our own Artifactory project that holds every image for every pod/service. This will include dev, test and prod versions so that we can keep our standard pipeline and approval process for reviewing and implementing version updates.
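
To illustrate why both tags above can change underneath us, here is a minimal sketch of a check that flags "floating" image references. The helper name `is_floating` and the rule it applies (no digest, and a tag that is missing or contains `latest`) are assumptions for illustration, not an existing tool in this project.

```python
def is_floating(image_ref: str) -> bool:
    """Return True if the reference could be repointed without notice.

    Assumed rule: a reference pinned by digest is stable; otherwise a
    missing tag (implicit "latest") or any tag containing "latest"
    is treated as floating.
    """
    if "@sha256:" in image_ref:
        return False
    # Only inspect the last path segment so a registry port
    # (e.g. host:5000/image) is not mistaken for a tag.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return True
    tag = last_segment.split(":", 1)[1]
    return "latest" in tag.lower()


old_image = "patroni-postgres:12.4-latest"
new_image = (
    "artifacts.developer.gov.bc.ca/bcgov-docker-local/"
    "patroni-postgres:2.0.1-12.4-latest"
)

for ref in (old_image, new_image):
    print(ref, "floating" if is_floating(ref) else "pinned")
```

Both references are reported as floating, which matches the concern that the replacement image can still change without review.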

This is 10 images x 3 environments, so 30 images will need to be created, tracked, updated and tested for vulnerabilities and acceptable functionality. This will likely need to be done at least twice per year per image. I recommend that we investigate a way we can automate this process where possible. Maybe integrate with GitHub, Snyk, Dependabot, etc. for version / vulnerability management.
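
A quick sketch of the tracking workload behind those numbers. The component names are placeholders (the thread only gives the count of 10), and two reviews per year is the stated minimum rather than a measured figure.

```python
from itertools import product

ENVIRONMENTS = ("dev", "test", "prod")
IMAGE_COUNT = 10          # from the estimate above
REVIEWS_PER_YEAR = 2      # "at least twice per year per image"

# Placeholder names; the real component list is not given in this thread.
components = [f"component-{i}" for i in range(1, IMAGE_COUNT + 1)]

# One entry per (component, environment) pair that has to be tracked.
tracking_matrix = list(product(components, ENVIRONMENTS))

images_to_track = len(tracking_matrix)
reviews_per_year = images_to_track * REVIEWS_PER_YEAR

print(images_to_track)    # 30
print(reviews_per_year)   # 60
```

At roughly 60 review cycles per year as a floor, manual tracking looks hard to sustain, which supports automating version and vulnerability checks.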

iman-jamali-fw commented 9 months ago

@warrenchristian1telus Totally agree that we should be auditing the images before they go to PROD.

Regarding the Patroni image, I would like to hear your thoughts on the best way to address it in the short term. I notice that the Patroni StatefulSets and pods are linked to the BC Gov hosted image (bcgov-docker-local/patroni-postgres:2.0.1-12.4-latest), but the Patroni image stream still references the problematic image (patroni-postgres:12.4-latest). This is likely why the Patroni pods, when rebuilt, are still based on the flawed image rather than the BC Gov hosted version.
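
The mismatch described here could be detected with a small check like the one below. It is a sketch that assumes the manifests have been exported as JSON (for example with `oc get ... -o json`) and loaded into dictionaries. The two sample manifests are trimmed, illustrative stand-ins built from the image names in this thread, not copies of the real cluster objects, and the function names are hypothetical.

```python
def statefulset_images(statefulset: dict) -> set[str]:
    """Collect container images from a StatefulSet manifest."""
    containers = statefulset["spec"]["template"]["spec"]["containers"]
    return {container["image"] for container in containers}


def imagestream_sources(imagestream: dict) -> set[str]:
    """Collect the source references an ImageStream's tags import from."""
    tags = imagestream.get("spec", {}).get("tags", [])
    return {tag["from"]["name"] for tag in tags if "from" in tag}


def stale_sources(statefulset: dict, imagestream: dict) -> set[str]:
    """Image stream sources that no StatefulSet container actually uses."""
    return imagestream_sources(imagestream) - statefulset_images(statefulset)


# Illustrative, trimmed manifests (assumed shape, real field names).
sample_statefulset = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "patroni",
         "image": "artifacts.developer.gov.bc.ca/bcgov-docker-local/"
                  "patroni-postgres:2.0.1-12.4-latest"},
    ]}}}
}
sample_imagestream = {
    "spec": {"tags": [
        {"name": "12.4-latest",
         "from": {"kind": "DockerImage",
                  "name": "patroni-postgres:12.4-latest"}},
    ]}
}

print(sorted(stale_sources(sample_statefulset, sample_imagestream)))
# ['patroni-postgres:12.4-latest']
```

A non-empty result means the image stream still points somewhere the running workload does not, which is the condition suspected of pulling the flawed image back in on rebuild.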

As a temporary solution, what if we rebuild the Patroni image using the BC Gov hosted image? This would ensure that we have control over an image that can be managed by BC Gov Artifactory.

iman-jamali-fw commented 9 months ago

Short-term: verify that the BC Gov hosted image is used whenever Patroni is rebuilt or restarted. Long-term: maintain a repository of all images.

iman-jamali-fw commented 9 months ago

Created two tickets as a result of this exploration:

  1. #1563: Creation of Custom Image Artifactory for All Components (longer-term plan)
  2. #1562: Troubleshoot and improve the defective Patroni image (short-term fix)