iman-jamali-fw closed 9 months ago
I'm still not sure what caused the image to revert after the manual update last week. The root cause is that the provided Patroni image was changed and broken. Cluster upgrades triggered a refresh of the image, which migrated the broken image to production without any review or approval.
The image in question is: patroni-postgres:12.4-latest
The new image (working) is hosted with Artifactory: artifacts.developer.gov.bc.ca/bcgov-docker-local/patroni-postgres:2.0.1-12.4-latest
The new image is also out of our control and could be changed at any time. To ensure full control over releases and production migrations, we will need to create and maintain our own Artifactory project, including every image for every pod/service. This will include dev, test and prod versions so that we can maintain our standard pipeline and approval process for reviewing and implementing version updates.
This is 10 images x 3 environments, so 30 images will need to be created, tracked, updated and tested for vulnerabilities and acceptable functionality. This will likely need to be done at least twice per year per image. I recommend that we investigate a way we can automate this process where possible. Maybe integrate with GitHub, Snyk, Dependabot, etc. for version / vulnerability management.
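To make the size of that tracking effort concrete, here is a minimal Python sketch of the image-by-environment matrix. The image names are placeholders (the thread does not list the 10 images), and the figure of two reviews per year is taken from the "at least twice per year" estimate above.

```python
from itertools import product

ENVIRONMENTS = ("dev", "test", "prod")
REVIEWS_PER_YEAR = 2  # lower bound from the estimate above


def build_tracking_matrix(image_names, environments=ENVIRONMENTS):
    """Return one entry per (image, environment) pair that must be tracked."""
    return [
        {"image": name, "environment": env}
        for name, env in product(image_names, environments)
    ]


# Placeholder names only; replace with the real list of 10 images.
image_names = [f"image-{i:02d}" for i in range(1, 11)]
matrix = build_tracking_matrix(image_names)

print(len(matrix))                     # 30 image/environment pairs
print(len(matrix) * REVIEWS_PER_YEAR)  # 60 review tasks per year, at minimum
```

A list like this could seed whatever automation is chosen (for example, one scheduled scan per entry), but that wiring is not shown here.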
@warrenchristian1telus Totally agree that we should be auditing the images before they go to PROD.
Regarding the Patroni image, I would like to hear your thoughts on the best way to address it in the short term. I notice that the Patroni StatefulSets and pods are linked to the BC Gov hosted image (bcgov-docker-local/patroni-postgres:2.0.1-12.4-latest), but the Patroni image stream still references the problematic image (patroni-postgres:12.4-latest). This is likely why the Patroni pods, when rebuilt, are still based on the flawed image rather than the BC Gov hosted version.
As a temporary solution, what if we rebuild the Patroni image using the BC Gov hosted image? This would ensure that we have control over an image that can be managed through the BC Gov Artifactory.
Short term: verify that when Patroni is rebuilt or restarted, the BC Gov hosted image is the one used. Long term: maintain a repository of all images.
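One way to do the short-term check is to compare the image each Patroni container is running against the BC Gov hosted reference. The sketch below is only an illustration: it assumes the pod list has been exported as JSON (for example with `oc get pods -o json`, filtered to the Patroni pods), and the pod names in the sample are hypothetical. It uses an exact string comparison, which may need adjusting if the cluster rewrites image references to digests.

```python
import json

# Full reference of the BC Gov hosted image named earlier in this thread.
EXPECTED_IMAGE = (
    "artifacts.developer.gov.bc.ca/bcgov-docker-local/"
    "patroni-postgres:2.0.1-12.4-latest"
)


def find_mismatched_pods(pod_list, expected_image=EXPECTED_IMAGE):
    """Return (pod name, image) pairs for containers not on expected_image."""
    mismatches = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        for container in pod["spec"]["containers"]:
            if container["image"] != expected_image:
                mismatches.append((name, container["image"]))
    return mismatches


# Hypothetical sample in the shape of a Kubernetes pod list.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "patroni-0"},
   "spec": {"containers": [{"image": "artifacts.developer.gov.bc.ca/bcgov-docker-local/patroni-postgres:2.0.1-12.4-latest"}]}},
  {"metadata": {"name": "patroni-1"},
   "spec": {"containers": [{"image": "patroni-postgres:12.4-latest"}]}}
]}
""")

for pod_name, image in find_mismatched_pods(sample):
    print(f"{pod_name} is running {image}")
```

An empty result after a rebuild or restart would suggest the pods picked up the hosted image; any entry points to a pod still based on the old one.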
As a user of the API, Camunda, and Keycloak relying on the Patroni-managed Postgres database, I want to resolve the issue with the latest Patroni image causing downtime in the components.
Acceptance Criteria: