What is happening?
There have been reports on the OCP 4 Silver cluster of connectivity issues affecting connections from other pods (e.g. backup containers, API pods) to Patroni database pods. No other types of pods appear to be affected.
The database pods themselves seem to be running fine, but other pods are unable to connect to them. This may result in the affected pods failing liveness checks or simply throwing Connection time out errors. Once the database pod is recycled - either manually or automatically after some time - the other pods are able to connect to it again.
Why is it happening?
The possible root cause of the issue has been traced to Aporeto, the Software Defined Network (SDN) solution running in the Silver cluster. Aporeto appears to occasionally remove the Processing Unit (PU) labels from the Patroni pods, which prevents the Network Security Policies (NSP) attached to the pods from being enforced; in keeping with the Zero-Trust Model principle, all communication to and from the pods is then blocked.
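If you suspect a pod in your namespace is affected, one quick way to confirm that traffic is being blocked at the network level (rather than the database itself being down) is to attempt a plain TCP connection to the Patroni service from one of the pods that normally connects to it (for example via oc rsh). The sketch below is a minimal example only; the service name patroni-master and port 5432 are placeholders for whatever your own setup uses.

    #!/usr/bin/env python3
    # Minimal TCP reachability check for the Patroni service.
    # The host and port below are placeholders; substitute the service
    # name and port used in your own namespace.
    import socket
    import sys

    HOST = "patroni-master"  # placeholder Patroni service name
    PORT = 5432              # default PostgreSQL port
    TIMEOUT = 5              # seconds before treating the connection as blocked

    try:
        with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
            print(f"OK: TCP connection to {HOST}:{PORT} succeeded")
            sys.exit(0)
    except OSError as err:
        # A timeout here while the Patroni pod still shows as Running matches
        # the symptom above: the database is healthy but traffic to it is blocked.
        print(f"FAILED: could not connect to {HOST}:{PORT}: {err}")
        sys.exit(1)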
What can I do?
We highly recommend setting up health checks (readiness probes) for all pods connecting to Patroni database pods, and making sure they check all dependencies required for a pod's healthy operation, including connections to other pods and external databases. If a health check detects a connection issue, delete and re-create the affected database pod. Also increase the number of replicas to make the app more resilient to the failure of a single database pod (a minimum of 3 replica pods is recommended).
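As an illustration of the readiness probe recommendation, the following sketch shows one way an application pod could verify that it can actually open a PostgreSQL connection to Patroni, for use as an exec-style readiness check. It assumes the psycopg2 driver is available in the image and reads connection details from the standard PGHOST/PGPORT/PGUSER/PGPASSWORD/PGDATABASE environment variables purely for illustration; adapt it to however your application is configured.

    #!/usr/bin/env python3
    # Readiness check sketch: exits 0 only if a real database connection succeeds.
    # Intended to be wired up as an exec readiness probe in pods that depend on
    # Patroni. Environment variable names and defaults here are illustrative.
    import os
    import sys

    import psycopg2  # assumes psycopg2 / psycopg2-binary is installed in the image


    def database_is_reachable() -> bool:
        try:
            conn = psycopg2.connect(
                host=os.environ.get("PGHOST", "patroni-master"),  # placeholder default
                port=int(os.environ.get("PGPORT", "5432")),
                dbname=os.environ.get("PGDATABASE", "postgres"),
                user=os.environ.get("PGUSER", "postgres"),
                password=os.environ.get("PGPASSWORD", ""),
                connect_timeout=5,  # fail fast instead of hanging on blocked traffic
            )
            conn.close()
            return True
        except psycopg2.OperationalError as err:
            print(f"readiness check failed: {err}", file=sys.stderr)
            return False


    if __name__ == "__main__":
        sys.exit(0 if database_is_reachable() else 1)

If this check starts failing while the Patroni pod itself still reports as Running, that is the cue to delete and re-create the affected database pod as described above.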
What can Platform Services do?
We are working with Aporeto on releasing a fix for the removed PU labels, and expect to roll it out in Silver before the end of the week of Jan 18.