Open dynajoe opened 1 year ago
Can this diff be guarded behind an env var to enable it? It seems like it's mainly useful for debugging.
We are also experiencing a similar issue on Ambassador 3.7.1.
Found the same problem with docker.io/emissaryingress/emissary:3.6.0
/ambassador $ ambassador --version
Ambassador 3.6.0
Ambassador Scout version 3.6.0
Ambassador Scout semver 3.6.0
~600 inbounds / linked pods
ambassador-74wg4
// Pod startup
time="2024-06-07 09:32:20.8985"msg="Started Ambassador (Version 3.6.0)"
[...]
// Various error logs
2024-06-07 10:09:02 diagd 3.6.0 [P29TAEW] ERROR: CACHE: IR MISMATCH
2024-06-07 11:44:48 diagd 3.6.0 [P29TAEW] ERROR: CACHE: IR MISMATCH
2024-06-07 12:03:26 diagd 3.6.0 [P29TAEW] ERROR: CACHE: IR MISMATCH
2024-06-07 13:19:11 diagd 3.6.0 [P29TAEW] ERROR: CACHE: IR MISMATCH
[...]
// 10 minutes after the last error, restart by liveness
time="2024-06-07 13:29:07.1998" level=error msg="shut down with error error: received signal terminated
Maybe related? https://github.com/emissary-ingress/emissary/issues/5279 https://github.com/emissary-ingress/emissary/issues/4744 @AliceProxy
Hi, this problem is occurring in one of my clusters, too.
I'm using Emissary v3.6.0, in a cluster where:
ambassador_reconfiguration_time_seconds = 100sec
ambassador_econf_time_seconds = 20sec
ambassador_ir_time_seconds = 4sec
Especially when there are frequent configuration changes, I can see the error "ERROR: CACHE: GO MISMATCH", and 10 minutes later the Pod is restarted by the kubelet due to failed liveness/readiness probes.
If I have understood the problem correctly...
(1) What causes the IR MISMATCH error is not the subject of this ticket.
For example, an IR MISMATCH case is covered in issue https://github.com/emissary-ingress/emissary/issues/5279.
(2) When the IR MISMATCH error occurs and the configuration is very large, processing that error can be very expensive. If it takes more than 10 minutes (as @dynajoe explains), the diagd process will report liveness/readiness failures.
BUT... what work is being done when an IR MISMATCH error occurs? (diagd.py/check_cache function)
(I am not a Python programmer)
As far as I can see, nothing "functional" is being done: just calculating the differences between the two IR versions, in order to dump these differences in a file for possible later forensic diagnostics.
The difflib/context_diff Python library is used to calculate the differences.
I have used that library to calculate differences between large IR.json example files, and I obtain these time measures:
These are just a couple of examples! (I guess the times depend not only on the size but also on the content; I do not intend to draw conclusions about the computational cost of the function.)
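A measurement like the one described above could be reproduced roughly as follows. This is only a sketch: the documents are synthetic JSON built by a hypothetical make_ir helper (not real Emissary IR dumps), and timed_context_diff is a name made up here. Only difflib.context_diff itself comes from the Python standard library.

```python
import difflib
import json
import time


def make_ir(n_mappings: int, changed: int = -1) -> str:
    """Build a synthetic, pretty-printed JSON document (NOT a real Emissary IR)."""
    mappings = [
        {"name": f"mapping-{i}", "prefix": f"/svc/{i}/", "service": f"svc-{i}:80"}
        for i in range(n_mappings)
    ]
    # Optionally alter one mapping so the two documents differ.
    if 0 <= changed < n_mappings:
        mappings[changed]["service"] = "svc-changed:80"
    return json.dumps({"mappings": mappings}, indent=2, sort_keys=True)


def timed_context_diff(old: str, new: str):
    """Return (diff lines, wall-clock seconds spent computing them)."""
    start = time.perf_counter()
    lines = list(
        difflib.context_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="cached",
            tofile="fresh",
            lineterm="",
        )
    )
    return lines, time.perf_counter() - start


if __name__ == "__main__":
    for n in (500, 1500):
        old, new = make_ir(n), make_ir(n, changed=n // 2)
        diff, seconds = timed_context_diff(old, new)
        print(f"{n} mappings: {len(diff)} diff lines in {seconds:.3f}s")
```

Real IR files are far larger and less regular than this synthetic input, so timings from this sketch say little about production behavior; it only shows how to take the measurement.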
Since the comparison (I understand) is only used to generate information for diagnostic purposes, wouldn't it be convenient to enable it based on certain configuration parameters?
For example: only if AMBASSADOR_SNAPSHOT_COUNT>0, or if logLevel=debug.
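A minimal sketch of what such a guard might look like, assuming the diff is computed from two JSON strings. This is not the actual diagd.py code: check_mismatch and diff_enabled are hypothetical helpers, AMBASSADOR_CACHE_DIFF is a flag name invented for illustration, and treating an unset AMBASSADOR_SNAPSHOT_COUNT as 0 is an assumption that may not match Emissary's real default.

```python
import difflib
import os
from typing import Mapping, Optional, Tuple


def diff_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Decide whether the diagnostic diff should be computed.

    ASSUMPTION: AMBASSADOR_CACHE_DIFF is a hypothetical opt-in flag.
    ASSUMPTION: an unset AMBASSADOR_SNAPSHOT_COUNT is treated as 0 here.
    """
    if env.get("AMBASSADOR_CACHE_DIFF", "").lower() in ("1", "true", "yes"):
        return True
    try:
        return int(env.get("AMBASSADOR_SNAPSHOT_COUNT", "0")) > 0
    except ValueError:
        return False


def check_mismatch(
    old: str, new: str, env: Mapping[str, str] = os.environ
) -> Tuple[bool, Optional[str]]:
    """Return (mismatch detected, diff text or None when the diff is skipped)."""
    if old == new:
        return False, None
    # The mismatch is still reported; only the expensive diff is optional.
    if not diff_enabled(env):
        return True, None
    diff = difflib.context_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile="cached",
        tofile="fresh",
        lineterm="",
    )
    return True, "\n".join(diff)
```

With this shape, the cheap equality check always runs and the error can still be logged, while the cost of producing the forensic diff is only paid when someone has opted in.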
Describe the bug
An IR mismatch for a large IR (on the order of hundreds of MB, seen up to 1 GB) causes diagd to hang and eventually fail readiness/liveness probes, causing a pod restart.
To Reproduce
Create a sizeable configuration: 2 listeners, 500 hosts, 1500 mappings.
Force reconfiguration by updating one of the mappings over and over.
Eventually, the Go process requesting snapshots will hang waiting for diagd to respond (it uses the default HTTP client with no timeout).
Once 10 minutes have elapsed, the Go process will begin reporting that diagd is unhealthy and fail probes. (The 10 minutes is the grace period for the diagd process to have updated the snapshot.)
Expected behavior
The snapshot processing queue should not be blocked on an IR mismatch.
Versions (please complete the following information):
Additional context
Here's a stack trace I gathered while event processing was blocked on diffing large JSON files.
This is the problematic code: https://github.com/emissary-ingress/emissary/blame/0e0bd6a5ec6cf639b2ee751086e9bddc37baf150/python/ambassador_diag/diagd.py#L549
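One way to keep the processing queue from blocking on the diff, sketched under assumptions: the diff is handed to a single background worker and the caller continues immediately. schedule_diff and _context_diff are hypothetical names, not functions in diagd.py, and this is not a proposed patch.

```python
import difflib
from concurrent.futures import Future, ThreadPoolExecutor

# A single worker: at most one diff runs at a time, later jobs wait in the
# executor's internal queue instead of in the snapshot-processing path.
_diff_pool = ThreadPoolExecutor(max_workers=1, thread_name_prefix="ir-diff")


def _context_diff(old: str, new: str) -> list:
    """Compute the context diff as a list of lines (runs in the worker thread)."""
    return list(
        difflib.context_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="cached",
            tofile="fresh",
            lineterm="",
        )
    )


def schedule_diff(old: str, new: str) -> Future:
    """Queue the diff and return at once so snapshot handling can continue."""
    return _diff_pool.submit(_context_diff, old, new)


if __name__ == "__main__":
    future = schedule_diff('{"a": 1}\n{"b": 2}', '{"a": 1}\n{"b": 3}')
    # The caller is free to keep processing events; the result is collected
    # later, for example in a done-callback that writes it to a file.
    future.add_done_callback(lambda f: print("\n".join(f.result())))
    _diff_pool.shutdown(wait=True)
```

Because difflib is pure Python, a worker thread still competes with the main thread for the interpreter lock, so this avoids a hard block but not the CPU cost. Running the diff in a separate process, or skipping it entirely unless explicitly enabled, would isolate it more fully.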