Readiness/Liveness probe failure after IR Mismatch

dynajoe commented 1 year ago

Describe the bug IR Mismatch for large IR (orders of 100's of MB seen up to 1GB) causes diagd to hang and eventually fail readiness/liveness probes causing pod restart.

To Reproduce Create a sizeable configuration: 2 listeners, 500 hosts, 1500 mappings Force reconfiguration by updating one of the mappings over and over. Eventually, the go process requesting snapshots will hang waiting for diagd to respond (it uses http default client with no timeout). Once 10 minutes have elapsed, the go process will begin reporting that diagd is unhealthy and fail probes. (The 10 minutes is the grace period for the diagd process to have updated snapshot)

Expected behavior The snapshot processing queue should not be blocked on an IR mismatch.

Versions (please complete the following information):

Ambassador: 2.1.2
Kubernetes environment: EKS 1.22

Additional context Here's a stack trace I gathered while event processing was blocked on diffing large json files.

This is the problematic code: https://github.com/emissary-ingress/emissary/blame/0e0bd6a5ec6cf639b2ee751086e9bddc37baf150/python/ambassador_diag/diagd.py#L549

<_MainThread(MainThread, started 140439936146760)>
  File "/usr/bin/diagd", line 33, in <module>
    sys.exit(load_entry_point('ambassador==0.0.0.dev0', 'console_scripts', 'diagd')())
  File "/usr/lib/python3.8/site-packages/ambassador-0.0.0.dev0-py3.8.egg/ambassador_diag/diagd.py", line 2174, in main
    clize.run(_main)
  File "/usr/lib/python3.8/site-packages/sigtools/modifiers.py", line 158, in __call__
    return self.func(*args, **kwargs)
  File "/usr/lib/python3.8/site-packages/clize/runner.py", line 363, in run
    ret = cli(*args)
  File "/usr/lib/python3.8/site-packages/clize/runner.py", line 220, in __call__
    return func(*posargs, **kwargs)
  File "/usr/lib/python3.8/site-packages/ambassador-0.0.0.dev0-py3.8.egg/ambassador_diag/diagd.py", line 2166, in _main
    StandaloneApplication(app, gunicorn_config).run()
  File "/usr/lib/python3.8/site-packages/gunicorn/app/base.py", line 72, in run
    Arbiter(self).run()
  File "/usr/lib/python3.8/site-packages/gunicorn/arbiter.py", line 202, in run
    self.manage_workers()
  File "/usr/lib/python3.8/site-packages/gunicorn/arbiter.py", line 551, in manage_workers
    self.spawn_workers()
  File "/usr/lib/python3.8/site-packages/gunicorn/arbiter.py", line 622, in spawn_workers
    self.spawn_worker()
  File "/usr/lib/python3.8/site-packages/gunicorn/arbiter.py", line 589, in spawn_worker
    worker.init_process()
  File "/usr/lib/python3.8/site-packages/gunicorn/workers/gthread.py", line 92, in init_process
    super().init_process()
  File "/usr/lib/python3.8/site-packages/gunicorn/workers/base.py", line 142, in init_process
    self.run()
  File "/usr/lib/python3.8/site-packages/gunicorn/workers/gthread.py", line 200, in run
    events = self.poller.select(1.0)
  File "/usr/lib/python3.8/selectors.py", line 468, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
<AmbassadorEventWatcher(AEW, started daemon 140439895325472)>
  File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/site-packages/ambassador-0.0.0.dev0-py3.8.egg/ambassador_diag/diagd.py", line 1379, in run
    self.app.check_timers()
  File "/usr/lib/python3.8/site-packages/ambassador-0.0.0.dev0-py3.8.egg/ambassador_diag/diagd.py", line 441, in check_timers
    result = self.check_cache()
  File "/usr/lib/python3.8/site-packages/ambassador-0.0.0.dev0-py3.8.egg/ambassador_diag/diagd.py", line 504, in check_cache
    errors += self.json_diff("IR", i1, i2)
  File "/usr/lib/python3.8/site-packages/ambassador-0.0.0.dev0-py3.8.egg/ambassador_diag/diagd.py", line 461, in json_diff
    for line in difflib.context_diff(l1, l2, fromfile=n1, tofile=n2):
  File "/usr/lib/python3.8/difflib.py", line 1257, in context_diff
    for group in SequenceMatcher(None,a,b).get_grouped_opcodes(n):
  File "/usr/lib/python3.8/difflib.py", line 597, in get_grouped_opcodes
    codes = self.get_opcodes()
  File "/usr/lib/python3.8/difflib.py", line 550, in get_opcodes
    for ai, bj, size in self.get_matching_blocks():
  File "/usr/lib/python3.8/difflib.py", line 479, in get_matching_blocks
    i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
  File "/usr/lib/python3.8/difflib.py", line 410, in find_longest_match
    k = newj2len[j] = j2lenget(j-1, 0) + 1
<PeriodicTrigger(Thread-2, started daemon 140439893203744)>
  File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/site-packages/ambassador-0.0.0.dev0-py3.8.egg/ambassador/utils.py", line 501, in run
    time.sleep(self.period)
<PeriodicTrigger(Thread-3, started daemon 140439892142880)>
  File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/site-packages/ambassador-0.0.0.dev0-py3.8.egg/ambassador/utils.py", line 502, in run
    self.onfired()
  File "/usr/lib/python3.8/site-packages/ambassador-0.0.0.dev0-py3.8.egg/ambassador_diag/diagd.py", line 408, in post_timer_event
    self.watcher.post("TIMER", None)
  File "/usr/lib/python3.8/site-packages/ambassador-0.0.0.dev0-py3.8.egg/ambassador_diag/diagd.py", line 1331, in post
    return rqueue.get()
  File "/usr/lib/python3.8/queue.py", line 170, in get
    self.not_empty.wait()
  File "/usr/lib/python3.8/threading.py", line 302, in wait
    waiter.acquire()
<Thread(ThreadPoolExecutor-0_0, started daemon 140439891082016)>
  File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 78, in _worker
    work_item = work_queue.get(block=True)
<Thread(ThreadPoolExecutor-0_1, started daemon 140439850437408)>
  File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 78, in _worker
    work_item = work_queue.get(block=True)
<PeriodicTrigger(Thread-6, started daemon 140439825029920)>
  File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/site-packages/ambassador-0.0.0.dev0-py3.8.egg/ambassador/utils.py", line 501, in run
    time.sleep(self.period)
<Thread(ThreadPoolExecutor-0_2, started daemon 140439812660000)>
  File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 80, in _worker
    work_item.run()
  File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.8/site-packages/gunicorn/workers/gthread.py", line 271, in handle
    keepalive = self.handle_request(req, conn)
  File "/usr/lib/python3.8/site-packages/gunicorn/workers/gthread.py", line 323, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 2091, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 2073, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/lib/python3.8/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/usr/lib/python3.8/site-packages/ambassador-0.0.0.dev0-py3.8.egg/ambassador_diag/diagd.py", line 640, in wrapper
    return f(*args, **kwds)
  File "/usr/lib/python3.8/site-packages/ambassador-0.0.0.dev0-py3.8.egg/ambassador_diag/diagd.py", line 863, in handle_watt_update
    status, info = app.watcher.post('CONFIG', ( 'watt', url ))
  File "/usr/lib/python3.8/site-packages/ambassador-0.0.0.dev0-py3.8.egg/ambassador_diag/diagd.py", line 1331, in post
    return rqueue.get()
  File "/usr/lib/python3.8/queue.py", line 170, in get
    self.not_empty.wait()
  File "/usr/lib/python3.8/threading.py", line 302, in wait
    waiter.acquire()
<Timer(Thread-108, started daemon 140438690839328)>
  File "/usr/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 1254, in run
    self.function(*self.args, **self.kwargs)
  File "/usr/lib/python3.8/site-packages/ambassador-0.0.0.dev0-py3.8.egg/ambassador_diag/diagd.py", line 80, in print_stack
    traceback.print_stack(sys._current_frames()[thread.ident])

dynajoe commented 1 year ago

Can this diff be guarded behind and env vat to enable? It seems like it’s mainly useful for debugging.

hibohra1398 commented 7 months ago

We are also experiencing similar issue on ambassador 3.7.1

jisnardo commented 1 month ago

Found same problem with docker.io/emissaryingress/emissary:3.6.0

/ambassador $ ambassador --version
Ambassador 3.6.0
Ambassador Scout version 3.6.0
Ambassador Scout semver  3.6.0

~600 inbounds / linked pods

ambassador-74wg4

// Pod startup
time="2024-06-07 09:32:20.8985"msg="Started Ambassador (Version 3.6.0)"
[...]
// Various error logs
2024-06-07 10:09:02 diagd 3.6.0 [P29TAEW] ERROR: CACHE: IR MISMATCH
2024-06-07 11:44:48 diagd 3.6.0 [P29TAEW] ERROR: CACHE: IR MISMATCH
2024-06-07 12:03:26 diagd 3.6.0 [P29TAEW] ERROR: CACHE: IR MISMATCH
2024-06-07 13:19:11 diagd 3.6.0 [P29TAEW] ERROR: CACHE: IR MISMATCH
[...]
// 10 minutes after the last error, restart by liveness
time="2024-06-07 13:29:07.1998" level=error msg="shut down with error error: received signal terminated

juanjoku commented 1 month ago

Hi, this problem is occurring in one of my clusters, too.

I'm ussing Emissary v3.6.0, in a cluster where:

There are ~600 hosts, ~1200 mappings
Some metrics have very high values. P.e.:
- ambassador_reconfiguration_time_seconds = 100sec
- ambassador_econf_time_seconds = 20sec
- ambassador_ir_time_seconds = 4sec
There are frequent configuration changes that affect Emissary (hosts creation/deletions, changes in mapping destinations)

Especially when there are frequent configuration changes, I can see the error "ERROR: CACHE: GO MISMATCH", and 10 minutes later the Pod is restarted by Kubelet due to a failed liveness/readiness probes.

juanjoku commented 3 weeks ago

If I have understood the problem correctly...

(1) What causes the IR MISMATCH error is not the subject of this ticket. For example, an IR MISMATCH case is covered in issue https://github.com/emissary-ingress/emissary/issues/5279.

(2) In case the IR MISMATCH error occurs, if the configuration is very large, then processing that error can be very expensive. If it takes more than 10min (as #dynajoe explains), the diagd process will report liveness/readiness failures.

BUT... what work is being done in case of IR MISMATCHerror? (diagd.py/check_cache function)

(I am not a Python programmer)

As far as I can see, nothing "functional" is being done: just calculating the differences between the two IR versions, in order to dump these differences in a file for possible later forensic diagnostics.

difflib/context_diff python library is used to calculate the differences. I have used that library to calculate differences between large IR.json example files, and I obtain these time measures:

Comparing 20MB files: ~2min
Comparing 40MB files: ~11min

These are just a couple of examples! (I guess the times depend not only on the size, but also on the content -- I do not intend to conclude about the computational cost of the function).

Since the comparison (I understand) is only used to generate information for diagnostic purposes, wouldn't it be convenient to enable it based on certain configuration parameters? For example: only if AMBASSADOR_SNAPSHOT_COUNT>0, or if logLevel=debug.

emissary-ingress / emissary

Readiness/Liveness probe failure after IR Mismatch #4490