lenkan opened this issue 2 months ago (status: Open)
Added to discussion for consideration for the top 10 vLEI issues: https://github.com/WebOfTrust/keri/discussions/77#discussioncomment-9188198
were you able to get logs in debug mode?
Not yet. Will start running in debug mode locally to see what I can get.
Provenant has now observed this in production, on an idle system.
Adding a voice to the mix: I've also seen this in production. I have a KERIA server that has been running at 60-80% CPU for around a month, even though there is very low usage.
From our dev meeting: the hio loops might be running at too high a frequency (tock setting), in a tight loop. Optimizing those loop configurations/timing could help. There may also be escrows that never time out if an event never resolves/completes. Logging has improved recently, making the logs cleaner.
I have a dev KERIA instance (3k+ agents, with traffic mostly /notifications polling). I took the EscrowEnd snippet from the Mark I agent and checked against all 3k+ agents, and no agent had anything in escrow. (I had to skip over one agent when locally testing, but on this environment skipped was always empty.)
I'll try to ensure that the code itself is checking the escrows correctly, since it was old code and I had to change some things.
In general it would be useful to know what our estimates and expectations are for how many concurrent KERIA agents or requests it should be able to handle, in case we want to serve a large number of users.
The hio loops might be running at too high a frequency (tock setting), in a tight loop. Optimizing those loop configurations/timing could help.
As far as I can tell, KERIA launches with no agents "active" and very low CPU usage. Agent-specific loops only start running after a new agent is created, or when a Signify client connects. So I guess if KERIA itself restarts, the loops will stop until Signify reconnects - that brings up a few questions for me, but I'll leave those for another time.
(This also means that my checking for escrows in 3k agents caused all 3k to be loaded, which would explain the usage being stuck at 100% now, whereas I saw it down at 15% before that.)
The agent DoDoer is running with a tock of 0.0 (the default), so it loops very quickly. Passing a tock value here improves the CPU situation quite a lot for me, but a tock of even 0.1 causes the Signify multisig-inception.test.ts run to go from ~10 seconds to ~25 seconds on average.
I think most of the agent Doers are looping against a local in-memory queue, but escrows are being checked in LMDB. So with a large number of concurrent agents there may be many checks hitting the disk if there isn't sufficient memory (probably even worse in Docker if there are memory limits - today I only checked against a fresh KERIA locally w/o Docker).
Maybe we could fine-tune the escrow checks then. I'm not that familiar with hio, but the escrow loop seems to be the non-generator variant, so I'm not sure yet how this could be done.
I'm speculating some things here due to lack of familiarity. @pfeairheller would this make sense? :)
Edit: the Oobiery reads from the DB on every loop too, so that's another place to possibly fine-tune.
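For context on the "non-generator variant": as I understand hio's scheduling, a Doer's recur can be either a plain method (called once per scheduler cycle, waiting self.tock between calls and stopping once it returns True) or a generator (the Doer runs its own loop and paces itself by what it yields). A minimal sketch of both patterns, with illustrative class names (this is not KERIA code):

from hio.base import doing

class MethodStyleDoer(doing.Doer):
    """Non-generator variant: recur() is invoked once per scheduler cycle."""
    def recur(self, tyme):
        # ... one sweep of work (e.g. check a queue or an escrow) ...
        return False  # not done; keep getting scheduled

class GeneratorStyleDoer(doing.Doer):
    """Generator variant: recur is a generator that controls its own loop."""
    def recur(self, tyme):
        while True:
            # ... one sweep of work ...
            yield self.tock  # wait roughly self.tock seconds before the next pass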
We have now seen this in production, too. Here is a nearly idle instance of Keria.
I noticed on my GCP deployment I had to allocate 2 entire CPUs to my KERIA instance to get it to not take 100% CPU. It leveled out at about 0.65 to 0.85 CPUs while idling, which is still unexpected. I would think that an idling HIO loop would not take this much CPU. The only thing that makes sense to me that could be occurring is that events are getting stuck in escrow loops and are being reprocessed, meaning saved to and retrieved from disk, using the CPU for serialization and deserialization repeatedly.
I've been wanting to use the GCP Python instrumentation to get a flame graph of what is eating up the CPU time.
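For reference, a minimal sketch of what that instrumentation could look like, assuming the google-cloud-profiler package (the service name and version below are placeholders; this is not something KERIA currently ships):

# Hypothetical: start Google Cloud Profiler early in KERIA process startup so
# CPU time shows up as a flame graph in the GCP console.
import googlecloudprofiler

try:
    googlecloudprofiler.start(
        service="keria",            # placeholder service name
        service_version="0.1.0",    # placeholder version
        verbose=3,                  # profiler agent log verbosity
    )
except (ValueError, NotImplementedError) as exc:
    print(f"profiler not started: {exc}")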
@rubelhassan, do you have bandwidth to dig into this, possibly using @kentbull's suggestion about how to diagnose?
I was able to reproduce the scenario locally by running the Signify integration tests. During idle time the CPU usage was around 50%. I ran KERIA in DEBUG mode and found this log printed repeatedly.
keri: Kevery unescrow attempt failed: Unverified end role reply. {'v': 'KERI10JSON000111_', 't': 'rpy', 'd': 'EIWFxJYdxvreQFdY04V7kCP9Wrm0ne6GZKjPO60CiGYH', 'dt': '2024-06-13T14:55:23.682000+00:00', 'r': '/end/role/add', 'a': {'cid': 'ELjRm-u0S8u3h_oJFjFilxJJqV68m1wIYiO1gZkIbNTs', 'role': 'agent', 'eid': 'EOhlELWBJVh-swQjbnzufD9Us2GgCcSCjCJ3ubONP7Lz'}}
Traceback (most recent call last):
File "/keria/keria/lib/python3.10/site-packages/keri/core/routing.py", line 481, in processEscrowReply
self.processReply(serder=serder, tsgs=tsgs)
File "/keria/keria/lib/python3.10/site-packages/keri/core/routing.py", line 199, in processReply
self.rtr.dispatch(serder=serder, saider=saider, cigars=cigars, tsgs=tsgs)
File "/keria/keria/lib/python3.10/site-packages/keri/core/routing.py", line 84, in dispatch
fn(serder=serder, saider=saider, route=r, cigars=cigars, tsgs=tsgs, **kwargs)
File "/keria/keria/lib/python3.10/site-packages/keri/core/eventing.py", line 3788, in processReplyEndRole
raise UnverifiedReplyError(f"Unverified end role reply. {serder.ked}")
keri.kering.UnverifiedReplyError: Unverified end role reply. {'v': 'KERI10JSON000111_', 't': 'rpy', 'd': 'EIWFxJYdxvreQFdY04V7kCP9Wrm0ne6GZKjPO60CiGYH', 'dt': '2024-06-13T14:55:23.682000+00:00', 'r': '/end/role/add', 'a': {'cid': 'ELjRm-u0S8u3h_oJFjFilxJJqV68m1wIYiO1gZkIbNTs', 'role': 'agent', 'eid': 'EOhlELWBJVh-swQjbnzufD9Us2GgCcSCjCJ3ubONP7Lz'}}
keri: Kevery process: skipped stale keystate sig datetime from ELjRm-u0S8u3h_oJFjFilxJJqV68m1wIYiO1gZkIbNTs on reply msg=
{
"v": "KERI10JSON000111_",
"t": "rpy",
"d": "EGtQVdVN-P4RHBvM9Umun9LEWG80l9E8piLIOEJKTHSf",
"dt": "2024-06-13T14:55:23.682000+00:00",
"r": "/end/role/add",
"a": {
"cid": "ELjRm-u0S8u3h_oJFjFilxJJqV68m1wIYiO1gZkIbNTs",
"role": "agent",
"eid": "EHyweFpbT26cXpGy4KHXlG8pWLmqf1sIRe2wBz4EP65F"
}
}
I launched keria under cProfile, and here are the cProfile stats (runtime: 326.685 seconds) for the important function calls:
ncalls tottime percall cumtime percall filename:lineno(function)
225074 0.130 0.000 0.251 0.000 agenting.py:197(witDo)
225074 0.135 0.000 0.199 0.000 agenting.py:441(msgDo)
22505 0.031 0.000 1.564 0.000 credentialing.py:115(processEscrows)
22504 0.055 0.000 0.621 0.000 credentialing.py:836(processEscrows)
22504 0.047 0.000 0.239 0.000 credentialing.py:845(processWitnessEscrow)
22504 0.026 0.000 0.160 0.000 credentialing.py:876(processMultisigEscrow)
22504 0.030 0.000 0.208 0.000 credentialing.py:974(processCredentialMissingSigEscrow)
180224 0.286 0.000 0.666 0.000 dbing.py:1520(getIoItemsNextIter)
723478 1.759 0.000 3.454 0.000 dbing.py:528(getTopItemIter)
225045 0.298 0.000 2.069 0.000 escrowing.py:65(processEscrowState)
45009 0.359 0.000 3.145 0.000 eventing.py:2036(processEscrows)
45009 0.071 0.000 0.261 0.000 eventing.py:2059(processEscrowOutOfOrders)
45009 0.097 0.000 0.456 0.000 eventing.py:2125(processEscrowAnchorless)
27622 0.738 0.000 34.691 0.001 eventing.py:3701(processReplyEndRole)
22505 0.150 0.000 1.905 0.000 eventing.py:4459(processEscrows)
22505 0.077 0.000 0.220 0.000 eventing.py:4483(processEscrowOutOfOrders)
22505 0.056 0.000 0.631 0.000 eventing.py:4619(processEscrowPartialSigs)
22505 0.051 0.000 0.138 0.000 eventing.py:4778(processEscrowPartialWigs)
111574 29.350 0.000 29.350 0.000 {method 'write' of '_io.TextIOWrapper' objects}
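For anyone who wants to reproduce a profile like this, here is a minimal sketch (the output path and sort keys are illustrative; this is not necessarily how the run above was captured):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
# ... let the KERIA agent loops run for a few minutes ...
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(30)  # top 30 entries by cumulative time
stats.dump_stats("keria.prof")                  # keep the raw data for later comparison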
There are three changes that need to be made to address this, one of which is in KERIpy and already done on the main branch.
Set self.tock to something other than 0. It represents the number of seconds that this Doer can wait between cycles, so setting self.tock = 30.0 would change it from running every cycle to running roughly every 30 seconds. This was an anticipated optimization that is required, but I never ran KERIA in production to analyze the behavior and determine good numbers for all the Doers. Every Doer defined here should have analysis run on it to determine the proper cycle frequency, balancing performance with user experience. It is important to remember that there are Doers like this for each agent in the system.
I can help with the KERIpy change in 3 if needed over the next few weeks, but honestly the most bang for your buck will be 1, which I already fixed, and 2, which someone running KERIA in production should fix.
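To make the mechanism concrete, here is a minimal sketch of the kind of change described above (the class name is illustrative, and 30.0 is just the example value from this comment, not a tuned number):

from hio.base import doing

class EscrowSweepDoer(doing.Doer):
    """Illustrative Doer whose recur() runs roughly every 30 seconds instead of
    on every scheduler cycle."""

    def __init__(self, **kwa):
        super(EscrowSweepDoer, self).__init__(**kwa)
        self.tock = 30.0  # seconds the scheduler waits between recur() calls

    def recur(self, tyme):
        # ... process escrows, drain queues, etc. ...
        return False      # not complete; stay scheduled

Finding the right number for each Doer is exactly the per-Doer analysis described above: long enough to stop burning CPU when idle, short enough that users don't notice added latency.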
@rubelhassan, please charge ahead with Phil's advice, consulting with @Arsh-Sandhu and @rodolfomiranda regarding data from our production experience.
@rubelhassan I was looking at this but had to drop to other work, and I will be going on vacation soon. Number 2 helped quite a bit for me, if you want to pick up the adjustments below.
I had to change the recur of the Escrower to the generator type in HIO instead of returning True:
while True:
    self.kvy.processEscrows()
    self.kvy.processEscrowDelegables()
    self.rgy.processEscrows()
    self.rvy.processEscrowReply()
    if self.tvy is not None:
        self.tvy.processEscrows()
    self.exc.processEscrow()
    self.vry.processEscrows()
    self.registrar.processEscrows()
    self.credentialer.processEscrows()
    yield self.tock
And passing a tock value to scoobiDo in keripy: https://github.com/cardano-foundation/keripy/commit/bb98bbe4e5597c300e9c682863fa21400ce7b163
The question is what tock value to set. It would be nice if we could pick these up from some config, so that production KERIA can have higher tock rates than, say, the Signify-TS integration tests.
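A minimal sketch of what that could look like, assuming environment variables as the config source (the variable name and defaults are made up for illustration):

import os

def tock_from_env(name: str, default: float) -> float:
    """Read a loop-pacing value (seconds) from the environment, falling back to
    a test-friendly default when unset or malformed."""
    try:
        return float(os.environ.get(name, default))
    except (TypeError, ValueError):
        return default

# e.g. KERIA_ESCROW_TOCK=30.0 in production, left unset (fast) for Signify-TS tests
ESCROW_TOCK = tock_from_env("KERIA_ESCROW_TOCK", 1.0)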
While adjusting the tock values seems to be a good way to address the runaway escrow message processing, it doesn't solve the core issue, which is the reprocessing of the unverified event.
The error you saw, @rubelhassan, is something I have seen as well:
keri: Kevery unescrow attempt failed: Unverified end role reply.
It has to do with the BADA-RUN logic for accepting the reply message that adds or removes a witness. I will open a bug in KERIpy to address this.
After running KERIA for some hours on my local machine, I notice it running at 60-100% CPU even when I am not doing any active work with it. This has been a problem since the start, but I have not been able to find a consistent reproduction.
Here is a link to discussion in discord: https://discord.com/channels/1148629222647148624/1148734044545228831/1214946723936338003
I could not find any related issue in the list, so I am creating this one for visibility, so that people can add their findings or "+1" if they are also experiencing this.