lenkan opened this issue 2 months ago (status: Open)
Added to discussion for consideration for the top 10 vLEI issues: https://github.com/WebOfTrust/keri/discussions/77#discussioncomment-9188198
were you able to get logs in debug mode?
Not yet. Will start running in debug mode locally to see what I can get.
Provenant has now observed this in production, on an idle system.
Adding a voice to the mix: I've also seen this in production. I have a KERIA server that has been running at 60-80% CPU for around a month, even though there is very low usage.
From our dev meeting: the hio loops might be running at too high a frequency (tock setting), in a tight loop. Optimizing those loop configurations/timing could help. There may also be escrows that never time out if an event never resolves/completes. Logging has improved recently, making the logs cleaner.
I have a dev KERIA instance (3k+ agents, with traffic mostly /notifications polling). I took the EscrowEnd snippet from the Mark I agent and checked against all 3k+ agents, and no agent had anything in escrow. (I had to skip over one agent when locally testing, but on this environment skipped was always empty.)
I'll try to ensure that the code itself is checking the escrows correctly, since it was old code and I had to change some things.
In general it would be useful to know what our estimates and expectations are for how many concurrent KERIA agents or requests it should be able to handle, in case we want to serve a large number of users.
The hio loops might be running at too high a frequency (tock setting), in a tight loop. Optimizing those loop configurations/timing could help.
As far as I can tell, KERIA launches with no agents "active" and very low CPU usage. Agent-specific loops only start running after a new agent is created, or when a Signify client connects. So I guess if KERIA itself restarts, the loops will stop until Signify reconnects - that brings up a few questions for me, but I'll leave those for another time.
(This also means that my checking for escrows in 3k agents caused all 3k to be loaded, which would explain the usage being stuck at 100% now, whereas I saw it down at 15% before that.)
The agent DoDoer is running with a tock of 0.0 (the default), so it loops very quickly. Passing a tock value here improves the CPU situation quite a lot for me, but a tock of even 0.1 causes the Signify multisig-inception.test.ts run to go from ~10 seconds to ~25 seconds on average.
I think most of the agent Doers are looping against a local in-memory queue, but escrows are being checked in LMDB. So with a large number of concurrent agents there may be many checks hitting the disk if there isn't sufficient memory (probably even worse in Docker if there are memory limits - today I only checked against a fresh KERIA locally w/o Docker).
Maybe we could fine-tune the escrow checks then. I'm not that familiar with hio, but the escrow loop seems to be the non-generator variant, so I'm not sure yet how this could be done.
I'm speculating some things here due to lack of familiarity. @pfeairheller would this make sense? :)
Edit: the Oobiery reads from the DB on every loop too, so that's another place to possibly fine-tune.
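For context on the "non-generator variant": as I understand hio's scheduling, a Doer's recur can be either a plain method (called once per scheduler cycle, waiting self.tock between calls and stopping once it returns True) or a generator (the Doer runs its own loop and paces itself by what it yields). A minimal sketch of both patterns, with illustrative class names (this is not KERIA code):

from hio.base import doing

class MethodStyleDoer(doing.Doer):
    """Non-generator variant: recur() is invoked once per scheduler cycle."""
    def recur(self, tyme):
        # ... one sweep of work (e.g. check a queue or an escrow) ...
        return False  # not done; keep getting scheduled

class GeneratorStyleDoer(doing.Doer):
    """Generator variant: recur is a generator that controls its own loop."""
    def recur(self, tyme):
        while True:
            # ... one sweep of work ...
            yield self.tock  # wait roughly self.tock seconds before the next pass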
We have now seen this in production, too. Here is a nearly idle instance of Keria.
I noticed on my GCP deployment I had to allocate 2 entire CPUs to my KERIA instance to get it to not take 100% CPU. It leveled out at about 0.65 to 0.85 CPUs while idling, which is still unexpected. I would think that an idling HIO loop would not take this much CPU. The only thing that makes sense to me that could be occurring is that events are getting stuck in escrow loops and are being reprocessed, meaning saved to and retrieved from disk, using the CPU for serialization and deserialization repeatedly.
I've been wanting to use the GCP Python instrumentation to get a flame graph of what is eating up the CPU time.
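For reference, a minimal sketch of what that instrumentation could look like, assuming the google-cloud-profiler package (the service name and version below are placeholders; this is not something KERIA currently ships):

# Hypothetical: start Google Cloud Profiler early in KERIA process startup so
# CPU time shows up as a flame graph in the GCP console.
import googlecloudprofiler

try:
    googlecloudprofiler.start(
        service="keria",            # placeholder service name
        service_version="0.1.0",    # placeholder version
        verbose=3,                  # profiler agent log verbosity
    )
except (ValueError, NotImplementedError) as exc:
    print(f"profiler not started: {exc}")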
@rubelhassan, do you have bandwidth to dig into this, possibly using @kentbull's suggestion about how to diagnose?
I was able to reproduce the scenario locally by running the Signify integration tests. During idle time the CPU usage was around 50%. I ran KERIA in DEBUG mode and found this log printed repeatedly.
keri: Kevery unescrow attempt failed: Unverified end role reply. {'v': 'KERI10JSON000111_', 't': 'rpy', 'd': 'EIWFxJYdxvreQFdY04V7kCP9Wrm0ne6GZKjPO60CiGYH', 'dt': '2024-06-13T14:55:23.682000+00:00', 'r': '/end/role/add', 'a': {'cid': 'ELjRm-u0S8u3h_oJFjFilxJJqV68m1wIYiO1gZkIbNTs', 'role': 'agent', 'eid': 'EOhlELWBJVh-swQjbnzufD9Us2GgCcSCjCJ3ubONP7Lz'}}
Traceback (most recent call last):
File "/keria/keria/lib/python3.10/site-packages/keri/core/routing.py", line 481, in processEscrowReply
self.processReply(serder=serder, tsgs=tsgs)
File "/keria/keria/lib/python3.10/site-packages/keri/core/routing.py", line 199, in processReply
self.rtr.dispatch(serder=serder, saider=saider, cigars=cigars, tsgs=tsgs)
File "/keria/keria/lib/python3.10/site-packages/keri/core/routing.py", line 84, in dispatch
fn(serder=serder, saider=saider, route=r, cigars=cigars, tsgs=tsgs, **kwargs)
File "/keria/keria/lib/python3.10/site-packages/keri/core/eventing.py", line 3788, in processReplyEndRole
raise UnverifiedReplyError(f"Unverified end role reply. {serder.ked}")
keri.kering.UnverifiedReplyError: Unverified end role reply. {'v': 'KERI10JSON000111_', 't': 'rpy', 'd': 'EIWFxJYdxvreQFdY04V7kCP9Wrm0ne6GZKjPO60CiGYH', 'dt': '2024-06-13T14:55:23.682000+00:00', 'r': '/end/role/add', 'a': {'cid': 'ELjRm-u0S8u3h_oJFjFilxJJqV68m1wIYiO1gZkIbNTs', 'role': 'agent', 'eid': 'EOhlELWBJVh-swQjbnzufD9Us2GgCcSCjCJ3ubONP7Lz'}}
keri: Kevery process: skipped stale keystate sig datetime from ELjRm-u0S8u3h_oJFjFilxJJqV68m1wIYiO1gZkIbNTs on reply msg=
{
"v": "KERI10JSON000111_",
"t": "rpy",
"d": "EGtQVdVN-P4RHBvM9Umun9LEWG80l9E8piLIOEJKTHSf",
"dt": "2024-06-13T14:55:23.682000+00:00",
"r": "/end/role/add",
"a": {
"cid": "ELjRm-u0S8u3h_oJFjFilxJJqV68m1wIYiO1gZkIbNTs",
"role": "agent",
"eid": "EHyweFpbT26cXpGy4KHXlG8pWLmqf1sIRe2wBz4EP65F"
}
}
I launched keria under cProfile, and here are the cProfile stats (runtime: 326.685 seconds) for the important function calls:
ncalls tottime percall cumtime percall filename:lineno(function)
225074 0.130 0.000 0.251 0.000 agenting.py:197(witDo)
225074 0.135 0.000 0.199 0.000 agenting.py:441(msgDo)
22505 0.031 0.000 1.564 0.000 credentialing.py:115(processEscrows)
22504 0.055 0.000 0.621 0.000 credentialing.py:836(processEscrows)
22504 0.047 0.000 0.239 0.000 credentialing.py:845(processWitnessEscrow)
22504 0.026 0.000 0.160 0.000 credentialing.py:876(processMultisigEscrow)
22504 0.030 0.000 0.208 0.000 credentialing.py:974(processCredentialMissingSigEscrow)
180224 0.286 0.000 0.666 0.000 dbing.py:1520(getIoItemsNextIter)
723478 1.759 0.000 3.454 0.000 dbing.py:528(getTopItemIter)
225045 0.298 0.000 2.069 0.000 escrowing.py:65(processEscrowState)
45009 0.359 0.000 3.145 0.000 eventing.py:2036(processEscrows)
45009 0.071 0.000 0.261 0.000 eventing.py:2059(processEscrowOutOfOrders)
45009 0.097 0.000 0.456 0.000 eventing.py:2125(processEscrowAnchorless)
27622 0.738 0.000 34.691 0.001 eventing.py:3701(processReplyEndRole)
22505 0.150 0.000 1.905 0.000 eventing.py:4459(processEscrows)
22505 0.077 0.000 0.220 0.000 eventing.py:4483(processEscrowOutOfOrders)
22505 0.056 0.000 0.631 0.000 eventing.py:4619(processEscrowPartialSigs)
22505 0.051 0.000 0.138 0.000 eventing.py:4778(processEscrowPartialWigs)
111574 29.350 0.000 29.350 0.000 {method 'write' of '_io.TextIOWrapper' objects}
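For anyone who wants to reproduce a profile like this, here is a minimal sketch (the output path and sort keys are illustrative; this is not necessarily how the run above was captured):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
# ... let the KERIA agent loops run for a few minutes ...
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(30)  # top 30 entries by cumulative time
stats.dump_stats("keria.prof")                  # keep the raw data for later comparison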
There are three changes that need to be made to address this, one of which is in KERIpy and already done on the main branch.
Set self.tock to something other than 0. It represents the number of seconds that this Doer can wait between cycles, so setting self.tock = 30.0 would change it from running every cycle to running roughly every 30 seconds. This was an anticipated optimization that is required, but I never ran KERIA in production to analyze the behavior and determine good numbers for all the Doers. Every Doer defined here should have analysis run on it to determine the proper cycle frequency, balancing performance with user experience. It is important to remember that there are Doers like this for each agent in the system.
I can help with the KERIpy change in 3 if needed over the next few weeks, but honestly the most bang for your buck will be 1, which I already fixed, and 2, which someone running KERIA in production should fix.
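To make the mechanism concrete, here is a minimal sketch of the kind of change described above (the class name is illustrative, and 30.0 is just the example value from this comment, not a tuned number):

from hio.base import doing

class EscrowSweepDoer(doing.Doer):
    """Illustrative Doer whose recur() runs roughly every 30 seconds instead of
    on every scheduler cycle."""

    def __init__(self, **kwa):
        super(EscrowSweepDoer, self).__init__(**kwa)
        self.tock = 30.0  # seconds the scheduler waits between recur() calls

    def recur(self, tyme):
        # ... process escrows, drain queues, etc. ...
        return False      # not complete; stay scheduled

Finding the right number for each Doer is exactly the per-Doer analysis described above: long enough to stop burning CPU when idle, short enough that users don't notice added latency.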
@rubelhassan, please charge ahead with Phil's advice, consulting with @Arsh-Sandhu and @rodolfomiranda regarding data from our production experience.
@rubelhassan I was looking at this but had to drop to other work, and I will be going on vacation soon. Number 2 helped quite a bit for me, if you want to pick up the adjustments below.
I had to change the recur of the Escrower to the generator type in HIO instead of returning True:
while True:
    self.kvy.processEscrows()
    self.kvy.processEscrowDelegables()
    self.rgy.processEscrows()
    self.rvy.processEscrowReply()
    if self.tvy is not None:
        self.tvy.processEscrows()
    self.exc.processEscrow()
    self.vry.processEscrows()
    self.registrar.processEscrows()
    self.credentialer.processEscrows()
    yield self.tock
And passing a tock value to scoobiDo in keripy: https://github.com/cardano-foundation/keripy/commit/bb98bbe4e5597c300e9c682863fa21400ce7b163
The question is what tock value to set. It would be nice if we could pick these up from some config, so that production KERIA can have higher tock rates than, say, the Signify-TS integration tests.
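A minimal sketch of what that could look like, assuming environment variables as the config source (the variable name and defaults are made up for illustration):

import os

def tock_from_env(name: str, default: float) -> float:
    """Read a loop-pacing value (seconds) from the environment, falling back to
    a test-friendly default when unset or malformed."""
    try:
        return float(os.environ.get(name, default))
    except (TypeError, ValueError):
        return default

# e.g. KERIA_ESCROW_TOCK=30.0 in production, left unset (fast) for Signify-TS tests
ESCROW_TOCK = tock_from_env("KERIA_ESCROW_TOCK", 1.0)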
While adjusting the tock values seems to be a good way to address the runaway escrow message processing, it doesn't solve the core issue, which is the reprocessing of the unverified event.
The error you saw, @rubelhassan, is something I have seen as well:
keri: Kevery unescrow attempt failed: Unverified end role reply.
It has to do with the BADA-RUN logic for accepting the reply message that adds or removes a witness. I will open a bug in KERIpy to address this.
After running KERIA for some hours on my local machine, I notice it running at 60-100% CPU even when I am not doing any active work with it. This has been a problem since the start, but I have not been able to find a consistent reproduction.
Here is a link to discussion in discord: https://discord.com/channels/1148629222647148624/1148734044545228831/1214946723936338003
I could not find any related issue in the list, so I am creating this one for visibility, so that people can add their findings or "+1" if they are also experiencing this.