Fascinating. You said that:

When I removed the couch_db:monitor(AuthDb) line and called gen_server:call(couch_auth_cache, reinit_cache, infinity). in a loop in the remsh afterwards for the sake of testing, it still created all those duplicated monitors.
I can't reproduce that part on master (d59d84ac78e5). If I remove the couch_db:monitor/1 invocation and call reinit_cache in a loop, I only see the one monitor for the file descriptor at any point in time.

If, however, I leave the couch_db:monitor/1 invocation in place, I do see monitors piling up against the couch_db_updater process when I call reinit in a loop. Is it possible that those monitors were the ones you saw during your debugging, and that the couch_file monitors you observed in production have a different root cause?

Taking a step back, it's not at all obvious to me that we still need couch_auth_cache as a running service, especially now that the backdoor HTTP port has been removed. I think the chttpd_auth_cache module is the one doing the actual user caching. I'd be curious to hear from others on the dev team, but I'm wondering if we can just shut this thing down. Reading through the code it's clear that this module hasn't kept up with the times.
@kocolosk, @rnewson would know for sure when it comes to the removal of 5986 and this, but is the auth all handled by chttpd, or is there a secondary check once it hits what's left of couch_httpd? If the former, I concur with you.
Which version of Erlang are you running? The reason I ask is that there was a major issue with the behaviour of the monitor/demonitor feature in 20.3.8.5 (see ERL-751).
Hello @kocolosk, and thank you for the fast reply!
If I remove the couch_db:monitor/1 invocation and call reinit_cache in a loop I only see the one monitor for the file descriptor at any point in time.
Unfortunately this is not the case for me. I'm not running the current master; however, I've compared the version of couch_auth_cache from our 2.3.0 fork with the current master and there are no differences. So here's the normal scenario when I've only just launched the CouchDB server:
You can see that there are currently two monitors active for the couch_auth_cache process: one for couch_file (<0.320.0>) and one for couch_db_updater (<0.319.0>). Now I repeatedly call reinit_cache, and both of those monitors are duplicated each time:
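For reference, a minimal remsh sketch along these lines shows the same thing (the registered name couch_auth_cache is the real one; the pids and counts will of course differ per node):

    %% Count the monitors currently held by couch_auth_cache and see what they point at.
    Pid = whereis(couch_auth_cache).
    {monitors, Monitors} = erlang:process_info(Pid, monitors).
    length(Monitors).
    %% Tally the monitors per target to make the duplicates obvious.
    lists:foldl(fun({_Type, Target}, Acc) ->
                        maps:update_with(Target, fun(N) -> N + 1 end, 1, Acc)
                end, #{}, Monitors).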
Now when I remove the couch_db:monitor/1 invocation and repeat the same procedure again, I can see that only the couch_file monitors are duplicated:

In fact this matches what I've seen on our production nodes, where there were only duplicated couch_file monitors (several million of them) but no duplicated couch_db_updater monitors.
Is it possible that those monitors were the ones you saw during your debugging, and that the couch_file monitors you observed in production have a different root cause?
I'm not fully excluding this possibility. However, one interesting correlation, as per my first post, is that the accumulation of these monitors and the subsequent slow pread_iolist calls correspond exactly to a spike in couchdb.auth_cache_misses. What else could trigger this if not a cold cache? Potentially I'm missing some changes between master and 2.3.0 which addressed this issue?
Taking a step back, it's not at all obvious to me that we still need couch_auth_cache as a running service, especially now that the backdoor HTTP port has been removed.
This is actually great news. If someone from the dev team can indeed confirm that I can shut this process down safely without any impact, that would be wonderful. I'm not so interested in fixing the root cause of this, especially taking into consideration that this is an outdated (legacy) module. I merely don't want this issue to return.
Hello @iilyak,

Some servers are running 21.3.8.8 and some 21.3.8.11, so that bug shouldn't be related.
Reviewing the code, it seemed to me that couch_auth_cache was safe to remove as a running service, so that’s what I did in #2498. I had to disable a few tests that were written to run against the “backdoor” port, but didn’t see any issues with the rest of the test suite. If you feel like cherry-picking f4356a32d37e0c26b1423026bd17fc0c94c6dcb5, I suspect you can do so safely.
Thanks! This looks good to me. I'll give it a try.
I will close this ticket because the linked PR will prevent my issue from happening again in the future, and I'm glad to hear that couch_auth_cache will be removed soon. Feel free to reopen if you have any additional insights or advice.
Description
At a certain time, on the 11th of January, we started to get a lot of alerts for the io_queue2.iowait.max metric, and the max IOWait time skyrocketed to almost 5 seconds in some cases (the drop after the 21st indicates a restart of CouchDB, which made the issue vanish). After some initial investigation in the remsh I discovered that the top process by reductions was the couch_file managing the /data/_users.couch DB, which is odd because this seems to be an empty database with only _design/auth in there, whilst we use a _users database partitioned into 8 shards under /data/shards/RANGE/_users.timestamp.couch.
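For what it's worth, no special tooling is needed to find the top processes by reductions; a rough remsh sketch along these lines is enough (an illustration, not the exact commands used):

    %% Sketch: list the top-N processes by accumulated reductions.
    %% Processes that die mid-scan are silently skipped.
    TopByReductions = fun(N) ->
        Procs = [{P, R} || P <- erlang:processes(),
                           {reductions, R} <- [erlang:process_info(P, reductions)]],
        lists:sublist(lists:reverse(lists:keysort(2, Procs)), N)
    end.
    TopByReductions(5).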
First I decided to log all requests for which the IOWait time here was more than 1 second. All of them turned out to be pread_iolist calls to _users.couch:
After profiling the IOQ2 code (we enabled ioq2.bypass) I discovered that the delay happens here, in this gen_server:call. This suggested that I should check the monitors of this particular couch_file process. This was not in vain, because there were a whopping 7 million of them! I grouped those by duplicates and found out that the majority of them belonged to couch_auth_cache. I do not have a screenshot for that particular case because the server was restarted; however, here's the output from another affected server, where there were 2.57 million of these duplicated monitors (couch_auth_cache):
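For illustration, the duplicates can be tallied with something like this (a sketch; CouchFilePid stands for the pid of the _users.couch couch_file process, which differs per node):

    %% Sketch: count how many times each process appears among the processes
    %% monitoring the couch_file pid; duplicates show up as counts > 1.
    {monitored_by, Watchers} = erlang:process_info(CouchFilePid, monitored_by).
    Counts = lists:foldl(fun(W, Acc) ->
                                 maps:update_with(W, fun(N) -> N + 1 end, 1, Acc)
                         end, #{}, Watchers).
    lists:reverse(lists:keysort(2, maps:to_list(Counts))).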
This might explain the gen_server:call slowness. From what I've discovered, monitor and demonitor are not entirely free in the VM because they use a red-black tree internally, and doing operations on a tree with 7+ million nodes can be somewhat slow and expensive. A quick source code check suggests that the issue might have its roots somewhere in couch_auth_cache:reinit_cache. When I removed the couch_db:monitor(AuthDb) line and called gen_server:call(couch_auth_cache, reinit_cache, infinity). in a loop in the remsh afterwards for the sake of testing, it still created all those duplicated monitors.
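(To get a feel for the cost, a crude micro-benchmark along these lines can be run on a throwaway node; the dummy process and the one-million-monitor figure are arbitrary, and the numbers will vary between machines and OTP versions:)

    %% Sketch: pile monitors onto a dummy process, then time one extra
    %% monitor/demonitor round-trip (result is {Microseconds, true}).
    Target = spawn(fun() -> receive stop -> ok end end).
    length([erlang:monitor(process, Target) || _ <- lists:seq(1, 1000000)]).
    timer:tc(fun() ->
                     Ref = erlang:monitor(process, Target),
                     erlang:demonitor(Ref, [flush])
             end).
    Target ! stop.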
In order to verify my claim I checked the couchdb.auth_cache_misses metric for the same server, because repeated reinits should result in a cold cache, and here we go:

As you might observe, this gargantuan spike (from 0-1 misses per minute to thousands of them) correlates with the io_queue2.iowait.max metric graph provided above. Perhaps the issue lies somewhere in the repeated calls to open_auth_db?

Any suggestion on how this can be fixed or mitigated, and what could potentially be the initial cause of this reinit_cache spike, would be terrific. Also, to be honest, I can't fathom why couch_auth_cache still references (uses?) this legacy and empty _users.couch DB, because its naming scheme suggests it is something pre-CouchDB 2.0 (not in the shards, not partitioned, and there is no timestamp in the file name). Potentially we have some bad misconfiguration somewhere and this flood of monitor calls is a ramification of it?

Steps to Reproduce
An easy way would be to merely call gen_server:call(couch_auth_cache, reinit_cache, infinity). in a loop until there are around 7-10 million duplicated monitors and you start observing slowness. But how this issue happened for us in the wild, and on multiple servers, I have no clue whatsoever so far.
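Something along these lines in the remsh should do it (a sketch; the iteration count is arbitrary):

    %% Sketch: hammer reinit_cache, then check how many monitors the
    %% couch_auth_cache process is holding.
    lists:foreach(fun(_) ->
                          gen_server:call(couch_auth_cache, reinit_cache, infinity)
                  end, lists:seq(1, 100000)).
    {monitors, M} = erlang:process_info(whereis(couch_auth_cache), monitors).
    length(M).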
Expected Behaviour

Only two monitors exist (one for the db_updater and one for couch_file) even in the face of repeated reinit_cache.

Your Environment
Erlang/OTP 21.3.8.8 / 21.3.8.11