apache / couchdb

Seamless multi-master syncing database with an intuitive HTTP/JSON API, designed for reliability
https://couchdb.apache.org/
Apache License 2.0

couch_auth_cache creates millions of monitors to _users.couch database which introduces a performance drop #2493

Closed AlexanderKaraberov closed 4 years ago

AlexanderKaraberov commented 4 years ago

Description

On the 11th of January we started to get a lot of alerts for the io_queue2.iowait.max metric, and the max IOWait time skyrocketed to almost 5 seconds in some cases:

[screenshot: io_queue2.iowait.max metric graph, 2020-01-27]

(The drop after the 21st indicates a restart of CouchDB, which made the issue vanish.) After some initial investigation in the remsh I discovered that the top process by reductions was the couch_file managing the /data/_users.couch DB, which is odd because this seems to be an empty database with only _design/auth in there, whilst we use a _users database partitioned into 8 shards under /data/shards/RANGE/_users.timestamp.couch. First I decided to log all requests for which the IOWait time was more than 1 second. All of them turned out to be pread_iolist calls to _users.couch:

[screenshot: logged pread_iolist calls to _users.couch with IOWait over 1 second]

After profiling the IOQ2 code (we enabled ioq2.bypass) I discovered that the delay happens here, in this gen_server:call. This suggested I should check the monitors of this particular couch_file process. This was not in vain, because there were a whopping 7 million of them! I grouped those by duplicates and found that the majority of them belonged to couch_auth_cache. I do not have a screenshot for that particular case because the server was restarted; however, here's the output from another affected server, where there were 2.57 million of these duplicated monitors (couch_auth_cache):

[screenshot: 2.57 million duplicated couch_auth_cache monitors]
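For anyone wanting to run the same check, here is a minimal remsh sketch (the grouping code is my own illustration, not what I ran verbatim; it assumes the cache is registered as couch_auth_cache):

```erlang
%% Count duplicated monitor targets owned by couch_auth_cache; note that
%% with millions of monitors this process_info call is itself expensive.
Pid = whereis(couch_auth_cache).
{monitors, Mons} = erlang:process_info(Pid, monitors).
Counts = lists:foldl(fun({_Type, Target}, Acc) ->
                         maps:update_with(Target, fun(N) -> N + 1 end, 1, Acc)
                     end, #{}, Mons).
%% Highest duplicate counts first:
lists:reverse(lists:keysort(2, maps:to_list(Counts))).
```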

This might explain the gen_server:call slowness. From what I've discovered, monitor and demonitor are not entirely free in the VM, because they use a red-black tree internally, and doing operations on a tree with 7+ million nodes can be somewhat slow and expensive. A quick source code check suggests that the issue might have its roots somewhere in couch_auth_cache:reinit_cache. When I removed the couch_db:monitor(AuthDb) line and then called gen_server:call(couch_auth_cache, reinit_cache, infinity). in a loop in the remsh for the sake of testing, it still created all those duplicated monitors.
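To be explicit about the mechanism I suspect, here is a toy illustration (not the actual couch_auth_cache code): every erlang:monitor/2 call creates a fresh monitor, even for the same target, so re-monitoring on each reinit without a matching demonitor accumulates duplicates:

```erlang
%% Toy leak, not the actual couch_auth_cache code: each call to
%% erlang:monitor/2 creates a fresh monitor even for the same target,
%% and the refs are dropped instead of being passed to demonitor/2.
leak(_TargetPid, 0) ->
    ok;
leak(TargetPid, N) ->
    _Ref = erlang:monitor(process, TargetPid),
    leak(TargetPid, N - 1).
```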

To verify my claim, I checked the couchdb.auth_cache_misses metric for the same server, because repeated reinits should result in a cold cache, and here we go:

[screenshot: couchdb.auth_cache_misses metric spike]

As you can observe, this gargantuan spike (from 0-1 misses per minute to thousands) correlates with the io_queue2.iowait.max metric graph provided above.
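(For reference, the same counter can be spot-checked from a remsh; couch_stats:sample/1 is what I believe the 2.x stats app exposes, so treat this as a hedged sketch:)

```erlang
%% Hedged spot-check; verify the couch_stats API against your build.
%% The same counter is also served over HTTP at
%% /_node/_local/_stats/couchdb/auth_cache_misses.
couch_stats:sample([couchdb, auth_cache_misses]).
```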

Perhaps the issue lies somewhere in the repeated calls to open_auth_db?

Any suggestions on how this can be fixed or mitigated, and what the initial cause of this reinit_cache spike could potentially be, would be terrific. Also, to be honest, I can't fathom why couch_auth_cache still references (uses?) this legacy and empty _users.couch DB, because its naming scheme suggests it is something pre-CouchDB 2.0 (not in the shards, not partitioned, and there is no timestamp in the file name). Potentially we have some bad misconfiguration somewhere and this flood of monitor calls is a ramification of it?

Steps to Reproduce

An easy way would be to merely call gen_server:call(couch_auth_cache, reinit_cache, infinity). in a loop until there are around 7-10 million duplicated monitors and you start observing slowness, as in the sketch below. But I have no clue whatsoever, so far, how this issue happened for us in the wild and on multiple servers.
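```erlang
%% Hammer reinit_cache from a remsh, then count the piled-up monitors;
%% the iteration count is arbitrary, scale it until slowness shows.
[gen_server:call(couch_auth_cache, reinit_cache, infinity)
 || _ <- lists:seq(1, 100000)].
{monitors, Mons} = erlang:process_info(whereis(couch_auth_cache), monitors).
length(Mons).
```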

Expected Behaviour

Only two monitors exist (one for db_updater and one for couch_file) even in the face of repeated reinit_cache.
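For what it's worth, the conventional shape of a fix would be to keep the monitor reference around and demonitor before re-monitoring; a sketch under that assumption (the helper name and arguments are hypothetical, not the actual couch_auth_cache internals):

```erlang
%% Hypothetical helper: demonitor the previous ref before taking a new
%% one, so repeated reinits stay at exactly one monitor per target.
reinit_monitor(undefined, TargetPid) ->
    erlang:monitor(process, TargetPid);
reinit_monitor(OldRef, TargetPid) ->
    erlang:demonitor(OldRef, [flush]),
    erlang:monitor(process, TargetPid).
```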

Your Environment

kocolosk commented 4 years ago

Fascinating. You said that

When I removed the couch_db:monitor(AuthDb) line and then called gen_server:call(couch_auth_cache, reinit_cache, infinity). in a loop in the remsh for the sake of testing, it still created all those duplicated monitors.

I can't reproduce that part on master (d59d84ac78e5). If I remove the couch_db:monitor/1 invocation and call reinit_cache in a loop, I only see the one monitor for the file descriptor at any point in time.

If, however, I leave the couch_db:monitor/1 invocation in place I do see monitors piling up against the couch_db_updater process when I call reinit in a loop. Is it possible that those monitors were the ones you saw during your debugging, and that the couch_file monitors you observed in production have a different root cause?
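(If it helps with that comparison: a monitored pid can be translated back to the module it was spawned from in a remsh, which is how to tell the couch_file monitors apart from the couch_db_updater ones.)

```erlang
%% Translates a monitored pid back to the module that spawned it,
%% e.g. {couch_file, init, 1} vs {couch_db_updater, init, 1}.
proc_lib:translate_initial_call(Pid).
```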

Taking a step back, it's not at all obvious to me that we still need couch_auth_cache as a running service, especially now that the backdoor HTTP port has been removed. I think the chttpd_auth_cache module is the one doing the actual user caching. I'd be curious to hear from others on the dev team, but I'm wondering if we can just shut this thing down. Reading through the code it's clear that this module hasn't kept up with the times.

wohali commented 4 years ago

@kocolosk, @rnewson would know for sure when it comes to the removal of 5986 and this, but is the auth all handled by chttpd, or is there a secondary check once it hits what's left of couch_httpd? If the former, I concur with you.

iilyak commented 4 years ago

Which version of Erlang are you running? The reason I ask is that there was a major issue with the behaviour of the monitor/demonitor feature in 20.3.8.5 (see ERL-751).
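(If you're unsure of the exact patch level, it can be read from a running node; the OTP_VERSION file path is standard OTP layout:)

```erlang
%% Major release only, e.g. "21":
erlang:system_info(otp_release).
%% Full patch level, e.g. <<"21.3.8.8\n">>:
file:read_file(filename:join([code:root_dir(), "releases",
                              erlang:system_info(otp_release),
                              "OTP_VERSION"])).
```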

AlexanderKaraberov commented 4 years ago

Hello @kocolosk, and thank you for the fast reply!

If I remove the couch_db:monitor/1 invocation and call reinit_cache in a loop, I only see the one monitor for the file descriptor at any point in time.

Unfortunately this is not the case for me. I'm not running current master; however, I've compared the version of couch_auth_cache from our 2.3.0 fork with the current master and there are no differences. So here's the normal scenario, right after I've launched the CouchDB server:

[screenshot: the two baseline monitors on couch_auth_cache]

You can see that there are two monitors currently active for the couch_auth_cache process: one is couch_file (<0.320.0>) and one is couch_db_updater (<0.319.0>). Now I repeatedly call reinit_cache, and both of those monitors are duplicated each time:

[screenshot: duplicated couch_file and couch_db_updater monitors after repeated reinit_cache]

Now, when I remove the couch_db:monitor/1 invocation and repeat the same procedure again, I can see that only the couch_file monitors are duplicated:

[screenshot: duplicated couch_file monitors only]

In fact this matches what I've seen on our production nodes, where there were only duplicated couch_file monitors (several million of them) but no duplicated couch_db_updater monitors.

Is it possible that those monitors were the ones you saw during your debugging, and that the couch_file monitors you observed in production have a different root cause?

I'm not fully excluding this possibility. However, one interesting correlation, as per my first post, is that the accumulation of these monitors and the subsequent slow pread_iolist calls exactly correspond to a spike in couchdb.auth_cache_misses. What else could trigger this if not a cold cache?

Perhaps I'm missing some changes between master and 2.3.0 which addressed this issue?

Taking a step back, it's not at all obvious to me that we still need couch_auth_cache as a running service, especially now that the backdoor HTTP port has been removed.

This is actually great news. If someone from the dev team can indeed confirm that I can shut this process down safely, without any impact, that would be wonderful. I'm not so interested in fixing the root cause of this, especially taking into consideration that this is an outdated (legacy) module. I merely don't want this issue to return.
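In case it's useful to others in the meantime, here is the kind of runtime mitigation I have in mind, heavily hedged: it assumes the cache is still started from the [daemons] config section under couch_secondary_services (as in our 2.3.0 build), with auth_cache as the child id.

```erlang
%% Assumed supervisor and child id (taken from the [daemons] config
%% key); verify both against your tree before running this in a remsh.
supervisor:terminate_child(couch_secondary_services, auth_cache).
supervisor:delete_child(couch_secondary_services, auth_cache).
```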

AlexanderKaraberov commented 4 years ago

Hello @iilyak,

Some servers are running 21.3.8.8 and some 21.3.8.11, so that bug shouldn't be related.

kocolosk commented 4 years ago

Reviewing the code, it seemed to me that couch_auth_cache was safe to remove as a running service, so that’s what I did in #2498. I had to disable a few tests that were written to run against the “backdoor” port, but didn’t see any issues with the rest of the test suite. If you feel like cherry-picking f4356a32d37e0c26b1423026bd17fc0c94c6dcb5, I suspect you can do so safely.

AlexanderKaraberov commented 4 years ago

Thanks! This looks good to me. I'll give it a try.

AlexanderKaraberov commented 4 years ago

I will close this ticket because the linked PR will prevent my issue from happening again in the future, and I'm glad to hear that couch_auth_cache will be removed soon. Feel free to reopen if you have any additional insights or advice.