vladimirralev opened this issue 3 years ago
hey hey, just a quick note without going into too much detail. The test you are running (creating a lot of databases in a tight loop) is not a use-case that CouchDB 3.x will be very happy with. I’m sure there are things we can improve, but this isn’t a use-case I see us optimising for a lot, unless someone contributes compelling PRs.
Thanks for the response. I think this loop is a common backup strategy: replicate all DBs to a backup server as fast as you can, typically overnight, and sometimes in parallel (see the sketch at the end of this comment).
That being said, this issue has also been observed with a gradual buildup of databases, so the rate of creating DBs in the example is probably not related to the root cause. I'll keep trying to track down the root cause, and any hints are appreciated.
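For reference, the backup loop I have in mind is roughly this shape (a sketch only; the hosts, credentials and the skip rule below are placeholders, not our actual script):

```python
import requests

SOURCE = "http://admin:password@r4-couch01:5984"   # placeholder credentials/hosts
TARGET = "http://admin:password@backup-host:5984"

# Enumerate every database on the source cluster and replicate it to the backup server.
for db in requests.get(f"{SOURCE}/_all_dbs").json():
    if db.startswith("_"):
        continue  # skip system databases such as _users and _replicator
    requests.post(
        f"{SOURCE}/_replicate",
        json={
            "source": f"{SOURCE}/{db}",
            "target": f"{TARGET}/{db}",
            "create_target": True,
        },
    ).raise_for_status()
```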
I experienced this problem as well on our clusters. After a lot of trial and error I think the problem is related to the synchronization of the _dbs internal db across nodes. It might be worth looking into that.
> That being said, this issue has also been observed with a gradual buildup of databases, so the rate of creating DBs in the example is probably not related to the root cause.
It could be related to the rate of database creation. CouchDB uses an LRU cache and keeps only a limited number of databases open. When you create databases rapidly you exceed the LRU cache size, and since most of the requests are new database creations they are never already in the cache. Once the LRU cache is over its limit, CouchDB starts closing databases.
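To illustrate the mechanism with a generic sketch (this is not CouchDB's actual code): once the table of open handles reaches its limit, every further open closes the least recently used database, so a create-heavy workload pays a close plus a cold open on essentially every request.

```python
from collections import OrderedDict

class FakeDbHandle:
    """Stand-in for an open database file handle."""
    def __init__(self, name):
        self.name = name
    def close(self):
        print(f"closing {self.name}")

class OpenDbCache:
    """Generic LRU of open database handles (illustrative only)."""
    def __init__(self, max_open):
        self.max_open = max_open
        self.open_dbs = OrderedDict()  # name -> handle, oldest first

    def open(self, name):
        if name in self.open_dbs:
            self.open_dbs.move_to_end(name)        # hit: mark as most recently used
            return self.open_dbs[name]
        if len(self.open_dbs) >= self.max_open:    # full: evict and close the LRU db
            _, lru = self.open_dbs.popitem(last=False)
            lru.close()
        handle = FakeDbHandle(name)                # stands in for the expensive real open
        self.open_dbs[name] = handle
        return handle

# Creating new databases in a tight loop means every request misses the cache,
# so each new open forces a close of some other database.
cache = OpenDbCache(max_open=3)
for i in range(6):
    cache.open(f"testdb_{i}")
```

If I remember correctly, the corresponding limit is `max_dbs_open` in the `[couchdb]` section of the config, so it may be worth checking whether raising it changes the behaviour.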
I think CouchDB 3 does eager indexing, which causes a large asynchronous CPU/IO follow-up load after a replication completes for a given DB. I have to pace the replications because of this, but that's fine and not related to the issue. I am setting up a new build here with more logging in between the lines, but so far it looks like each DB is indeed independently polling for health, or at least logging something, and that is causing the sudden spike of queued messages.
Description
I create 100,000 identical test databases with 100 documents each (or more in other tests) on a 3-node cluster. Then I bring one node down and continue to create databases on the remaining nodes. At this point creating new DBs no longer works and times out. Further tests show a gradual slowdown that becomes noticeable from about 10K DBs onwards and progresses to a completely unusable state at around 60K DBs (while a node is down). When all nodes are back up they sync and the cluster is very fast again.
Steps to Reproduce
Build a cluster with 3 machines, r4-couch01 through r4-couch03. Create 100K DBs on r4-couch01, then bring down the r4-couch03 machine and watch the script freeze. Additional replication attempts with the script fail in the same way. The issue is reproducible with a 4-node cluster as well.
I use this script to replicate the DB many times on r4-couch01.
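In outline it is a loop of this shape (a sketch, not the exact script; the host, credentials and template DB name below are placeholders):

```python
import requests

NODE = "http://admin:password@r4-couch01:5984"   # placeholder credentials/host
TEMPLATE = f"{NODE}/template_db"                  # small seed db with ~100 documents

# Replicate the same template database into many new target databases.
for i in range(100_000):
    requests.post(
        f"{NODE}/_replicate",
        json={
            "source": TEMPLATE,
            "target": f"{NODE}/testdb_{i:06d}",
            "create_target": True,
        },
    ).raise_for_status()
```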
Expected Behaviour
I expect the cluster to continue working when one node is down, and even with two nodes down given my configuration.
Your Environment
Settings are the defaults from the distro RPM on CentOS 7.
Here are the stats for one DB:
This test was done on CouchDB 3.1.1, but the same issue is present on CouchDB 2.1 as well. The only known version that doesn't suffer from this is BigCouch (0.4.1).
The same issue is present with DBs of every size tested, from 100 documents to 10K documents.
Additional Context
I tried it both with debug logging and with all logging disabled, to rule out an excessive-logging issue.
Debug logs show this:
... this goes on for a long time, and the logs stop printing at some point.
I did some remsh analysis and took a snapshot of the processes. The only interesting thing I found is that some processes had queued messages related to a node being shut down.
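As a side note for anyone reproducing this without remsh access: if I remember correctly, per-process message queue lengths are also exposed over HTTP via the _system endpoint (node name, host and credentials below are placeholders):

```python
import requests

NODE_URL = "http://admin:password@127.0.0.1:5984"  # placeholder credentials/host

# /_node/_local/_system exposes Erlang VM statistics for the local node,
# including per-process message queue lengths, without a remsh session.
stats = requests.get(f"{NODE_URL}/_node/_local/_system").json()
queues = stats.get("message_queues", {})

def queue_len(value):
    # Some entries are plain integers, others are summary objects with a count.
    return value if isinstance(value, int) else value.get("count", 0)

# Print the ten processes with the longest message queues.
for name, value in sorted(queues.items(), key=lambda kv: queue_len(kv[1]), reverse=True)[:10]:
    print(name, queue_len(value))
```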
Here are some results: