dsshane opened this issue 4 years ago
We're not trying to actively load balance here so it wouldn't make sense to add a parameter to disable anything. What ends up happening is that when we wait for quorum we essentially pick a random representative which may have references to attachments on a remote node.
However, we could add some smarts in fabric_doc_open that prefer a reply from the local node if it's part of the quorum response. It would be a non-trivial change, as there's a bit of subtlety in that logic around what gets included as the quorum representative.
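A minimal sketch of that "prefer the local reply" idea, in Python rather than CouchDB's Erlang and with invented names, so purely illustrative:

```python
def pick_representative(quorum_replies, local_node):
    """quorum_replies: list of (node, doc) pairs that agree and form the
    read quorum. Prefer the local node's reply as the representative so
    follow-up work (e.g. attachment streaming) can stay node-local;
    otherwise fall back to the first reply, as happens today."""
    for node, doc in quorum_replies:
        if node == local_node:
            return node, doc
    return quorum_replies[0]
```

For example, with agreeing replies from node2 and node1 and a coordinator running on node1, the node1 reply would be chosen as the representative even though it was listed second.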
@davisp Thanks a lot! Agreed that adding smarts in fabric_doc_open is the better approach. Would it be possible to address this in the next release, say 2.4? This problem (performance degradation in cluster mode) is preventing us from adopting cluster mode.
We won't be able to land this in the next release (3.0), but I do think an enhancement here makes sense. One technique that I've seen in other systems is
We tested this behavior with a three-zone cluster (each zone with 5 nodes, n=3, q=1, and placement of one copy in each of zones {a,b,c}). We use network impairment tools to impose a 60 ms round-trip delay (RTD) between zones. We used CouchDB 2.3.1.
1. Terms/Definition
a. Client – the host that initiates the query to CouchDB's port 5984.
b. Couchdb_QUERY_NODE – the CouchDB node in the cluster that receives the database query from the Client on port 5984. This node may or may NOT be a node that holds a shard of the database.
c. Couchdb_METALOOKUP_NODE – the CouchDB node that Couchdb_QUERY_NODE queries for some meta info (we're not sure exactly what). Couchdb_METALOOKUP_NODE is one of the Couchdb_DATA_NODES.
i. The selection of Couchdb_METALOOKUP_NODE is based on the "by_range" key in couchdb:5986/dbs/mydb; the first node in the array is picked.
d. Couchdb_DATA_NODES – the set of CouchDB nodes that actually hold a copy of the database targeted by the query.
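The selection rule in 1.c.i can be modeled in a few lines of Python. The shard-map shape (a "by_range" key mapping each range to an ordered node list) follows what :5986/dbs/mydb returns; the node names are made up:

```python
def metalookup_node(shard_map, shard_range):
    # The first node listed for the range is the one that gets queried.
    return shard_map["by_range"][shard_range][0]

shard_map = {"by_range": {"00000000-ffffffff": ["node_a", "node_b", "node_c"]}}
print(metalookup_node(shard_map, "00000000-ffffffff"))  # node_a
```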
2. General data flow we observed:
a. General data flow for doc query:
i. Client -> Couchdb_QUERY_NODE:5984
ii. If Couchdb_QUERY_NODE is NOT in Couchdb_DATA_NODES, Couchdb_QUERY_NODE -> Couchdb_METALOOKUP_NODE:11500
1. This selection is deterministic, per 1.c.i. Suppose the Couchdb_DATA_NODE in zonea for mydb is first in the "by_range" key; it will always be queried for this phase. This means queries into mydb from zonec and zoneb incur an additional 60 ms RTD of network delay compared to zonea.
iii. Couchdb_QUERY_NODE -> “three Couchdb_DATA_NODES”:11500
1. Once enough Couchdb_DATA_NODEs return data (the default read quorum is 2 when n=3), this phase stops
iv. Couchdb_QUERY_NODE->Client with query result
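The latency effect of step iii can be shown with a toy model: the query node fans out to all copies and returns once the read quorum r is met, so with r=2 it must always wait for at least one remote zone. The delay numbers below are illustrative, not measured:

```python
def quorum_wait_ms(reply_delays_ms, r=2):
    """Time until the r-th fastest reply arrives."""
    return sorted(reply_delays_ms)[r - 1]

# Local copy replies in ~1 ms; the two remote zones take ~60 ms:
print(quorum_wait_ms([1, 60, 61], r=2))  # 60 -> must wait for a remote zone
print(quorum_wait_ms([1, 60, 61], r=1))  # 1  -> r=1 returns on the local reply
```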
b. View query largely follows the same flow as the doc query, except for the following:
i. Couchdb_QUERY_NODE seems to cache the View definition/metadata
1. During the first query to /mydb/_view/myview, it retrieves the view doc following the 2.a process
a. Subsequent queries to /mydb/_view/myview bypass this.
2. When Couchdb_QUERY_NODE actually retrieves the myview result, it seems to ONLY query the Couchdb_DATA_NODES in the SAME zone as itself. This is good, as it saves inter-zone bandwidth for large result sets.
We haven't tested attachment retrieval yet, but it seems to me that it should follow the same logic as the view query in 2.b.2, if it doesn't already. We will try to test this some time next week.
Also, not sure if this needs to be a different ticket, but we would really like to see 2.a.ii.1 optimized so that it queries the Couchdb_METALOOKUP_NODE in its local zone first. Currently, we plan to work around this by changing the "by_range" order in 5986/dbs/mydb to favor the primary zone for our service.
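That reordering workaround can be sketched as follows; the node-to-zone mapping and the shard-map shape are assumptions for illustration only:

```python
def favor_zone(shard_map, node_zone, preferred_zone):
    """Return a copy of the shard map with the preferred zone's nodes moved
    to the front of each "by_range" list (stable sort keeps the relative
    order of the remaining nodes)."""
    reordered = {
        rng: sorted(nodes, key=lambda n: node_zone[n] != preferred_zone)
        for rng, nodes in shard_map["by_range"].items()
    }
    return {**shard_map, "by_range": reordered}
```

With node_zone = {"a1": "zonea", "b1": "zoneb", "c1": "zonec"} and a range listing ["a1", "b1", "c1"], favoring "zoneb" yields ["b1", "a1", "c1"], so b1 becomes the METALOOKUP pick.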
Heya @nickimho, great to see those empirical results 👍
Regarding 2.b.2, I believe the behavior depends on the query parameters. If you supply stale=ok or stable=true on your request, the view data will always be retrieved from the closest possible DATA_NODE. If you do not, CouchDB will ask every DATA_NODE for the data and stream the results from the first one to respond (which is likely to be the closest copy, but not required). It's possible in the second situation for a temporarily slow in-zone DATA_NODE to lose the race to a remote one, in which case the remote DATA_NODE will stream the view results to the QUERY_NODE.
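A hedged sketch of the two modes just described (the function and names are invented, not CouchDB internals): with stale=ok or stable=true the coordinator can deterministically take the closest copy, while without them every copy is asked and the fastest responder wins the race:

```python
def pick_view_worker(workers, local_zone, stable):
    """workers: list of (node, zone, reply_ms) candidates for one shard."""
    if stable:
        in_zone = [w for w in workers if w[1] == local_zone]
        if in_zone:
            return in_zone[0]  # deterministically use the closest copy
    # Race mode: whichever copy responds first streams the results.
    return min(workers, key=lambda w: w[2])
```

Note how a temporarily slow in-zone node, say ("a", "zonea", 80) versus ("b", "zoneb", 62), loses the race when stable is False, matching the behavior described above.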
Regarding 2.a.ii.1 ... that's a surprising result. I thought we were already preferring the closest possible METALOOKUP_NODE using this code in fabric_util:get_db/2:
Can you say how you tracked down that the traffic was flowing to the first node in the by_range array? Perhaps we've overlooked some code path in that zonal prioritization; if so, that should be a quick fix.
Oh, and regarding attachments ... unfortunately they share less in common with view streaming than you might imagine. I had to go back and re-read the code, but it looks like we select one of the DATA_NODES that contributed to the quorum from 2.a.iii.1 essentially at random and have it stream the attachment data back to the QUERY_NODE. There's no zonal affinity logic being applied.
Thanks @kocolosk for the detailed info!
I just want to say that I am not a coder, and we have been looking at this from a black-box point of view so far. So please forgive me if I am not very precise in some of the discussion or don't use the right terms =).
At a high level, here is how we set up the environment:
For the data points I provided above:
For 2.b.2, we ran the view tests without any parameters specified. We noticed that for a given mydb, one zone always performs better (by almost 60 ms). This becomes clearer if we use "time curl Couchdb_QUERY_NODE:5984/mydb/mydoc?r=1 #from a Client in the same zone" (r=1 eliminates the delay from 2.a.iii before returning the result). The preferred node (the first entry in "by_range", if I remember correctly) always gets picked as Couchdb_METALOOKUP_NODE, resulting in almost no network delay for the query, whereas the other two zones always have the added 60 ms. We then tweaked the :5986/dbs/mydb doc to rearrange the "by_range" array and saw that the Couchdb_METALOOKUP_NODE selection follows the first entry in the array.
Actually, I should check my notes later on this; there might be an exception to this:
And thanks for the insight on attachments! This is good to know. For what we do now, we should be OK (as long as this behavior also exists in BigCouch, which is what we are upgrading from). I will have my team run the analysis against BigCouch and CouchDB next week and will provide the results here. In general, we treat attachments as larger data retrievals, and the higher-layer application should have logic to handle extra delay and caching; also, we are moving attachments to external storage and just using the doc as a pointer to those external resources. We do want doc and view queries to be as optimized as possible (or at least consistent across all zones) for a better user experience.
@kocolosk
Actually, the code you linked (https://github.com/apache/couchdb/blob/dd1b2817bbf7a0efce858414310a0c822ce89468/src/fabric/src/fabric_util.erl#L93-L105) is from 2019-04-03. The build we are using, CouchDB 2.3.1, was released in February, right? So maybe we just need a newer build =)
I just looked it up, and this was the build we used:
```
rpm -qi couchdb
Name        : couchdb
Version     : 2.3.1
Release     : 1.el7
Architecture: x86_64
Install Date: Mon 25 Mar 2019 05:57:48 PM GMT
Group       : Applications/Databases
Size        : 43583049
License     : Apache License v2.0
Signature   : (none)
Source RPM  : couchdb-2.3.1-1.el7.src.rpm
Build Date  : Mon 11 Mar 2019 10:33:54 PM GMT
Build Host  : d6c906bce27a
Relocations : /opt/couchdb
Packager    : CouchDB Developers dev@couchdb.apache.org
Vendor      : The Apache Software Foundation
URL         : https://couchdb.apache.org/
Summary     : RESTful document oriented database
```
No, I just grabbed the permalink to the head of the master branch. If you do the git blame you'll see that functionality was introduced as part of the 2.0.0 release in https://github.com/apache/couchdb/commit/39c0b24c2ee07c2798375fe40b87868eefe11511
@kocolosk
We brought up the test environment and rechecked the behavior; I updated 2.a.ii in https://github.com/apache/couchdb/issues/2329#issuecomment-573249534 . The Couchdb_METALOOKUP_NODE lookup is skipped if Couchdb_QUERY_NODE itself holds the shard.
We also reconfirmed that the Couchdb_METALOOKUP_NODE selection is based on "by_range"
@kocolosk Regarding the 39c0b24 fix you mentioned, I did a quick search in the src folders and only "get_revs_limit", "get_purge_infos_limit", and "get_security" seem to use it.
Here is an example of the data sent in step 2.a.ii for Couchdb_QUERY_NODE -> Couchdb_METALOOKUP_NODE:11500. It seems to have more to do with shard lookup or with confirming the revision is up to date (I am just guessing here). The capture below was extracted using "Show Packet Bytes" in Wireshark:

```
..D ..' .w Y. ...+.H.Z.h.a.gR -. ..R.R.h.R.h.R.h.gR -. ..r .R . ... . . k b333ac5a7ah.R.R.l .m Nshards/00000000-ffffffff/account/00/08/6546027c9afebd7cf04448bb0f8c.1570251572l .h.R.adh.R.R h.R h.R R.l .m ._adminjm .defaultjj
```
Add an option to enforce fetching data from local shards, instead of from shards/copies on remote nodes (when the data is present on the local node), something like GET /mydb/doc1?preferlocal=true.
Summary
In the current software, it seems CouchDB will always try to distribute disk I/O across nodes during a query.
For example: in a two-node cluster (MACHINE1.SOMECOMPANY.COM and MACHINE2.SOMECOMPANY.COM), database mydb was created with q=8 and n=2, as follows. Each node contains a copy of all the data. Documents in the example database contain large binary data (i.e., as attachments).
When querying/fetching data from mydb, it seems CouchDB may load some data from local shards and some from shards on the remote machine. Due to the large document size (and thus the heavy network traffic to transfer data from the remote node to the coordinating node), the query speed is much slower than a standalone deployment.
Queries are 2+ times faster with the standalone deployment for the same testing data.
Adding ?r=1 did not help.
Possible Solution
It would be very helpful to add an option in the query request, something like GET /mydb/id0?preferlocal=true, to enforce fetching data from shards on the local node when the data is present locally, and otherwise from remote shards.
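A minimal sketch of the proposed selection rule (the parameter name and this function are hypothetical, not an existing CouchDB API):

```python
def choose_read_node(shard_nodes, coordinating_node, prefer_local):
    """Read from the coordinating node's own copy when it holds one and
    ?preferlocal=true was requested; otherwise keep today's behavior
    (modeled here as simply taking the first candidate)."""
    if prefer_local and coordinating_node in shard_nodes:
        return coordinating_node
    return shard_nodes[0]
```

With n=2 and both machines holding every shard, prefer_local=True would keep all reads on the coordinating node and avoid the inter-node transfer of large attachments described above.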
Additional context
Software tested: CouchDB 2.2
OS: Linux