dsshane opened this issue 4 years ago
We're not trying to actively load balance here so it wouldn't make sense to add a parameter to disable anything. What ends up happening is that when we wait for quorum we essentially pick a random representative which may have references to attachments on a remote node.
However, we could add some smarts in fabric_doc_open that prefer a reply from the local node if it's part of the quorum response. It would be a non-trivial change, as there's a bit of subtlety in that logic around what gets included as the quorum representative.
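A minimal sketch of that "prefer the local reply" idea, in Python rather than CouchDB's Erlang and with invented names, so purely illustrative:

```python
def pick_representative(quorum_replies, local_node):
    """quorum_replies: list of (node, doc) pairs that agree and form the
    read quorum. Prefer the local node's reply as the representative so
    follow-up work (e.g. attachment streaming) can stay node-local;
    otherwise fall back to the first reply, as happens today."""
    for node, doc in quorum_replies:
        if node == local_node:
            return node, doc
    return quorum_replies[0]
```

For example, with agreeing replies from node2 and node1 and a coordinator running on node1, the node1 reply would be chosen as the representative even though it was listed second.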
@davisp Thanks a lot! Agreed that adding smarts in fabric_doc_open is the better approach. Would it be possible to address this in the next release, say 2.4? This problem (performance degradation in cluster mode) is preventing us from adopting cluster mode.
We won't be able to land this in the next release (3.0), but I do think an enhancement here makes sense. One technique that I've seen in other systems is
We tested this behavior with a three-zone cluster (each zone with 5 nodes, n=3, q=1, and placement of one copy in each of zones {a,b,c}). We use network impairment tools to impose a 60 ms round-trip delay (RTD) between zones. We used CouchDB 2.3.1.
1. Terms/Definition
a. Client – the host that initiates the query to CouchDB's port 5984.
b. Couchdb_QUERY_NODE – the CouchDB node in the cluster that receives the database query from the Client on port 5984. This node may or may NOT be a node that holds a shard of the database.
c. Couchdb_METALOOKUP_NODE – the CouchDB node that Couchdb_QUERY_NODE queries for some meta info (we're not sure exactly what). Couchdb_METALOOKUP_NODE is one of the Couchdb_DATA_NODES.
i. The selection of Couchdb_METALOOKUP_NODE is based on the "by_range" key in couchdb:5986/dbs/mydb; the first node in the array is picked.
d. Couchdb_DATA_NODES – the set of CouchDB nodes that actually hold a copy of the database targeted by the query.
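The selection rule in 1.c.i can be modeled in a few lines of Python. The shard-map shape (a "by_range" key mapping each range to an ordered node list) follows what :5986/dbs/mydb returns; the node names are made up:

```python
def metalookup_node(shard_map, shard_range):
    # The first node listed for the range is the one that gets queried.
    return shard_map["by_range"][shard_range][0]

shard_map = {"by_range": {"00000000-ffffffff": ["node_a", "node_b", "node_c"]}}
print(metalookup_node(shard_map, "00000000-ffffffff"))  # node_a
```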
2. General data flow we observed:
a. General data flow for doc query:
i. Client -> Couchdb_QUERY_NODE:5984
ii. If Couchdb_QUERY_NODE is NOT in Couchdb_DATA_NODES, Couchdb_QUERY_NODE -> Couchdb_METALOOKUP_NODE:11500
1. This selection is deterministic, per 1.c.i. Suppose the Couchdb_DATA_NODE in zonea for mydb is first in the "by_range" key; it will always be queried for this phase. This means queries into mydb from zonec and zoneb incur an additional 60 ms RTD of network delay compared to zonea.
iii. Couchdb_QUERY_NODE -> “three Couchdb_DATA_NODES”:11500
1. Once enough Couchdb_DATA_NODEs return data (the default read quorum is 2 when n=3), this phase stops
iv. Couchdb_QUERY_NODE->Client with query result
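The latency effect of step iii can be shown with a toy model: the query node fans out to all copies and returns once the read quorum r is met, so with r=2 it must always wait for at least one remote zone. The delay numbers below are illustrative, not measured:

```python
def quorum_wait_ms(reply_delays_ms, r=2):
    """Time until the r-th fastest reply arrives."""
    return sorted(reply_delays_ms)[r - 1]

# Local copy replies in ~1 ms; the two remote zones take ~60 ms:
print(quorum_wait_ms([1, 60, 61], r=2))  # 60 -> must wait for a remote zone
print(quorum_wait_ms([1, 60, 61], r=1))  # 1  -> r=1 returns on the local reply
```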
b. View query largely follows the same flow as the doc query, except for the following:
i. Couchdb_QUERY_NODE seems to cache the View definition/metadata
1. During the first query to /mydb/_view/myview, it retrieves the view doc following the 2.a process
a. Subsequent queries to /mydb/_view/myview bypass this.
2. When Couchdb_QUERY_NODE actually retrieves the myview result, it seems to ONLY query the Couchdb_DATA_NODES in the SAME zone as itself. This is good, as it saves inter-zone bandwidth for large result sets.
We haven't tested attachment retrieval yet, but it seems to me that it should follow the same logic as the view query in 2.b.2, if it doesn't already. We will try to test this some time next week.
Also, not sure if this needs to be a different ticket, but we would really like to see 2.a.ii.1 optimized so that it queries the Couchdb_METALOOKUP_NODE in its local zone first. Currently, we plan to work around this by changing the "by_range" order in 5986/dbs/mydb to favor the primary zone for our service.
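That reordering workaround can be sketched as follows; the node-to-zone mapping and the shard-map shape are assumptions for illustration only:

```python
def favor_zone(shard_map, node_zone, preferred_zone):
    """Return a copy of the shard map with the preferred zone's nodes moved
    to the front of each "by_range" list (stable sort keeps the relative
    order of the remaining nodes)."""
    reordered = {
        rng: sorted(nodes, key=lambda n: node_zone[n] != preferred_zone)
        for rng, nodes in shard_map["by_range"].items()
    }
    return {**shard_map, "by_range": reordered}
```

With node_zone = {"a1": "zonea", "b1": "zoneb", "c1": "zonec"} and a range listing ["a1", "b1", "c1"], favoring "zoneb" yields ["b1", "a1", "c1"], so b1 becomes the METALOOKUP pick.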
Heya @nickimho, great to see those empirical results 👍
Regarding 2.b.2, I believe the behavior depends on the query parameters. If you supply stale=ok or stable=true on your request, the view data will always be retrieved from the closest possible DATA_NODE. If you do not, CouchDB will ask every DATA_NODE for the data and stream the results from the first one to respond (which is likely to be the closest copy, but not required). It's possible in the second situation for a temporarily slow in-zone DATA_NODE to lose the race to a remote one, in which case the remote DATA_NODE will stream the view results to the QUERY_NODE.
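A hedged sketch of the two modes just described (the function and names are invented, not CouchDB internals): with stale=ok or stable=true the coordinator can deterministically take the closest copy, while without them every copy is asked and the fastest responder wins the race:

```python
def pick_view_worker(workers, local_zone, stable):
    """workers: list of (node, zone, reply_ms) candidates for one shard."""
    if stable:
        in_zone = [w for w in workers if w[1] == local_zone]
        if in_zone:
            return in_zone[0]  # deterministically use the closest copy
    # Race mode: whichever copy responds first streams the results.
    return min(workers, key=lambda w: w[2])
```

Note how a temporarily slow in-zone node, say ("a", "zonea", 80) versus ("b", "zoneb", 62), loses the race when stable is False, matching the behavior described above.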
Regarding 2.a.ii.1 ... that's a surprising result. I thought we were already preferring the closest possible METALOOKUP_NODE using this code in fabric_util:get_db/2:
Can you say how you tracked down that the traffic was flowing to the first node in the by_range array? Perhaps we've overlooked some code path in that zonal prioritization; if so, that should be a quick fix.
Oh, and regarding attachments ... unfortunately they share less in common with view streaming than you might imagine. I had to go back and re-read the code, but it looks like we select one of the DATA_NODES that contributed to the quorum from 2.a.iii.1 essentially at random and have it stream the attachment data back to the QUERY_NODE. There's no zonal affinity logic being applied.
Thanks @kocolosk for the detailed info!
I just want to say that I am not a coder, and we have been looking at this from a black-box point of view so far. So please forgive me if I am not very precise in some of the discussion or don't use the right terms =).
At a high level, here is how we set up the environment:
For the data points I provided above:
For 2.b.2, we ran the view tests without any parameters specified. We noticed that for a given mydb, one zone always performs better (by almost 60 ms). This becomes clearer if we use "time curl Couchdb_QUERY_NODE:5984/mydb/mydoc?r=1 #from a Client in the same zone" (r=1 eliminates the delay from 2.a.iii before returning the result). The preferred node (the first entry in "by_range", if I remember correctly) always gets picked as Couchdb_METALOOKUP_NODE, resulting in almost no network delay for the query, whereas the other two zones always have the added 60 ms. We then tweaked the :5986/dbs/mydb doc to rearrange the "by_range" array and saw that the Couchdb_METALOOKUP_NODE selection follows the first entry in the array.
Actually, I should check my notes later on this; there might be an exception to this:
And thanks for the insight on attachments! This is good to know. For what we do now, we should be OK (as long as this behavior also exists in BigCouch, which is what we are upgrading from). I will have my team run the analysis against BigCouch and CouchDB next week and will provide the results here. In general, we treat attachments as larger data retrievals, and the higher-layer application should have logic to handle extra delay and caching; also, we are moving attachments to external storage and just using the doc as a pointer to those external resources. We do want doc and view queries to be as optimized as possible (or at least consistent across all zones) for a better user experience.
@kocolosk
Actually, the code you linked (https://github.com/apache/couchdb/blob/dd1b2817bbf7a0efce858414310a0c822ce89468/src/fabric/src/fabric_util.erl#L93-L105) is from 2019-04-03. The build we are using, CouchDB 2.3.1, was released in February, right? So maybe we just need a newer build =)
I just looked it up, and this was the build we used:
```
rpm -qi couchdb
Name        : couchdb
Version     : 2.3.1
Release     : 1.el7
Architecture: x86_64
Install Date: Mon 25 Mar 2019 05:57:48 PM GMT
Group       : Applications/Databases
Size        : 43583049
License     : Apache License v2.0
Signature   : (none)
Source RPM  : couchdb-2.3.1-1.el7.src.rpm
Build Date  : Mon 11 Mar 2019 10:33:54 PM GMT
Build Host  : d6c906bce27a
Relocations : /opt/couchdb
Packager    : CouchDB Developers dev@couchdb.apache.org
Vendor      : The Apache Software Foundation
URL         : https://couchdb.apache.org/
Summary     : RESTful document oriented database
```
No, I just grabbed the permalink to the head of the master branch. If you do the git blame you'll see that functionality was introduced as part of the 2.0.0 release in https://github.com/apache/couchdb/commit/39c0b24c2ee07c2798375fe40b87868eefe11511
@kocolosk
We brought up the test environment and rechecked the behavior; I updated 2.a.ii in https://github.com/apache/couchdb/issues/2329#issuecomment-573249534 . The Couchdb_METALOOKUP_NODE lookup is skipped if Couchdb_QUERY_NODE itself holds the shard.
We also reconfirmed that the Couchdb_METALOOKUP_NODE selection is based on "by_range"
@kocolosk Regarding the 39c0b24 fix you mentioned, I did a quick search in the src folders and only "get_revs_limit", "get_purge_infos_limit", and "get_security" seem to use it.
Here is an example of the data sent in step 2.a.ii for Couchdb_QUERY_NODE -> Couchdb_METALOOKUP_NODE:11500. It seems to have more to do with shard lookup or with confirming the revision is up to date (I am just guessing here). The capture below was extracted using "Show Packet Bytes" in Wireshark:

```
..D ..' .w Y. ...+.H.Z.h.a.gR -. ..R.R.h.R.h.R.h.gR -. ..r .R . ... . . k b333ac5a7ah.R.R.l .m Nshards/00000000-ffffffff/account/00/08/6546027c9afebd7cf04448bb0f8c.1570251572l .h.R.adh.R.R h.R h.R R.l .m ._adminjm .defaultjj
```
Add an option to enforce fetching data from local shards, instead of from shards/copies on remote nodes (when the data is present on the local node), something like GET /mydb/doc1?preferlocal=true.
Summary
In the current software, it seems CouchDB will always try to distribute disk I/O across nodes during a query.
For example: in a two-node cluster (MACHINE1.SOMECOMPANY.COM and MACHINE2.SOMECOMPANY.COM), database mydb was created with q=8 and n=2, as follows. Each node contains a copy of all the data. Documents in the example database contain large binary data (i.e., as attachments).
When querying/fetching data from mydb, it seems CouchDB may load some data from local shards and some from shards on the remote machine. Due to the large document size (and thus the heavy network traffic to transfer data from the remote node to the coordinating node), the query speed is much slower than a standalone deployment.
Queries are 2+ times faster with the standalone deployment for the same testing data.
Adding ?r=1 did not help.
Possible Solution
It would be very helpful to add an option in the query request, something like GET /mydb/id0?preferlocal=true, to enforce fetching data from shards on the local node when the data is present locally, and otherwise from remote shards.
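A minimal sketch of the proposed selection rule (the parameter name and this function are hypothetical, not an existing CouchDB API):

```python
def choose_read_node(shard_nodes, coordinating_node, prefer_local):
    """Read from the coordinating node's own copy when it holds one and
    ?preferlocal=true was requested; otherwise keep today's behavior
    (modeled here as simply taking the first candidate)."""
    if prefer_local and coordinating_node in shard_nodes:
        return coordinating_node
    return shard_nodes[0]
```

With n=2 and both machines holding every shard, prefer_local=True would keep all reads on the coordinating node and avoid the inter-node transfer of large attachments described above.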
Additional context
Software tested: CouchDB 2.2
OS: Linux