couchbase / sync_gateway

Manages access and synchronization between Couchbase Lite and Couchbase Server
https://www.couchbase.com/products/sync-gateway
Other
448 stars 138 forks source link

Channel filtered sg-replicate stops due to channel removals #2212

Closed ArihantRk closed 7 years ago

ArihantRk commented 7 years ago

syncgateway 1.3.1

couchbase 4.1.1

Operating system : redhat 7.2

SETUP Details

Cluster1 2 syncgateway(1 : assigns channel1 and 2: assigns channel2) 2 syncgateway replicator(1 : replicates channel1 and 2: assigns channel2) 3 couchbase nodes Cluster2 2 syncgateway(1 : assigns channel3 and 2: assigns channel4) 2 syncgateway replicator(1 : replicates channel3 and 2: assigns channel4) 3 couchbase nodes

syncgateway and sg-replicate configs below https://gist.github.com/ArihantRk/40ad192427f9558451c858ccd2497b4c

Each of syncgateway differs by assigning channels, in above config syncgateway assigns channel "channel1" and sg-replicate replicates "channel1"

sg-replicates does only replicates between clusters. source : local target : fqdn(resolves to one of cluster2 sg-replicates)

Log output

we are using channel based replication in syncgateway replicator for scalability purpose

from clients each of requests hits one of syncgateway in random manner. hence revisions of same documents are assigning to different channels.

we are updating same documents within seconds and assigning random channels in each request as request are hitting different syncgateways Request: curl -vX GET 'http://localhost:4985/ptxdata/_changes?filter=sync_gateway/bychannel&channels=channels1' | grep 's91996660656_ptx1483342633728_meta.doc'

{"seq":1558,"id":"s91996660656_ptx1483342633728_meta.doc","removed":["channels1"],"changes":[{"rev":"4-54e6cabaaa63cb75c760e7d2606c7009"}]}

when we query for document for same revision, could see that document does not exists in couchbase Request : curl http://localhost:4985/ptxdata/s91996660656_ptx1483342633728_meta.doc?rev=4-54e6cabaaa63cb75c760e7d2606c7009 {"error":"not_found","reason":"missing"}

bulk get getting response as {"error":"not_found","reason":"missing"} for the same revision and same is sending to bulk post bulk_docs request: {"new_edits":false,"docs":[{"error":"not_found","id":"s91996660656_ptx1483342633728_meta.doc","reason":"missing","rev":"4-54e6cabaaa63cb75c760e7d2606c7009","status":404}]}

Expected behavior

syncgateway replicator should ignore the older revisions if not found in couchbase and replicates latest revision

Actual behavior

syncgateway replicator uses style=all_docs and limit=50 in _changes feed and could see all revisions of same documents are seeing changes feed. but older revisions are not exist in couchbase hence bulk_docs request is failing. due to this replication got stuck , no subsequent documents are not replication.

Steps to reproduce

  1. setup as above mentioned
  2. Insert and update the same document and each request should hits the different syncgateway
  3. on destination, should see bulk docs showing error as POST /ptxdata/_bulk_docs (ADMIN) BulkDocs: Doc "" --> 400 Bad _revisions (400 Bad _revisions)
adamcfraser commented 7 years ago

Additional context here: https://forums.couchbase.com/t/older-document-revision-getting-purged-from-couchbase/11347

@ArihantRk Leaving aside your particular use case, it sounds like the issue that you're hitting is that cluster replication is stopping when you get a 'not found' error in a bulk_docs request. Is that accurate?

adamcfraser commented 7 years ago

It sounds like this is a particular issue related to using channel-based replication in sg-replicate. Channel removal notifications may be sent for inactive revisions (since the removal tracks the rev id at which the doc was removed from the channel), and there's no expectation that the revision body will be available for those revisions.

Couchbase Lite clients don't attempt to retrieve the body for channel removals, so wouldn't hit this issue. I think there are two issues:

  1. Sg-replicate shouldn't stop replication based on a 'not found' for a removal notification.
  2. Should channel-based sg-replicate treat channel removals as deletes from the target environment?

@tleyden I'd be interested in your take on these two issues.

ArihantRk commented 7 years ago

Thanks for your time @adamcfraser . yes, replication is stopping when 'not found' error in bulk_docs request. i would more interested channels removal is pruning the revision in couchbase? and i could not see same issue for all documents. i have set revs_limit=5, revision going beyond 5 . older revision will get pruned. but could some documents 2nd revision is getting pruned. could you please be more detail how pruning happens in couchbase. is revs_limit controlling pruning at couchbase?

adamcfraser commented 7 years ago

You're talking about two different concepts.

Revs_limit determines the size of the revision tree - how much of the revision history is retained for a document. This isn't the full revision body - just the previous revision ids and some additional document metadata.

Bodies of non-leaf revisions aren't retained. A temporary copy is made for use by in-flight replications, but this temporary copy has an expiry of 5 minutes.

Normally non-leaf revision IDs aren't included in the changes response. The exception is channel removal notifications - the changes feed can include an old, non-leaf revision ID to inform consumers that the document was removed from the channel at the specified revision. In this scenario, it's expected that the consumer will not be able to retrieve that revision - not only because the revision body may have expired, but because the document is no longer in the channel that the consumer has access to.

That's the potential issue with channel filtered cluster replication I described above - replication shouldn't treat 'not found' for changes entries marked 'removed' as an error.

ArihantRk commented 7 years ago

@adamcfraser do you mean non-leaf revision bodies will not be retained in syncgateway? when i do CURL for older revisions from syncgateway i can able to get the full older document . does syncgateway fetching from couchbase? how many revisions will be exists in couchbase.
revs_limit is only to syncgateway or does is have any effect on couchbase also.

adamcfraser commented 7 years ago

I think the answers to most of those questions are in my previous post. It's possible that the body of the older revision is still resident in a particular Sync Gateway node's revision cache (such as the node where the revision was originally written), but there are no guarantees around the availability of the bodies of non-leaf revisions.

ArihantRk commented 7 years ago

@adamcfraser , can this fix expected in next release?

adamcfraser commented 7 years ago

I don't expect this will make it in to 1.4 - it will be in the subsequent release.

tleyden commented 7 years ago

I reproduced with these steps

Failure logs before the fix

Passing logs after the fix

Pull Request

tleyden commented 7 years ago

@ArihantRk awesome bug report btw! Hats off to that.

ddash1 commented 7 years ago

Thanks @tleyden @adamcfraser since this is high priority bug, can this is part of 1.4?

tleyden commented 7 years ago

@ddash1 we're working on it! This is a complex use-case that's currently not well supported by sg-replicate.

tleyden commented 7 years ago

@adamcfraser I ran into a complication with dealing with _bulk_docs errors. There's an existing sg-replicate test which asserts retry behavior on errors:

https://github.com/couchbaselabs/sg-replicate/blob/master/synctube_test.go#L590-L595

So it looks like some errors will need to be retried, while other errors will need to be treated as permanent and ignored.

tleyden commented 7 years ago

So it looks like some errors will need to be retried, while other errors will need to be treated as permanent and ignored.

The plan is to focus on the _bulk_get request / response rather than making any changes to the way _bulk_docs works

tleyden commented 7 years ago

I found an interesting twist to the _bulk_get behavior:

If the request does not include a "rev" field, eg:

{
    "docs":[
        {
            "id":"3d694a032fad0994a74691273478df63"
        }
}

and if the latest revision of that doc has been removed from the channel, then the response will be something like:

--c5fb685db7b62ffe23837077047ab5ba68d195b672074d046a871e345c37
Content-Type: application/json; error="true"

{"error":"forbidden","id":"3d694a032fad0994a74691273478df63","reason":"forbidden","status":403}
--c5fb685db7b62ffe23837077047ab5ba68d195b672074d046a871e345c37--

OTOH, if you add the revision to the _bulk_get request, eg:

{
    "docs":[
        {
            "id": "3d694a032fad0994a74691273478df63",
            "rev": "2-4bbb88692d20c37593326c766f205709"
        }
    ]
}

Then the response will be as follows:

--7cce5f1db5cb7d8f377f21dc905493275c658c4691b02ed87c3e83a379f0
Content-Type: application/json

{"_id":"3d694a032fad0994a74691273478df63","_removed":true,"_rev":"2-4bbb88692d20c37593326c766f205709"}
--7cce5f1db5cb7d8f377f21dc905493275c658c4691b02ed87c3e83a379f0--

It's not relevant to this ticket as far as I know, since the requests from the replication code should always contain a rev field, but I thought I'd note it anyway.

@jamiltz you might want to add some notes to the REST API docs if you don't already have some on this.

tleyden commented 7 years ago

PR is up: https://github.com/couchbaselabs/sg-replicate/issues/49

ArihantRk commented 7 years ago

@adamcfraser this issue we have observer during * channel based based replication, is it coped in 1.4?

adamcfraser commented 7 years ago

@ArihantRk If you're seeing this with a channel filter defined (but that channel filter is *), then it might be covered by the existing fix.

If you're seeing this with no channel filter defined, that would be a different issue.

ArihantRk commented 7 years ago

Thanks @adamcfraser , currently we have a channel filter defined (but that channel filter is *. Fix should be available for this..