@adamcfraser after this change, should it be possible to remove this type assertion?
func (k *KvChangeIndexWriter) rollbackView(vbNo uint16, seq uint64) (channelNames []string, err error) {
	channelNames = make([]string, 0)
	startKey := []interface{}{vbNo, seq}
	endKey := []interface{}{vbNo, uint64(math.MaxUint64)}
	if goCbBucket, err := base.GetGoCBBucketFromBaseBucket(k.indexWriteBucket); err == nil {
		viewQuery := gocb.NewViewQuery("sg-accel", "rollback")
		viewQuery.Range(startKey, endKey, true)
		viewQuery.Stale(gocb.Before)
		results, err := goCbBucket.Bucket.ExecuteViewQuery(viewQuery)
		if err != nil {
			return channelNames, err
		}
		if results != nil {
			var vrow struct {
				Value string
			}
			for results.Next(&vrow) {
				channelNames = append(channelNames, vrow.Value)
				rollbackViewRows.Add(int64(1))
			}
			// Any error processing view results is returned on Close
			err := results.Close()
			if err != nil {
				return channelNames, err
			}
		}
	} else {
		// walrus/go-couchbase view retrieval
		var vres struct {
			Rows []struct {
				Value string
			}
		}
		opts := map[string]interface{}{"stale": false, "startkey": startKey, "endkey": endKey}
		err := k.indexWriteBucket.ViewCustom("sg-accel", "rollback", opts, &vres)
		if err != nil {
			return channelNames, err
		}
		for _, row := range vres.Rows {
			channelNames = append(channelNames, row.Value)
		}
		rollbackViewRows.Add(int64(len(vres.Rows)))
	}
	return channelNames, nil
}
and just call down to the ViewCustom() method on the bucket, even if it's a gocb bucket? I noticed this while looking at the places where we do type assertions on buckets.
@tleyden ViewCustom is currently a no-op on the gocb bucket. If I recall correctly, I looked at putting the above code into ViewCustom originally, but the method signatures weren't an ideal fit.
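For reference, here's a rough sketch of what pushing the gocb branch above into a ViewCustom-style method could look like, which also shows where the fit is awkward: gocb streams typed rows via ViewResults.Next(), so populating an arbitrary vres interface{} means buffering the rows and round-tripping through JSON. The CouchbaseBucketGoCB receiver and its embedded Bucket field are assumptions based on the snippet above, the params translation is elided, and "encoding/json" is assumed to be imported:

// ViewCustom sketch for a gocb-backed bucket (illustrative only).
func (b *CouchbaseBucketGoCB) ViewCustom(ddoc, name string, params map[string]interface{}, vres interface{}) error {
	viewQuery := gocb.NewViewQuery(ddoc, name)
	// ... translate params (stale, startkey, endkey, ...) onto viewQuery here ...

	results, err := b.Bucket.ExecuteViewQuery(viewQuery)
	if err != nil {
		return err
	}

	// Buffer rows into a generic structure shaped like the go-couchbase
	// view response ({"rows": [...]}).
	generic := struct {
		Rows []interface{} `json:"rows"`
	}{}
	var row interface{}
	for results.Next(&row) {
		generic.Rows = append(generic.Rows, row)
		row = nil
	}
	if err := results.Close(); err != nil {
		return err
	}

	// Round-trip through JSON to populate the caller-supplied vres struct.
	raw, err := json.Marshal(generic)
	if err != nil {
		return err
	}
	return json.Unmarshal(raw, vres)
}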
Does gocb have or need the equivalent of go-couchbase Refresh()? Or should I ask the SDK team?
I'm thinking Refresh() can be a no-op in the gocb bucket -- currently it panics.
@tleyden I think we're currently only using Refresh when trying to reconnect to a bucket that we've lost connection to. I think we need some way to do the equivalent in gocb, but we might be able to just do something manually on the SG side (drop and reconnect?).
I think it would be good to describe the use case to the SDK team and see what the recommended set of bucket operations would be.
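For illustration, the manual "drop and reconnect" idea could look something like this with the gocb v1 API. This is only a sketch: it assumes we still have the connection string, bucket name, and password around, and error handling is simplified:

// reconnectGoCBBucket drops the existing gocb bucket connection (if any) and
// opens a fresh one from the cluster.
func reconnectGoCBBucket(connStr, bucketName, password string, old *gocb.Bucket) (*gocb.Bucket, error) {
	if old != nil {
		_ = old.Close() // best-effort close of the stale connection
	}
	cluster, err := gocb.Connect(connStr)
	if err != nil {
		return nil, err
	}
	return cluster.OpenBucket(bucketName, password)
}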
Posted a question to the gosdk forum
Investigate this failure:
pytest -p no:sugar -s --skip-provisioning --mode=cc --server-version=4.6.1 --sync-gateway-version=9ef50d1e967882af2022d847dfb89d71cbce783d -k test_multiple_dbs_unique_buckets_lose_tap[bucket_online_offline/bucket_online_offline_multiple_dbs_unique_buckets-100] testsuites/syncgateway/functional/tests/
Why is this returning a 404 instead of reporting the db as offline?
2017-04-17T18:22:31.371-04:00 HTTP: #5100: GET /db1/ (ADMIN)
2017-04-17T18:22:31.371-04:00 HTTP: #5100: --> 404 no such database "db1" (0.1 ms)
Sync Gateway logs when bucket deleted by test:
_time=2017-04-17T18:06:15.382-04:00 _level=INFO _msg= Trying with http://192.168.33.10:8091/pools/default/bucketsStreaming/data-bucket-1
_time=2017-04-17T18:06:15.385-04:00 _level=ERROR _msg=Failed to connect to host, unexpected status code: 401. Body
_time=2017-04-17T18:06:15.385-04:00 _level=INFO _msg= Trying with http://192.168.33.10:8091/pools/default/bucketsStreaming/data-bucket-1
_time=2017-04-17T18:06:15.387-04:00 _level=ERROR _msg=Failed to connect to host, unexpected status code: 401. Body
_time=2017-04-17T18:06:15.387-04:00 _level=INFO _msg= Trying with http://192.168.33.10:8091/pools/default/bucketsStreaming/data-bucket-1
_time=2017-04-17T18:06:15.388-04:00 _level=ERROR _msg=Failed to connect to host, unexpected status code: 401. Body
_time=2017-04-17T18:06:15.388-04:00 _level=INFO _msg= Trying with http://192.168.33.10:8091/pools/default/bucketsStreaming/data-bucket-1
_time=2017-04-17T18:06:15.390-04:00 _level=ERROR _msg=Failed to connect to host, unexpected status code: 401. Body
_time=2017-04-17T18:06:15.390-04:00 _level=ERROR _msg= Maximum failures reached. Exiting loop...
2017-04-17T18:06:15.390-04:00 WARNING: Bucket Updater for bucket data-bucket-1 returned error: Max failures reached. Last Error Failed to connect to host. Status 401 Body -- base.GetCouchbaseBucket.func1() at bucket.go:476
2017-04-17T18:06:15.390-04:00 WARNING: Lost TAP feed for bucket data-bucket-1, with error: Max failures reached. Last Error Failed to connect to host. Status 401 Body -- rest.(*ServerContext)._getOrAddDatabaseFromConfig.func1() at server_context.go:351
2017-04-17T18:06:15.390-04:00 CRUD: Connection to TAP feed for db1 re-established, bringing DB back online
_time=2017-04-17T18:06:15.521-04:00 _level=INFO _msg=go-couchbase: TAP connection lost; reconnecting to bucket "data-bucket-1" in 1s
_time=2017-04-17T18:06:16.440-04:00 _level=ERROR _msg= Unable to decode response EOF
_time=2017-04-17T18:06:16.440-04:00 _level=INFO _msg= Trying with http://192.168.33.10:8091/pools/default/bucketsStreaming/data-bucket-3
_time=2017-04-17T18:06:16.443-04:00 _level=ERROR _msg=Failed to connect to host, unexpected status code: 401. Body
_time=2017-04-17T18:06:16.443-04:00 _level=INFO _msg= Trying with http://192.168.33.10:8091/pools/default/bucketsStreaming/data-bucket-3
_time=2017-04-17T18:06:16.444-04:00 _level=ERROR _msg=Failed to connect to host, unexpected status code: 401. Body
_time=2017-04-17T18:06:16.444-04:00 _level=INFO _msg= Trying with http://192.168.33.10:8091/pools/default/bucketsStreaming/data-bucket-3
_time=2017-04-17T18:06:16.446-04:00 _level=ERROR _msg=Failed to connect to host, unexpected status code: 401. Body
_time=2017-04-17T18:06:16.446-04:00 _level=INFO _msg= Trying with http://192.168.33.10:8091/pools/default/bucketsStreaming/data-bucket-3
_time=2017-04-17T18:06:16.448-04:00 _level=ERROR _msg=Failed to connect to host, unexpected status code: 401. Body
_time=2017-04-17T18:06:16.448-04:00 _level=ERROR _msg= Maximum failures reached. Exiting loop...
2017-04-17T18:06:16.448-04:00 WARNING: Bucket Updater for bucket data-bucket-3 returned error: Max failures reached. Last Error Failed to connect to host. Status 401 Body -- base.GetCouchbaseBucket.func1() at bucket.go:476
2017-04-17T18:06:16.448-04:00 WARNING: Lost TAP feed for bucket data-bucket-3, with error: Max failures reached. Last Error Failed to connect to host. Status 401 Body -- rest.(*ServerContext)._getOrAddDatabaseFromConfig.func1() at server_context.go:351
2017-04-17T18:06:16.448-04:00 CRUD: Connection to TAP feed for db3 re-established, bringing DB back online
_time=2017-04-17T18:06:16.523-04:00 _level=INFO _msg=go-couchbase: TAP connection lost; reconnecting to bucket "data-bucket-1" in 2s
_time=2017-04-17T18:06:16.547-04:00 _level=INFO _msg=go-couchbase: TAP connection lost; reconnecting to bucket "data-bucket-3" in 1s
2017-04-17T18:06:17.480-04:00 HTTP: #3257: PUT /db2/_role/radio_stations (ADMIN)
2017-04-17T18:06:17.481-04:00 HTTP+: #3257: --> 200 OK (0.6 ms)
2017-04-17T18:06:17.484-04:00 HTTP: #3258: GET /db2/_role/ (ADMIN)
_time=2017-04-17T18:06:17.550-04:00 _level=INFO _msg=go-couchbase: TAP connection lost; reconnecting to bucket "data-bucket-3" in 2s
I'm also seeing view_undefined errors in the logs while running the above test:
2017-04-17T18:05:40.978-04:00 WARNING: Error trying to query ViewPrincipals: unknown_error - view_undefined -- rest.collectAccessRelatedWarnings() at server_context.go:915
and when I try to run Sync Gateway locally, I'm seeing it fail on startup, also due to view_undefined errors:
2017-04-17T17:39:32.600-07:00 Bucket: ViewCustom("sync_gateway", "access", ...) [24.074254ms]
2017-04-17T17:39:32.600-07:00 WARNING: channelComputer.ComputeChannelsForPrincipal failed on &{{foo 0 <nil>} { false [] <nil> []} 0xc4202e0fa0 []}: unknown_error - view_undefined -- auth.(*Authenticator).rebuildChannels() at auth.go:156
2017-04-17T17:39:32.600-07:00 FATAL: Error opening database: Couldn't create user "foo": unknown_error - view_undefined -- rest.RunServer() at config.go:749
Restarting Sync Gateway fixes the above error.
Logs: https://gist.github.com/tleyden/4e9c5e00b0c7e75bf96dc25b6975b083
I remember seeing this at one point and intending to go back to it.
I haven't seen the view_undefined error again, even after attempts to reproduce it by creating an empty bucket and restarting Sync Gateway. Keeping an eye out for it for now.
With these steps:
I'm getting a 404 when going to http://localhost:4985/db/, and seeing the following log output:
https://gist.github.com/tleyden/2ebb1d275c8d1cbdd63b891f3c3d7f3b
I will try comparing the behavior against the master branch to see where it differs.
On the master branch, after trying to repro with the above steps, going to http://localhost:4985/db/ returns:
{
    "committed_update_seq": 0,
    "compact_running": false,
    "db_name": "db",
    "disk_format_version": 0,
    "instance_start_time": 1492482029507746,
    "purge_seq": 0,
    "state": "Offline",
    "update_seq": 0
}
Here are the full SG logs:
https://gist.github.com/tleyden/4e620ab67f6780eb04833bb172eff4c3
Since there are lots of calls to Refresh(), it's probably related to the fact that Refresh() is a no-op on the GoCBHybridBucket.
Commit 455c97eb1b280b63677019ca76c60518dd522333 fixes the issue mentioned in https://github.com/couchbase/sync_gateway/issues/2423#issuecomment-294651245, but this test is still failing:
pytest -p no:sugar -s --skip-provisioning --mode=cc --server-version=4.6.1 --sync-gateway-version=9ef50d1e967882af2022d847dfb89d71cbce783d -k test_multiple_dbs_unique_buckets_lose_tap[bucket_online_offline/bucket_online_offline_multiple_dbs_unique_buckets-100] testsuites/syncgateway/functional/tests/
However, since go-couchbase is being used for the tap feed, I think the hybrid bucket might still want to call down to the go-couchbase Refresh() rather than using the gocb bucket's Refresh().
In a551f976cb482b3d3167c2dccd891c4421d0d0f1 I switched back to the go-couchbase Refresh() implementation, and now the bucket_online_offline_multiple_dbs_unique_buckets-100 functional test passes.
I'm running the functional tests against the latest commit, and seeing more failures: http://uberjenkins.sc.couchbase.com/job/cen7-sync-gateway-functional-tests-base-cc/470/console
This failed locally for me as well:
pytest -p no:sugar -s --skip-provisioning --mode=cc -k test_bucket_online_offline_resync_sanity[bucket_online_offline/db_online_offline_access_all-5-100-10] testsuites/syncgateway/functional/tests/
Todo:
In func (bucket CouchbaseBucket) StartDCPFeed(, force it to use a go-couchbase bucket, since we always want to use go-couchbase for DCP feeds.
Change db.ConnectToBucket(spec, func(bucket string, err error) { to db.ConnectToBucket(spec, func(bucket CouchbaseBucket, err error) { so that it calls back to the go-couchbase Refresh() rather than the gocb Refresh().
Functional tests in progress:
http://uberjenkins.sc.couchbase.com/job/cen7-sync-gateway-functional-tests-base-cc/487/console
I see one failure:
testsuites/syncgateway/functional/tests/test_db_online_offline.py::test_multiple_dbs_unique_buckets_lose_tap[bucket_online_offline/bucket_online_offline_multiple_dbs_unique_buckets-100] (fixtures used: environment, num_docs, params_from_base_suite_setup, params_from_base_test_setup, sg_conf_name)FAILED
but running locally passed:
pytest -s --mode=cc --skip-provisioning -k "test_multiple_dbs_unique_buckets_lose_tap[bucket_online_offline/bucket_online_offline_multiple_dbs_unique_buckets-100]" testsuites/syncgateway/functional/tests/
SG Accel not starting up due to this error:
2017-04-20T19:04:36.920-07:00 FATAL: Error initializing SG Accel context: Error initializing CBGT: GoCB bucket does not expose UUID -- sg_accel.RunServer() at server.go:48
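Not a fix, but for context: the bucket UUID is available from the server's management REST API even when the gocb bucket object doesn't expose it. A rough sketch (names and error handling simplified; assumes "net/http" and "encoding/json" are imported):

// fetchBucketUUID reads the bucket UUID from /pools/default/buckets/<name>,
// whose JSON response includes a "uuid" field.
func fetchBucketUUID(serverURL, bucketName, username, password string) (string, error) {
	req, err := http.NewRequest("GET", serverURL+"/pools/default/buckets/"+bucketName, nil)
	if err != nil {
		return "", err
	}
	req.SetBasicAuth(username, password)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var body struct {
		UUID string `json:"uuid"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return "", err
	}
	return body.UUID, nil
}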
This test is failing on jenkins and locally:
pytest -s --mode=cc --skip-provisioning -k "test_user_views_sanity[user_views/user_views]" testsuites/syncgateway/functional/tests/
and digging into it, I see this panic in the sg logs:
2017-04-20 10:06:51.829211 I | http: panic serving 192.168.33.1:56195: interface conversion: interface is string, not bool
goroutine 480 [running]:
net/http.(*conn).serve.func1(0xc42030e300)
/usr/local/go/src/net/http/server.go:1491 +0x12a
panic(0xad3880, 0xc42011eb00)
/usr/local/go/src/runtime/panic.go:458 +0x243
github.com/couchbase/sync_gateway/base.applyViewQueryOptions(0xc42011eac0, 0xc420f1e690)
/home/centos/godeps/src/github.com/couchbase/sync_gateway/base/bucket_gocb.go:1553 +0x1419
github.com/couchbase/sync_gateway/base.CouchbaseBucketGoCB.View(0xc42007f200, 0xc42010bc40, 0x19, 0xb91d43, 0x7, 0xc4200ffa40, 0xb, 0xc4200ffda3, 0x3, 0xf39900, ...)
/home/centos/godeps/src/github.com/couchbase/sync_gateway/base/bucket_gocb.go:1360 +0x17d
github.com/couchbase/sync_gateway/base.(*CouchbaseBucketGoCB).View(0xc42007f290, 0xc42011e5d0, 0xa, 0xc42011e5e1, 0x8, 0xc420f1e690, 0x0, 0x0, 0x0, 0x0, ...)
<autogenerated>:482 +0xfa
github.com/couchbase/sync_gateway/db.(*Database).QueryDesignDoc(0xc421f59620, 0xc42011e5d0, 0xa, 0xc42011e5e1, 0x8, 0xc420f1e690, 0x3, 0x0, 0x316a203800512ac8)
/home/centos/godeps/src/github.com/couchbase/sync_gateway/db/design_doc.go:190 +0xff
github.com/couchbase/sync_gateway/rest.(*handler).handleView(0xc420d0aab0, 0xc421f59620, 0x0)
/home/centos/godeps/src/github.com/couchbase/sync_gateway/rest/view_api.go:105 +0x594
github.com/couchbase/sync_gateway/rest.(*handler).invoke(0xc420d0aab0, 0xc4c5c8, 0x0, 0x0)
/home/centos/godeps/src/github.com/couchbase/sync_gateway/rest/handler.go:221 +0x2c1
github.com/couchbase/sync_gateway/rest.makeHandler.func1(0xf3fd40, 0xc422166a90, 0xc423568d20)
/home/centos/godeps/src/github.com/couchbase/sync_gateway/rest/handler.go:96 +0x80
net/http.HandlerFunc.ServeHTTP(0xc4203ff9a0, 0xf3fd40, 0xc422166a90, 0xc423568d20)
/usr/local/go/src/net/http/server.go:1726 +0x44
github.com/gorilla/mux.(*Router).ServeHTTP(0xc4203082d0, 0xf3fd40, 0xc422166a90, 0xc423568d20)
/home/centos/godeps/src/github.com/gorilla/mux/mux.go:86 +0x204
github.com/couchbase/sync_gateway/rest.wrapRouter.func1(0xf3fd40, 0xc422166a90, 0xc423568d20)
/home/centos/godeps/src/github.com/couchbase/sync_gateway/rest/routing.go:297 +0x66f
net/http.HandlerFunc.ServeHTTP(0xc4203368a0, 0xf3fd40, 0xc422166a90, 0xc423568d20)
/usr/local/go/src/net/http/server.go:1726 +0x44
net/http.serverHandler.ServeHTTP(0xc4203a7980, 0xf3fd40, 0xc422166a90, 0xc423568d20)
/usr/local/go/src/net/http/server.go:2202 +0x7d
net/http.(*conn).serve(0xc42030e300, 0xf40900, 0xc42011e580)
/usr/local/go/src/net/http/server.go:1579 +0x4b7
created by net/http.(*Server).Serve
/usr/local/go/src/net/http/server.go:2293 +0x44d
2017-04-20T10:06:57.389-04:00 HTTP: #024: GET /db/_design/test_views/_view/filtered?stale=false (as seth)
2017-04-20T10:06:57.389-04:00 HTTP: JSON view "test_views"/"filtered" - opts map[stale:false]
Now digging into this error:
pytest -s --mode=cc --skip-provisioning -k "test_overloaded_channel_cache[sync_gateway_channel_cache-5000-*-True-50]" testsuites/syncgateway/functional/tests/
resulting in failure: https://gist.github.com/tleyden/6590e8976923756ff1a7633614669da3
If I reduce the number of users and docs in the functional test, it passes consistently.
After digging into this, it looks like the problem is that gocb has slightly different behavior with regard to opening concurrent view query connections to Couchbase Server, and doesn't do any throttling.
Against this branch I was seeing upwards of 900 connections from sync gw -> couchbase server on the views port:
[root@localhost sync_gateway]# netstat -an | grep -i 8092 | wc -l
902
whereas with the master branch, it never went above 100:
[root@localhost sync_gateway]# netstat -an | grep -i 8092 | wc -l
100
Planned fix: extend the 'MaxConcurrentBulkOps' technique used in bucket_gocb to a 'MaxConcurrentViewOps' setting, and use that to throttle concurrent view queries.
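A minimal sketch of that throttle, following the same counting-semaphore pattern as the bulk op limit in bucket_gocb (the type, field, and setting names here are illustrative, not the actual implementation):

// viewThrottle limits how many view queries run against the server at once.
// The buffered channel is used as a counting semaphore: sends acquire a slot,
// receives release it.
type viewThrottle struct {
	viewOpsSem chan struct{}
}

func newViewThrottle(maxConcurrentViewOps int) *viewThrottle {
	return &viewThrottle{viewOpsSem: make(chan struct{}, maxConcurrentViewOps)}
}

// doViewOp blocks until a slot is free, runs the view query, then releases the slot.
func (t *viewThrottle) doViewOp(op func() error) error {
	t.viewOpsSem <- struct{}{}
	defer func() { <-t.viewOpsSem }()
	return op()
}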
The above fix seems to have improved things, but now this job http://uberjenkins.sc.couchbase.com/job/cen7-sync-gateway-functional-tests-base-cc/499/artifact/results/report.html is failing due to:
if user_channels == "*" and num_docs == 5000:
# "*" channel includes _user docs so the verify_changes will result in 10 view queries
> assert resp_obj["syncGateway_changeCache"]["view_queries"] == 10
E KeyError: 'view_queries'
On master branch:
#####
T +0.008878 192.168.33.1:60480 -> 192.168.33.11:4984 [AP]
POST /db/_changes HTTP/1.1.
Host: 192.168.33.11:4984.
Connection: keep-alive.
Accept-Encoding: gzip, deflate.
Accept: */*.
User-Agent: python-requests/2.13.0.
Content-Type: application/json.
Authorization: Basic dXNlcl8wOnBhc3N3b3Jk.
Content-Length: 22.
.
{"include_docs": true}
##
T +0.065946 192.168.33.11:4984 -> 192.168.33.1:60480 [AP]
HTTP/1.1 200 OK.
Cache-Control: private, max-age=0, no-cache, no-store.
Content-Type: application/json.
Server: Couchbase Sync Gateway/1.4 branch/HEAD(nobranch) commit/d00b604.
Date: Mon, 24 Apr 2017 21:25:50 GMT.
Transfer-Encoding: chunked.
.
800.
{"results":[.
{"seq":1,"id":"_user/user_0","changes":[]}
,{"seq":1002,"id":"7c0ded56-308c-496a-8062-77223013fa88","doc":{"_id":"7c0ded56-308c-496a-8062-77223013fa88","_rev":"1-c3bcf352dc26b1790d78cfe030738faf","channels":["ABC"],"updates":0},"changes":[{"rev":"1-c3bcf352dc26b1790d78cfe030738faf"}]}
,{"seq":1003,"id":"738094a1-28a5-4322-9d8d-0dcf9b3018cc","doc":{"_id":"738094a1-28a5-4322-9d8d-0dcf9b3018cc","_rev":"1-c3bcf352dc26b1790d78cfe030738faf","channels":["ABC"],"updates":0},"changes":[{"rev":"1-c3bcf352dc26b1790d78cfe030738faf"}]}
test output:
|user_4| expected (num_docs: 5000 num_revisions: 0) _changes (num_docs: 5000 updates: 0)
-> |user_6| expected (num_docs: 5000 num_revisions: 0) _changes (num_docs: 5000 updates: 0)
-> |user_36| expected (num_docs: 5000 num_revisions: 0) _changes (num_docs: 5000 updates: 0)
-> |user_34| expected (num_docs: 5000 num_revisions: 0) _changes (num_docs: 5000 updates: 0)
-> |user_44| expected (num_docs: 5000 num_revisions: 0) _changes (num_docs: 5000 updates: 0)
-> |user_43| expected (num_docs: 5000 num_revisions: 0) _changes (num_docs: 5000 updates: 0)
-> |user_30| expected (num_docs: 5000 num_revisions: 0) _changes (num_docs: 5000 updates: 0)
-> |user_32| expected (num_docs: 5000 num_revisions: 0) _changes (num_docs: 5000 updates: 0)
syncGateway_changeCache: {u'maxPending': 10, u'lag-tap-0000ms': 4997, u'view_queries': 10, u'lag-total-9223372036800ms': 1001, u'lag-queue-0000ms': 6001, u'lag-total-0000ms': 4997, u'lag-total-0100ms': 3, u'lag-tap-0100ms': 3}
TEARDOWN F limit[50]
TEARDOWN F filter[True]
TEARDOWN F user_channels[*]
TEARDOWN F num_docs[5000]
TEARDOWN F sg_conf_name[sync_gateway_channel_cache]Tearing down test 'test_overloaded_channel_cache[sync_gateway_channel_cache-5000-*-True-50]'
----------------
Socket usage on mobile-testkit client ...
ESTABLISHED: 48
TIME_WAIT: 0
resources/cluster_configs/base_cc
cbs: [{'ip': u'192.168.33.10', 'name': u'cb1'}]
sgs: [{'ip': u'192.168.33.11', 'name': u'sg1'}]
acs: []
ssl: False
sync_gateway: http://192.168.33.11:4984, info: {"couchdb":"Welcome","vendor":{"name":"Couchbase Sync Gateway","version":1.4},"version":"Couchbase Sync Gateway/HEAD(nobranch)(d00b604)"}
TEARDOWN F params_from_base_test_setupTearing down 'params_from_base_suite_setup' ...
TEARDOWN S params_from_base_suite_setup
testsuites/syncgateway/functional/tests/test_overloaded_channel_cache.pytest_overloaded_channel_cache[sync_gateway_channel_cache-5000-*-True-50] ✓100% ██████████
----------------------------------- generated xml file: /Users/tleyden/Development/mobile-testkit/results/results.xml -----------------------------------
---------------------------------- generated html file: /Users/tleyden/Development/mobile-testkit/results/report.html -----------------------------------
================================================================= 127 tests deselected ==================================================================
Results (251.16s):
1 passed
127 deselected
During testing, ran into https://github.com/couchbaselabs/mobile-testkit/issues/1142
This test (and the rest of the bucket shadowing tests) is failing locally:
pytest -s -p no:sugar --mode=cc -k test_bucket_shadow_low_revs_limit --server-version=4.5.1 --sync-gateway-version=674d1a9482c61d606d77831cecd312ce1f77ed43 testsuites/syncgateway/functional/topology_specific_tests/
Even after adding code to the functional test to wait until the db is back online, it's failing with this error: https://gist.github.com/tleyden/56dac8fbba6e8d273210582f59679d51#file-gistfile1-txt-L1221
In the sg logs I see the non-shadower SG getting a DCP error and then temporarily going offline.
https://gist.github.com/tleyden/824d1dba70c5e58f51a8edf8d2409af1#file-gistfile1-txt-L36-L38
Find out more about which operation is failing (some transient error?).
for {
	// TODO: memory allocation here.
	atomic.AddUint64(&d.stats.TotWorkerReceive, 1)
	_, err := pkt.Receive(conn, hdr[:])
	if err != nil {
		atomic.AddUint64(&d.stats.TotWorkerReceiveErr, 1)
		d.receiver.OnError(fmt.Errorf("pkt.Receive, err: %v", err))
		return
	}
I switched the shadow bucket tests to use tap feeds, and now they are passing reliably.
I'm seeing 503 errors in test_sg_replicate_basic_test now:
2017-04-26T22:35:32.009-04:00 WARNING: Feed%!(EXTRA string=Error processing DCP stream: %v, *errors.errorString=pkt.Receive, err: read tcp 192.168.33.12:59982->192.168.33.10:11210: read: connection reset by peer) -- base.(*DCPReceiver).OnError() at dcp_feed.go:123
2017-04-26T22:35:32.009-04:00 WARNING: unknown lost Mutation (TAP/DCP) feed due to error: pkt.Receive, err: read tcp 192.168.33.12:59982->192.168.33.10:11210: read: connection reset by peer, taking offline -- db.NewDatabaseContext.func2() at database.go:216
2017-04-26T22:35:32.015-04:00 CRUD: Connection to Mutation (TAP/DCP) feed for db2 re-established, bringing DB back online
2017-04-26T22:35:32.304-04:00 WARNING: No users have been defined in the 'db2' database, which means that you will not be able to get useful data out of the sync gateway over the standard port. FIX: define users in the configuration json or via the REST API on the admin port, and grant users to channels via the admin_channels parameter. -- rest.emitAccessRelatedWarnings() at server_context.go:913
2017-04-26T22:35:32.304-04:00 Starting admin server on 0.0.0.0:4985
2017-04-26T22:35:32.307-04:00 Starting server on :4984 ...
2017-04-26T22:35:32.945-04:00 HTTP: #001: PUT /db2/_user/sg2_user (ADMIN)
2017-04-26T22:35:33.047-04:00 HTTP+: #001: --> 201 Created (101.4 ms)
2017-04-26T22:35:37.480-04:00 WARNING: Feed%!(EXTRA string=Error processing DCP stream: %v, *errors.errorString=pkt.Receive, err: EOF) -- base.(*DCPReceiver).OnError() at dcp_feed.go:123
2017-04-26T22:35:37.480-04:00 WARNING: unknown lost Mutation (TAP/DCP) feed due to error: pkt.Receive, err: EOF, taking offline -- db.NewDatabaseContext.func2() at database.go:216
2017-04-26T22:35:37.484-04:00 CRUD: Connection to Mutation (TAP/DCP) feed for db1 re-established, bringing DB back online
As an experiment I disabled the code that takes the bucket offline in response to DCP errors (4909d849677fd5b7cce0b0954f7265ee1a33ca48), and now test_sg_replicate_basic_test is passing more reliably.
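Roughly, the experiment amounts to having the DCP receiver's OnError callback only log the error instead of notifying the database to go offline. The type and field names below are illustrative, not the exact Sync Gateway code (assumes "log" is imported):

// dcpReceiver is a stand-in for the DCP receiver; notifyError would normally
// propagate the error up and take the database offline.
type dcpReceiver struct {
	notifyError func(err error)
}

func (r *dcpReceiver) OnError(err error) {
	log.Printf("WARNING: Error processing DCP stream: %v", err)
	// Experiment: skip r.notifyError(err) and rely on the feed's own
	// reconnect logic instead of taking the db offline.
}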
Looks like ignoring the DCP error callbacks might not be safe. After re-running that test a few times, I saw this interesting failure:
The docs are pushed to SG1 and the test waits for them to show up on the changes feed, but they never do, so the test fails. In the sg1 logs there are DCP error callbacks, which I suspect are causing it to miss the updates and preventing them from showing up on the changes feed.
Whatever the cause, the docs definitely don't show up on the changes feed (but they are in all_docs)
If I restart sync gateway, they do end up showing up in the changes feed:
[root@localhost sync_gateway]# systemctl restart sync_gateway
[root@localhost sync_gateway]# curl localhost:4985/db1/_changes
{"results":[
{"seq":3,"id":"e67cc07c6f713dc6186bd6b6514fef92","changes":[{"rev":"1-d1953847b126130d153fb7257d4aca48"}]}
,{"seq":4,"id":"204b6e92ca57c271b1a8622033c6c3db","changes":[{"rev":"1-cce809e4c2915e9a92073b7ea21ed3d7"}]}
],
"last_seq":"4"}
Current test being investigated:
./redeploy_local_sg_build.sh -b /Users/tleyden/Development/sync_gateway/godeps/bin/linux_amd64/ && pytest -s -p no:sugar --mode=cc -k test_sg_replicate_basic_test --skip-provisioning testsuites/syncgateway/functional/topology_specific_tests/
In the couchbase server error.log, I'm seeing:
[ns_server:error,2017-04-28T14:13:05.920-04:00,ns_1@192.168.33.10:ns_memcached-data-bucket-1<0.9449.21>:ns_memcached:ensure_bucket:1335]Unable to get config for bucket "data-bucket-1": {error,function_clause,
[{proplists,get_value,
[ram_quota,undefined,
undefined],
[{file,"proplists.erl"},
{line,225}]},
{ns_bucket,config_string,
1,
[{file,
"src/ns_bucket.erl"},
{line,98}]},
{ns_memcached,
ensure_bucket,2,
[{file,
"src/ns_memcached.erl"},
{line,1314}]},
{ns_memcached,handle_info,
2,
[{file,
"src/ns_memcached.erl"},
{line,765}]},
{gen_server,handle_msg,5,
[{file,"gen_server.erl"},
{line,604}]},
{ns_memcached,init,1,
[{file,
"src/ns_memcached.erl"},
{line,177}]},
{gen_server,init_it,6,
[{file,"gen_server.erl"},
{line,304}]},
{proc_lib,init_p_do_apply,
3,
[{file,"proc_lib.erl"},
{line,239}]}]}
[stats:error,2017-04-28T14:13:10.432-04:00,ns_1@192.168.33.10:<0.31046.19>:stats_reader:log_bad_responses:233]Some nodes didn't respond: ['ns_1@192.168.33.10']
[stats:error,2017-04-28T14:13:47.926-04:00,ns_1@192.168.33.10:<0.31046.19>:stats_reader:log_bad_responses:233]Some nodes didn't respond: ['ns_1@192.168.33.10']
and it happens just before the 2017-04-28 14:14:04.932535 I | CBDATASOURCE: Receiver goroutine error: EOF. call OnError() error.
Debugging ideas:
Notes from the latest run (sent to @mcpaddy for help analyzing the packet capture):
This appears to be the dcp stream that gets closed unexpectedly:
Notes:
According to @mcpaddy, tcpdump suggests the connection is being closed on the memcached side. To get memcached to output more verbose logs:
bin/mcctl set verbosity debug
bin/mcctl set verbosity detail
I tried bumping to the latest go-couchbase revision, but running into:
2017-05-02T18:47:48.217-04:00 DCP: OnError: MCResponse status=EACCESS, opcode=0x89, opaque=0, msg: No access.
2017-05-02T18:47:48.217-04:00 WARNING: Feed%!(EXTRA string=Error processing DCP stream: %v, *gomemcached.MCResponse=MCResponse status=EACCESS, opcode=0x89, opaque=0, msg: No access) -- base.(*DCPReceiver).OnError() at dcp_feed.go:125
which is a potential auth issue
Repro the 2-bucket cbdatasource issue:
Study why we apparently lose changes if we ignore OnError().
After the workaround for #2514, the following functional tests are still failing locally:
Investigating why the bucket shadowing functional tests are failing locally; it looks to be due to:
2017-05-04T06:33:52.322-04:00 WARNING: Database "db": unable to connect to external bucket for shadowing: Bucket Shadowing only works with the TAP feed type. Please add feed_type=TAP and retry. See https://github.com/couchbase/sync_gateway/issues/2502 -- rest.(*ServerContext)._getOrAddDatabaseFromConfig() at server_context.go:547
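For the functional test configs, the fix the warning is asking for is to pin the shadowed database to the TAP feed type. A minimal sketch of the kind of db config involved (the server/bucket values and surrounding keys are illustrative; only feed_type and shadow are the point here):

{
  "databases": {
    "db": {
      "server": "http://192.168.33.10:8091",
      "bucket": "data-bucket",
      "feed_type": "TAP",
      "shadow": {
        "server": "http://192.168.33.10:8091",
        "bucket": "source-bucket"
      }
    }
  }
}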
Investigating why the test_db_offline_tap_loss_sanity[bucket_online_offline/bucket_online_offline_default_dcp-100] test is failing (Couchbase Server 4.6.1).
In the SG accel logs I see these auth errors:
_time=2017-05-05T16:08:53.403-04:00 _level=INFO _msg=Using plain authentication for user data-bucket
2017/05/05 16:08:53 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: worker auth, server: 192.168.33.10:11210, user: data-bucket, err: MCResponse status=AUTH_ERROR, opcode=SASL_AUTH, opaque=0, msg: Auth failure
2017/05/05 16:08:55 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: HTTP error 401 Unauthorized getting "http://192.168.33.10:8091/pools":
2017/05/05 16:08:55 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: could not connect to any serverURL: []string{"http://192.168.33.10:8091"}
_time=2017-05-05T16:08:55.405-04:00 _level=INFO _msg=Using plain authentication for user data-bucket
2017/05/05 16:08:55 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: worker auth, server: 192.168.33.10:11210, user: data-bucket, err: MCResponse status=AUTH_ERROR, opcode=SASL_AUTH, opaque=0, msg: Auth failure
2017/05/05 16:08:57 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: HTTP error 401 Unauthorized getting "http://192.168.33.10:8091/pools":
2017/05/05 16:08:57 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: could not connect to any serverURL: []string{"http://192.168.33.10:8091"}
_time=2017-05-05T16:08:57.407-04:00 _level=INFO _msg=Using plain authentication for user data-bucket
2017/05/05 16:08:57 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: worker auth, server: 192.168.33.10:11210, user: data-bucket, err: MCResponse status=AUTH_ERROR, opcode=SASL_AUTH, opaque=0, msg: Auth failure
2017/05/05 16:08:59 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: HTTP error 401 Unauthorized getting "http://192.168.33.10:8091/pools":
2017/05/05 16:08:59 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: could not connect to any serverURL: []string{"http://192.168.33.10:8091"}
_time=2017-05-05T16:08:59.410-04:00 _level=INFO _msg=Using plain authentication for user data-bucket
2017/05/05 16:08:59 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: worker auth, server: 192.168.33.10:11210, user: data-bucket, err: MCResponse status=AUTH_ERROR, opcode=SASL_AUTH, opaque=0, msg: Auth failure
2017/05/05 16:09:01 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: HTTP error 401 Unauthorized getting "http://192.168.33.10:8091/pools":
2017/05/05 16:09:01 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: could not connect to any serverURL: []string{"http://192.168.33.10:8091"}
Full SGA logs: https://gist.github.com/tleyden/3b9c04d560a49a41cbb98bf7eef676db
In the SG logs I see these errors:
2017-05-05T16:03:02.798-04:00 WARNING: RetryLoop for Get _sync:role:radio_stations giving up after 11 attempts -- base.RetryLoop() at util.go:296
2017-05-05T16:03:02.798-04:00 HTTP: #815: --> 503 Database timeout error (gocb.ErrTimeout) (32632.8 ms)
2017-05-05T16:03:02.802-04:00 HTTP: #816: GET /db/_role/ (ADMIN)
2017-05-05T16:03:02.803-04:00 HTTP: #816: --> 500 Internal error: not_found - missing (1.2 ms)
2017-05-05T16:03:02.806-04:00 HTTP: #817: GET /db/_role/radio_stations (ADMIN)
More logs: https://gist.github.com/tleyden/0d639989d943ae3550731f0076f442dc
I think I found the issue in https://github.com/couchbase/sync_gateway/issues/2423#issuecomment-299980951:
2017-05-05T17:20:06.967-04:00 Enabling logging: [CRUD+ Cache+ HTTP+ Changes+]
2017-05-05T17:20:06.967-04:00 ==== Couchbase Sync Gateway/feature/remove_gocb_hybrid(85d5add+CHANGES) ====
2017-05-05T17:20:06.967-04:00 requestedSoftFDLimit >= currentHardFdLimit (90000 >= 65535) capping at 65535
2017-05-05T17:20:06.967-04:00 Configured process to allow 65535 open file descriptors
2017-05-05T17:20:06.967-04:00 Opening db /db as bucket "data-bucket", pool "default", server <http://192.168.33.10:8091>
2017-05-05T17:20:06.967-04:00 GoCBCustomSGTranscoder Opening Couchbase database data-bucket on <http://192.168.33.10:8091> as user "data-bucket"
2017-05-05T17:20:07.024-04:00 GoCB Opening Couchbase database index-bucket on <http://192.168.33.10:8091> as user "index-bucket"
2017-05-05T17:20:07.035-04:00 GoCB Opening Couchbase database index-bucket on <http://192.168.33.10:8091> as user "index-bucket"
2017-05-05T17:20:07.041-04:00 WARNING: Error type asserting base.Bucket -> CouchbaseBucket: baseBucket &{0xc420072f30 {http://192.168.33.10:8091 default data-bucket dcpshard 0xc4200b03c0 2 10 5 false} 0xc4200f7c20 0xc4200f7c80 0xc4200f7ce0} is not a CouchbaseBucket. Was type: *base.CouchbaseBucketGoCB. Skipping VerifyBucketSequenceParity(), which checks for invalid index bucket state -- base.VerifyBucketSequenceParity() at util.go:504
goroutine 1 [running]:
runtime/debug.Stack(0x7, 0xc420244c60, 0x15d)
/usr/local/go/src/runtime/debug/stack.go:24 +0x79
runtime/debug.PrintStack()
/usr/local/go/src/runtime/debug/stack.go:16 +0x22
github.com/couchbase/sync_gateway/base.VerifyBucketSequenceParity(0x1077be0, 0xc420296f00, 0x107a7c0, 0xc420072fc0, 0x0, 0x0)
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/base/util.go:505 +0x14e
github.com/couchbase/sync_gateway/db.(*kvChangeIndexReader).verifyBucketSequenceParity(0xc42024aa00, 0xc42010e900, 0x0, 0x0)
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/kv_change_index_reader.go:152 +0x6d
github.com/couchbase/sync_gateway/db.(*kvChangeIndexReader).Init(0xc42024aa00, 0xc420198300, 0xc4201796c0, 0xc4202b2930, 0xc4202b2940, 0xc42010e900, 0x10, 0xbac5e0)
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/kv_change_index_reader.go:86 +0x343
github.com/couchbase/sync_gateway/db.(*kvChangeIndex).Init(0xc420296570, 0xc42010e900, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/kv_change_index.go:53 +0x114
sg-accel was using GoCB (with the CustomSGTranscoder) to open the data-bucket, even though the feed type was set to DCPSHARD.
https://github.com/couchbase/sync_gateway/issues/2423#issuecomment-299980951 is fixed now.
New failing test:
# {"mode": "di", "suite":DEFAULT_SUITE, "testname":"test_db_online_offline_webhooks_offline_two[sync_gateway_webhook-5-1-1-2]"},
fails with logs:
I kicked off a functional test on jenkins to see if it's just a local setup issue:
http://uberjenkins.sc.couchbase.com/job/cen7-sync-gateway-functional-tests-base-di/491/
Failing sg-accel unit test:
./test.sh -run TestInterruptedBackfillWithWritesAndHistory
error:
fine users in the configuration json or via the REST API on the admin port, and grant users to channels via the admin_channels parameter. -- rest.emitAccessRelatedWarnings() at server_context.go:919
2017-05-05T18:58:59.428-04:00 Opening Walrus database sg_index_bucket on <walrus:>
2017-05-05 18:58:59.428216 I | it: {restTester:{_bucket:<nil> _sc:0xc420117400 noAdminParty:false distributedIndex:false syncFn:function(doc) {channel(doc.channel); access(doc.accessUser, doc.accessChannel);} cacheConfig:<nil>} mockSequenceGenerator:0xc420115c80 _indexBucket:<nil> accelContext:0xc420115d20 dbName:db serverName:walrus: numVbuckets:1024 numShards:16 incomingMockSgOps:0xc4200fd260 indexWriter:<nil>}
2017-05-05 18:58:59.428238 I | it serverContext: &{config:0xc42011cd80 databases_:map[db:0xc42011e900] lock:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0} statsTicker:<nil> HTTPClient:0x137e020 replicator:0xc420179d10}
2017-05-05 18:58:59.432874 I | updateAndFetchUser calling endpoint: /db/_user/david
2017-05-05 18:58:59.606414 I | getUser calling endpoint: /db/_user/david
2017-05-05T18:58:59.609-04:00 Initializing index partitions from bucket, found 16 partitions
2017-05-05T18:58:59.751-04:00 Changes+: Vector multi changes feed...
2017-05-05T18:58:59.751-04:00 Changes+: Vector MultiChangesFeed({*}, {Since:0 Limit:0 Conflicts:true IncludeDocs:false Wait:false Continuous:false Terminator:0xc420c38720 HeartbeatMs:0 TimeoutMs:5000 ActiveOnly:false}) ... (to david)
2017-05-05T18:58:59.752-04:00 Changes+: MultiChangesFeed: channels expand to "!:747.1" ... (to david)
2017-05-05T18:58:59.752-04:00 Changes+: GotChannelSince... !:747.1
2017-05-05T18:58:59.752-04:00 Changes+: Starting for channel... !, 747.1
2017-05-05T18:58:59.752-04:00 Changes+: Starting backfill for channel... !, Since:[0:0], TriggeredBy:[747:1]
2017-05-05T18:58:59.752-04:00 Changes+: [changesFeed] Found 0 non-backfill changes for channel !
2017-05-05T18:58:59.752-04:00 Changes+: MultiChangesFeed done (to david)
2017-05-05 18:58:59.752441 I | getChanges response: {"results":[
],
"last_seq":"0"}
index_writer_changes_test.go:2309
assert.Equals(t, len(results), 2)
expected: 2
got: 0
--- FAIL: TestInterruptedBackfillWithWritesAndHistory (0.36s)
assert.go:31:
panic: runtime error: index out of range [recovered]
panic: runtime error: index out of range
The GoCB hybrid bucket was basically a crutch to limit the scope of the transition.
However, certain things that require type assertions get complicated, and it's probably better to drop the gocb hybrid bucket.
For example, when dealing with error handling, you might not know which bucket was actually used for the call to Couchbase Server, so figuring out which kind of error you got gets complicated.
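A concrete illustration of that complication: a "key not found" check has to handle both SDKs' error types, because you may not know which one served the request. This is a sketch only; the helper name is hypothetical, and it assumes both gocb and gomemcached are imported:

// isKeyNotFoundErr returns true if err represents a missing key from either
// the gocb path or the go-couchbase/gomemcached path.
func isKeyNotFoundErr(err error) bool {
	if err == gocb.ErrKeyNotFound {
		return true
	}
	if mcErr, ok := err.(*gomemcached.MCResponse); ok && mcErr.Status == gomemcached.KEY_ENOENT {
		return true
	}
	return false
}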
From the last Jenkins runs, it looks like there is only one failing functional test at this point.
Failing functional tests
Base CC
Topology specific CC
Base DI
NOTE: test_continuous_changes_sanity[sync_gateway_default_functional_tests-10-10 failed in cen7-sync-gateway-functional-tests-base-di/496 but passed in a later rerun. More reruns are in progress: rerun2, rerun3.
Topology specific DI