couchbase / sync_gateway

Manages access and synchronization between Couchbase Lite and Couchbase Server
https://www.couchbase.com/products/sync-gateway

Implement all required methods in GoCBBucket and remove GoCB Hybrid bucket #2423

Closed tleyden closed 7 years ago

tleyden commented 7 years ago

The GoCB hybrid bucket was basically a crutch to limit the scope of the transition.

However, anything that requires type assertions on the bucket gets complicated, and it's probably better to drop the gocb hybrid bucket.

For example, during error handling you might not know which underlying bucket actually made the call to Couchbase Server, so figuring out which kind of error you're looking at gets complicated.
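
To illustrate, here is a hedged sketch (not code from the repo) of the kind of dual-path error check the hybrid bucket forces, assuming the usual error types surfaced by each client:

package base

import (
    "github.com/couchbase/gocb"
    "github.com/couchbase/gomemcached"
)

// isKeyNotFound sketches the problem: with a hybrid bucket, a "not found"
// error can surface as either client's error type, depending on which
// underlying connection actually served the call.
func isKeyNotFound(err error) bool {
    if err == gocb.ErrKeyNotFound { // gocb path
        return true
    }
    if mcr, ok := err.(*gomemcached.MCResponse); ok { // go-couchbase path
        return mcr.Status == gomemcached.KEY_ENOENT
    }
    return false
}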

From the last Jenkins runs, it looks like there is only one failing functional test at this point.

Failing functional tests

Base CC

Topology specific CC

Base DI

NOTE: test_continuous_changes_sanity[sync_gateway_default_functional_tests-10-10 failed in cen7-sync-gateway-functional-tests-base-di/496 but passed in a later rerun. More reruns in progress: rerun2 rerun3

Topology specific DI

tleyden commented 7 years ago

@adamcfraser after this change, should it be possible to remove this type assertion?

func (k *KvChangeIndexWriter) rollbackView(vbNo uint16, seq uint64) (channelNames []string, err error) {

    channelNames = make([]string, 0)
    startKey := []interface{}{vbNo, seq}
    endKey := []interface{}{vbNo, uint64(math.MaxUint64)}

    if goCbBucket, err := base.GetGoCBBucketFromBaseBucket(k.indexWriteBucket); err == nil {
        viewQuery := gocb.NewViewQuery("sg-accel", "rollback")
        viewQuery.Range(startKey, endKey, true)
        viewQuery.Stale(gocb.Before)
        results, err := goCbBucket.Bucket.ExecuteViewQuery(viewQuery)
        if err != nil {
            return channelNames, err
        }
        if results != nil {
            var vrow struct {
                Value string
            }
            for results.Next(&vrow) {
                channelNames = append(channelNames, vrow.Value)
                rollbackViewRows.Add(int64(1))
            }
            // Any error processing view results is returned on Close
            err := results.Close()
            if err != nil {
                return channelNames, err
            }
        }

    } else {
        // walrus/go-couchbase view retrieval
        var vres struct {
            Rows []struct {
                Value string
            }
        }
        opts := map[string]interface{}{"stale": false, "startkey": startKey, "endkey": endKey}
        err := k.indexWriteBucket.ViewCustom("sg-accel", "rollback", opts, &vres)
        if err != nil {
            return channelNames, err
        }
        for _, row := range vres.Rows {
            channelNames = append(channelNames, row.Value)
        }
        rollbackViewRows.Add(int64(len(vres.Rows)))
    }

    return channelNames, nil
}

and just call down to the ViewCustom() method in the bucket, even if it's a gocb bucket? I noticed this when looking at the places we do type assertions on buckets.

adamcfraser commented 7 years ago

@tleyden ViewCustom is currently a no-op on the gocb bucket. If I recall correctly, I looked at putting the above code into ViewCustom originally, but the method signatures weren't an ideal fit.
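
For reference, a rough sketch (illustrative only, not the actual implementation) of what backing ViewCustom() with gocb could look like; the awkward part is re-marshaling gocb's row iterator into the caller's arbitrarily-typed vres, which is presumably where the signatures clash:

package base

import (
    "encoding/json"

    "github.com/couchbase/gocb"
)

// viewCustomGoCB sketches adapting the generic ViewCustom signature to gocb:
// translate the options map into a gocb.ViewQuery, then round-trip the rows
// through JSON to populate the caller's vres struct.
func viewCustomGoCB(bucket *gocb.Bucket, ddoc, name string, opts map[string]interface{}, vres interface{}) error {
    viewQuery := gocb.NewViewQuery(ddoc, name)
    if stale, ok := opts["stale"].(bool); ok && !stale {
        viewQuery.Stale(gocb.Before)
    }
    if startKey, ok := opts["startkey"]; ok {
        viewQuery.Range(startKey, opts["endkey"], true)
    }

    results, err := bucket.ExecuteViewQuery(viewQuery)
    if err != nil {
        return err
    }
    var rows []json.RawMessage
    for {
        var row json.RawMessage // fresh buffer each iteration
        if !results.Next(&row) {
            break
        }
        rows = append(rows, row)
    }
    if err := results.Close(); err != nil { // row errors are returned on Close
        return err
    }
    // Re-marshal into the caller's struct, which expects {"rows": [...]}.
    data, err := json.Marshal(struct {
        Rows []json.RawMessage `json:"rows"`
    }{Rows: rows})
    if err != nil {
        return err
    }
    return json.Unmarshal(data, vres)
}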

tleyden commented 7 years ago

Does gocb have, or need, an equivalent of go-couchbase's Refresh()?

Or should I ask the SDK team?

tleyden commented 7 years ago

I'm thinking Refresh() can be a no-op in the gocb bucket -- currently it panics.

adamcfraser commented 7 years ago

@tleyden I think we're currently only using Refresh when trying to reconnect to a bucket that we've lost connection to. I think we need some way to do the equivalent in gocb, but we might be able to just do something manually on the SG side (drop and reconnect?).

I think it would be good to describe the use case to the SDK team and see what the recommended set of bucket operations would be.
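
For what it's worth, a minimal sketch of the drop-and-reconnect idea, using a hypothetical wrapper type (the fields are stand-ins for however CouchbaseBucketGoCB tracks its connection parameters):

package base

import (
    "sync"

    "github.com/couchbase/gocb"
)

// gocbBucketWrapper is a hypothetical stand-in for the SG bucket wrapper.
type gocbBucketWrapper struct {
    lock       sync.Mutex
    server     string // e.g. "http://192.168.33.10:8091"
    bucketName string
    password   string
    bucket     *gocb.Bucket
}

// Refresh drops and re-opens the gocb connection rather than refreshing
// in place, swapping the handle under a lock.
func (w *gocbBucketWrapper) Refresh() error {
    w.lock.Lock()
    defer w.lock.Unlock()

    _ = w.bucket.Close() // ignore the close error; the connection is being replaced

    cluster, err := gocb.Connect(w.server)
    if err != nil {
        return err
    }
    newBucket, err := cluster.OpenBucket(w.bucketName, w.password)
    if err != nil {
        return err
    }
    w.bucket = newBucket
    return nil
}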

tleyden commented 7 years ago

Posted a question to the gosdk forum

tleyden commented 7 years ago

Failing functional tests:

tleyden commented 7 years ago

Investigate this failure:

pytest -p no:sugar -s --skip-provisioning --mode=cc --server-version=4.6.1 --sync-gateway-version=9ef50d1e967882af2022d847dfb89d71cbce783d -k test_multiple_dbs_unique_buckets_lose_tap[bucket_online_offline/bucket_online_offline_multiple_dbs_unique_buckets-100] testsuites/syncgateway/functional/tests/

Why is this returning a 404 instead of a "db offline" response?

2017-04-17T18:22:31.371-04:00 HTTP:  #5100: GET /db1/  (ADMIN)
2017-04-17T18:22:31.371-04:00 HTTP: #5100:     --> 404 no such database "db1"  (0.1 ms)

tleyden commented 7 years ago

Sync Gateway logs when bucket deleted by test:

_time=2017-04-17T18:06:15.382-04:00 _level=INFO _msg= Trying with http://192.168.33.10:8091/pools/default/bucketsStreaming/data-bucket-1
_time=2017-04-17T18:06:15.385-04:00 _level=ERROR _msg=Failed to connect to host, unexpected status code: 401. Body
_time=2017-04-17T18:06:15.385-04:00 _level=INFO _msg= Trying with http://192.168.33.10:8091/pools/default/bucketsStreaming/data-bucket-1
_time=2017-04-17T18:06:15.387-04:00 _level=ERROR _msg=Failed to connect to host, unexpected status code: 401. Body
_time=2017-04-17T18:06:15.387-04:00 _level=INFO _msg= Trying with http://192.168.33.10:8091/pools/default/bucketsStreaming/data-bucket-1
_time=2017-04-17T18:06:15.388-04:00 _level=ERROR _msg=Failed to connect to host, unexpected status code: 401. Body
_time=2017-04-17T18:06:15.388-04:00 _level=INFO _msg= Trying with http://192.168.33.10:8091/pools/default/bucketsStreaming/data-bucket-1
_time=2017-04-17T18:06:15.390-04:00 _level=ERROR _msg=Failed to connect to host, unexpected status code: 401. Body
_time=2017-04-17T18:06:15.390-04:00 _level=ERROR _msg= Maximum failures reached. Exiting loop...
2017-04-17T18:06:15.390-04:00 WARNING: Bucket Updater for bucket data-bucket-1 returned error: Max failures reached. Last Error Failed to connect to host. Status 401 Body  -- base.GetCouchbaseBucket.func1() at bucket.go:476
2017-04-17T18:06:15.390-04:00 WARNING: Lost TAP feed for bucket data-bucket-1, with error: Max failures reached. Last Error Failed to connect to host. Status 401 Body  -- rest.(*ServerContext)._getOrAddDatabaseFromConfig.func1() at server_context.go:351
2017-04-17T18:06:15.390-04:00 CRUD: Connection to TAP feed for db1 re-established, bringing DB back online
_time=2017-04-17T18:06:15.521-04:00 _level=INFO _msg=go-couchbase: TAP connection lost; reconnecting to bucket "data-bucket-1" in 1s
_time=2017-04-17T18:06:16.440-04:00 _level=ERROR _msg= Unable to decode response EOF
_time=2017-04-17T18:06:16.440-04:00 _level=INFO _msg= Trying with http://192.168.33.10:8091/pools/default/bucketsStreaming/data-bucket-3
_time=2017-04-17T18:06:16.443-04:00 _level=ERROR _msg=Failed to connect to host, unexpected status code: 401. Body
_time=2017-04-17T18:06:16.443-04:00 _level=INFO _msg= Trying with http://192.168.33.10:8091/pools/default/bucketsStreaming/data-bucket-3
_time=2017-04-17T18:06:16.444-04:00 _level=ERROR _msg=Failed to connect to host, unexpected status code: 401. Body
_time=2017-04-17T18:06:16.444-04:00 _level=INFO _msg= Trying with http://192.168.33.10:8091/pools/default/bucketsStreaming/data-bucket-3
_time=2017-04-17T18:06:16.446-04:00 _level=ERROR _msg=Failed to connect to host, unexpected status code: 401. Body
_time=2017-04-17T18:06:16.446-04:00 _level=INFO _msg= Trying with http://192.168.33.10:8091/pools/default/bucketsStreaming/data-bucket-3
_time=2017-04-17T18:06:16.448-04:00 _level=ERROR _msg=Failed to connect to host, unexpected status code: 401. Body
_time=2017-04-17T18:06:16.448-04:00 _level=ERROR _msg= Maximum failures reached. Exiting loop...
2017-04-17T18:06:16.448-04:00 WARNING: Bucket Updater for bucket data-bucket-3 returned error: Max failures reached. Last Error Failed to connect to host. Status 401 Body  -- base.GetCouchbaseBucket.func1() at bucket.go:476
2017-04-17T18:06:16.448-04:00 WARNING: Lost TAP feed for bucket data-bucket-3, with error: Max failures reached. Last Error Failed to connect to host. Status 401 Body  -- rest.(*ServerContext)._getOrAddDatabaseFromConfig.func1() at server_context.go:351
2017-04-17T18:06:16.448-04:00 CRUD: Connection to TAP feed for db3 re-established, bringing DB back online
_time=2017-04-17T18:06:16.523-04:00 _level=INFO _msg=go-couchbase: TAP connection lost; reconnecting to bucket "data-bucket-1" in 2s
_time=2017-04-17T18:06:16.547-04:00 _level=INFO _msg=go-couchbase: TAP connection lost; reconnecting to bucket "data-bucket-3" in 1s
2017-04-17T18:06:17.480-04:00 HTTP:  #3257: PUT /db2/_role/radio_stations  (ADMIN)
2017-04-17T18:06:17.481-04:00 HTTP+: #3257:     --> 200 OK  (0.6 ms)
2017-04-17T18:06:17.484-04:00 HTTP:  #3258: GET /db2/_role/  (ADMIN)
_time=2017-04-17T18:06:17.550-04:00 _level=INFO _msg=go-couchbase: TAP connection lost; reconnecting to bucket "data-bucket-3" in 2s
tleyden commented 7 years ago

I'm also seeing view_undefined errors in the logs while running the above test:

2017-04-17T18:05:40.978-04:00 WARNING: Error trying to query ViewPrincipals: unknown_error - view_undefined -- rest.collectAccessRelatedWarnings() at server_context.go:915

and when I try to run Sync Gateway locally, I'm seeing it fail on startup, also due to view_undefined errors:

2017-04-17T17:39:32.600-07:00 Bucket: ViewCustom("sync_gateway", "access", ...) [24.074254ms]
2017-04-17T17:39:32.600-07:00 WARNING: channelComputer.ComputeChannelsForPrincipal failed on &{{foo   0  <nil>} { false [] <nil>   []} 0xc4202e0fa0 []}: unknown_error - view_undefined -- auth.(*Authenticator).rebuildChannels() at auth.go:156
2017-04-17T17:39:32.600-07:00 FATAL: Error opening database: Couldn't create user "foo": unknown_error - view_undefined -- rest.RunServer() at config.go:749
tleyden commented 7 years ago

Restarting Sync Gateway fixes the above error.

Logs: https://gist.github.com/tleyden/4e9c5e00b0c7e75bf96dc25b6975b083

I remember seeing this at one point and intending to go back to it.

tleyden commented 7 years ago

I haven't seen the view_undefined error again, even after attempts to reproduce it by creating an empty bucket and restarting Sync Gateway. Keeping an eye out for it for now.

tleyden commented 7 years ago

With these steps:

  1. Create db bucket
  2. Start SG (with config that uses db bucket)
  3. Delete db bucket

I'm getting a 404 when going to http://localhost:4985/db/, and seeing the following log output:

https://gist.github.com/tleyden/2ebb1d275c8d1cbdd63b891f3c3d7f3b

I will try comparing the behavior against the master branch to see where it's differing.

tleyden commented 7 years ago

On the master branch, after trying to repro with above steps, going to http://localhost:4985/db/ returns:

{
  "committed_update_seq": 0,
  "compact_running": false,
  "db_name": "db",
  "disk_format_version": 0,
  "instance_start_time": 1492482029507746,
  "purge_seq": 0,
  "state": "Offline",
  "update_seq": 0
}

Here are the full SG logs:

https://gist.github.com/tleyden/4e620ab67f6780eb04833bb172eff4c3

Since there are lots of calls to Refresh(), it's probably related to the fact that Refresh() is a no-op on the GoCBHybridBucket.

tleyden commented 7 years ago

Commit 455c97eb1b280b63677019ca76c60518dd522333 fixes the issue mentioned in https://github.com/couchbase/sync_gateway/issues/2423#issuecomment-294651245, but this test is still failing:

pytest -p no:sugar -s --skip-provisioning --mode=cc --server-version=4.6.1 --sync-gateway-version=9ef50d1e967882af2022d847dfb89d71cbce783d -k test_multiple_dbs_unique_buckets_lose_tap[bucket_online_offline/bucket_online_offline_multiple_dbs_unique_buckets-100] testsuites/syncgateway/functional/tests/

However, since go-couchbase is being used for the TAP feed, I think the hybrid bucket might still want to call down to the go-couchbase Refresh() rather than using the gocb bucket Refresh().

tleyden commented 7 years ago

In a551f976cb482b3d3167c2dccd891c4421d0d0f1 I switched back to the go-couchbase Refresh() implementation, and now the bucket_online_offline_multiple_dbs_unique_buckets-100 functional test passes.

tleyden commented 7 years ago

I'm running the functional tests against the latest commit, and seeing more failures: http://uberjenkins.sc.couchbase.com/job/cen7-sync-gateway-functional-tests-base-cc/470/console

This failed locally for me as well:

pytest -p no:sugar -s --skip-provisioning --mode=cc -k test_bucket_online_offline_resync_sanity[bucket_online_offline/db_online_offline_access_all-5-100-10] testsuites/syncgateway/functional/tests/

tleyden commented 7 years ago

Todo:

tleyden commented 7 years ago

Functional tests in progress:

http://uberjenkins.sc.couchbase.com/job/cen7-sync-gateway-functional-tests-base-cc/487/console

I see one failure:

testsuites/syncgateway/functional/tests/test_db_online_offline.py::test_multiple_dbs_unique_buckets_lose_tap[bucket_online_offline/bucket_online_offline_multiple_dbs_unique_buckets-100] (fixtures used: environment, num_docs, params_from_base_suite_setup, params_from_base_test_setup, sg_conf_name)FAILED

but running locally passed:

pytest  -s --mode=cc --skip-provisioning -k "test_multiple_dbs_unique_buckets_lose_tap[bucket_online_offline/bucket_online_offline_multiple_dbs_unique_buckets-100]" testsuites/syncgateway/functional/tests/

tleyden commented 7 years ago

SG Accel not starting up due to this error:

2017-04-20T19:04:36.920-07:00 FATAL: Error initializing SG Accel context: Error initializing CBGT: GoCB bucket does not expose UUID -- sg_accel.RunServer() at server.go:48

tleyden commented 7 years ago

This test is failing on jenkins and locally:

pytest  -s --mode=cc --skip-provisioning -k "test_user_views_sanity[user_views/user_views]" testsuites/syncgateway/functional/tests/

and digging into it, I see this panic in the sg logs:

2017-04-20 10:06:51.829211 I | http: panic serving 192.168.33.1:56195: interface conversion: interface is string, not bool
goroutine 480 [running]:
net/http.(*conn).serve.func1(0xc42030e300)
        /usr/local/go/src/net/http/server.go:1491 +0x12a
panic(0xad3880, 0xc42011eb00)
        /usr/local/go/src/runtime/panic.go:458 +0x243
github.com/couchbase/sync_gateway/base.applyViewQueryOptions(0xc42011eac0, 0xc420f1e690)
        /home/centos/godeps/src/github.com/couchbase/sync_gateway/base/bucket_gocb.go:1553 +0x1419
github.com/couchbase/sync_gateway/base.CouchbaseBucketGoCB.View(0xc42007f200, 0xc42010bc40, 0x19, 0xb91d43, 0x7, 0xc4200ffa40, 0xb, 0xc4200ffda3, 0x3, 0xf39900, ...)
        /home/centos/godeps/src/github.com/couchbase/sync_gateway/base/bucket_gocb.go:1360 +0x17d
github.com/couchbase/sync_gateway/base.(*CouchbaseBucketGoCB).View(0xc42007f290, 0xc42011e5d0, 0xa, 0xc42011e5e1, 0x8, 0xc420f1e690, 0x0, 0x0, 0x0, 0x0, ...)
        <autogenerated>:482 +0xfa
github.com/couchbase/sync_gateway/db.(*Database).QueryDesignDoc(0xc421f59620, 0xc42011e5d0, 0xa, 0xc42011e5e1, 0x8, 0xc420f1e690, 0x3, 0x0, 0x316a203800512ac8)
        /home/centos/godeps/src/github.com/couchbase/sync_gateway/db/design_doc.go:190 +0xff
github.com/couchbase/sync_gateway/rest.(*handler).handleView(0xc420d0aab0, 0xc421f59620, 0x0)
        /home/centos/godeps/src/github.com/couchbase/sync_gateway/rest/view_api.go:105 +0x594
github.com/couchbase/sync_gateway/rest.(*handler).invoke(0xc420d0aab0, 0xc4c5c8, 0x0, 0x0)
        /home/centos/godeps/src/github.com/couchbase/sync_gateway/rest/handler.go:221 +0x2c1
github.com/couchbase/sync_gateway/rest.makeHandler.func1(0xf3fd40, 0xc422166a90, 0xc423568d20)
        /home/centos/godeps/src/github.com/couchbase/sync_gateway/rest/handler.go:96 +0x80
net/http.HandlerFunc.ServeHTTP(0xc4203ff9a0, 0xf3fd40, 0xc422166a90, 0xc423568d20)
        /usr/local/go/src/net/http/server.go:1726 +0x44
github.com/gorilla/mux.(*Router).ServeHTTP(0xc4203082d0, 0xf3fd40, 0xc422166a90, 0xc423568d20)
        /home/centos/godeps/src/github.com/gorilla/mux/mux.go:86 +0x204
github.com/couchbase/sync_gateway/rest.wrapRouter.func1(0xf3fd40, 0xc422166a90, 0xc423568d20)
        /home/centos/godeps/src/github.com/couchbase/sync_gateway/rest/routing.go:297 +0x66f
net/http.HandlerFunc.ServeHTTP(0xc4203368a0, 0xf3fd40, 0xc422166a90, 0xc423568d20)
        /usr/local/go/src/net/http/server.go:1726 +0x44
net/http.serverHandler.ServeHTTP(0xc4203a7980, 0xf3fd40, 0xc422166a90, 0xc423568d20)
        /usr/local/go/src/net/http/server.go:2202 +0x7d
net/http.(*conn).serve(0xc42030e300, 0xf40900, 0xc42011e580)
        /usr/local/go/src/net/http/server.go:1579 +0x4b7
created by net/http.(*Server).Serve
        /usr/local/go/src/net/http/server.go:2293 +0x44d
2017-04-20T10:06:57.389-04:00 HTTP:  #024: GET /db/_design/test_views/_view/filtered?stale=false  (as seth)
2017-04-20T10:06:57.389-04:00 HTTP: JSON view "test_views"/"filtered" - opts map[stale:false]
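
The panic points at applyViewQueryOptions asserting the "stale" option straight to bool, while the REST view API can deliver it as a string (note the final log line: opts map[stale:false], parsed from the query string). A sketch of tolerant handling, assuming both shapes can arrive:

package base

import "github.com/couchbase/gocb"

// applyStaleViewOption sketches tolerant handling of the "stale" view option,
// which may arrive as a bool from internal callers or as a string parsed from
// REST query parameters.
func applyStaleViewOption(viewQuery *gocb.ViewQuery, value interface{}) {
    switch stale := value.(type) {
    case bool:
        if !stale {
            viewQuery.Stale(gocb.Before) // stale=false: update index before querying
        }
    case string:
        switch stale {
        case "false":
            viewQuery.Stale(gocb.Before)
        case "update_after":
            viewQuery.Stale(gocb.After) // update index after returning results
        }
    }
}
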
tleyden commented 7 years ago

Now digging into this error:

pytest -s --mode=cc --skip-provisioning -k "test_overloaded_channel_cache[sync_gateway_channel_cache-5000-*-True-50]" testsuites/syncgateway/functional/tests/

resulting in failure: https://gist.github.com/tleyden/6590e8976923756ff1a7633614669da3

tleyden commented 7 years ago

If I reduce the number of users and docs in the functional test, it passes consistently.

tleyden commented 7 years ago

After digging into this, it looks like the problem is that gocb behaves slightly differently with regard to opening concurrent view query connections to Couchbase Server, and doesn't do any throttling.

Against this branch I was seeing upwards of 900 connections from Sync Gateway to Couchbase Server on the views port (8092):

[root@localhost sync_gateway]# netstat -an | grep -i 8092 | wc -l
902

whereas with the master branch, it never went above 100:

[root@localhost sync_gateway]# netstat -an | grep -i 8092 | wc -l
100

Planned fix: extend the 'MaxConcurrentBulkOps' technique used in bucket_gocb to a 'MaxConcurrentViewOps' setting, and use that to throttle view queries, as sketched below.
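
The throttling pattern here is a buffered-channel semaphore; a minimal sketch of the proposed MaxConcurrentViewOps variant (the value and the wiring are illustrative):

package base

// MaxConcurrentViewOps caps in-flight view queries, by analogy with the
// existing MaxConcurrentBulkOps setting (the value here is illustrative).
const MaxConcurrentViewOps = 100

// viewOps is a buffered-channel semaphore: a send blocks once all
// MaxConcurrentViewOps slots are taken.
var viewOps = make(chan struct{}, MaxConcurrentViewOps)

// withViewThrottle runs fn while holding a view-op slot, so at most
// MaxConcurrentViewOps view operations run concurrently.
func withViewThrottle(fn func() error) error {
    viewOps <- struct{}{}        // acquire a slot (blocks at the cap)
    defer func() { <-viewOps }() // release the slot when fn returns
    return fn()
}

Wrapping both ExecuteViewQuery and the subsequent row iteration inside fn keeps the number of simultaneous connections to port 8092 bounded by the cap.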

tleyden commented 7 years ago

The above fix seems to have improved things, but now this job http://uberjenkins.sc.couchbase.com/job/cen7-sync-gateway-functional-tests-base-cc/499/artifact/results/report.html is failing due to:

            if user_channels == "*" and num_docs == 5000:
                # "*" channel includes _user docs so the verify_changes will result in 10 view queries
>               assert resp_obj["syncGateway_changeCache"]["view_queries"] == 10
E               KeyError: 'view_queries'

tleyden commented 7 years ago

On master branch:

#####
T +0.008878 192.168.33.1:60480 -> 192.168.33.11:4984 [AP]
POST /db/_changes HTTP/1.1.
Host: 192.168.33.11:4984.
Connection: keep-alive.
Accept-Encoding: gzip, deflate.
Accept: */*.
User-Agent: python-requests/2.13.0.
Content-Type: application/json.
Authorization: Basic dXNlcl8wOnBhc3N3b3Jk.
Content-Length: 22.
.
{"include_docs": true}
##
T +0.065946 192.168.33.11:4984 -> 192.168.33.1:60480 [AP]
HTTP/1.1 200 OK.
Cache-Control: private, max-age=0, no-cache, no-store.
Content-Type: application/json.
Server: Couchbase Sync Gateway/1.4 branch/HEAD(nobranch) commit/d00b604.
Date: Mon, 24 Apr 2017 21:25:50 GMT.
Transfer-Encoding: chunked.
.
800.
{"results":[.
{"seq":1,"id":"_user/user_0","changes":[]}
,{"seq":1002,"id":"7c0ded56-308c-496a-8062-77223013fa88","doc":{"_id":"7c0ded56-308c-496a-8062-77223013fa88","_rev":"1-c3bcf352dc26b1790d78cfe030738faf","channels":["ABC"],"updates":0},"changes":[{"rev":"1-c3bcf352dc26b1790d78cfe030738faf"}]}
,{"seq":1003,"id":"738094a1-28a5-4322-9d8d-0dcf9b3018cc","doc":{"_id":"738094a1-28a5-4322-9d8d-0dcf9b3018cc","_rev":"1-c3bcf352dc26b1790d78cfe030738faf","channels":["ABC"],"updates":0},"changes":[{"rev":"1-c3bcf352dc26b1790d78cfe030738faf"}]}

test output:

 |user_4| expected (num_docs: 5000 num_revisions: 0) _changes (num_docs: 5000 updates: 0)
 -> |user_6| expected (num_docs: 5000 num_revisions: 0) _changes (num_docs: 5000 updates: 0)
 -> |user_36| expected (num_docs: 5000 num_revisions: 0) _changes (num_docs: 5000 updates: 0)
 -> |user_34| expected (num_docs: 5000 num_revisions: 0) _changes (num_docs: 5000 updates: 0)
 -> |user_44| expected (num_docs: 5000 num_revisions: 0) _changes (num_docs: 5000 updates: 0)
 -> |user_43| expected (num_docs: 5000 num_revisions: 0) _changes (num_docs: 5000 updates: 0)
 -> |user_30| expected (num_docs: 5000 num_revisions: 0) _changes (num_docs: 5000 updates: 0)
 -> |user_32| expected (num_docs: 5000 num_revisions: 0) _changes (num_docs: 5000 updates: 0)
syncGateway_changeCache: {u'maxPending': 10, u'lag-tap-0000ms': 4997, u'view_queries': 10, u'lag-total-9223372036800ms': 1001, u'lag-queue-0000ms': 6001, u'lag-total-0000ms': 4997, u'lag-total-0100ms': 3, u'lag-tap-0100ms': 3}

      TEARDOWN F limit[50]
      TEARDOWN F filter[True]
      TEARDOWN F user_channels[*]
      TEARDOWN F num_docs[5000]
      TEARDOWN F sg_conf_name[sync_gateway_channel_cache]Tearing down test 'test_overloaded_channel_cache[sync_gateway_channel_cache-5000-*-True-50]'
----------------
Socket usage on mobile-testkit client ...
ESTABLISHED: 48
TIME_WAIT: 0

resources/cluster_configs/base_cc
cbs: [{'ip': u'192.168.33.10', 'name': u'cb1'}]
sgs: [{'ip': u'192.168.33.11', 'name': u'sg1'}]
acs: []
ssl: False
sync_gateway: http://192.168.33.11:4984, info: {"couchdb":"Welcome","vendor":{"name":"Couchbase Sync Gateway","version":1.4},"version":"Couchbase Sync Gateway/HEAD(nobranch)(d00b604)"}

      TEARDOWN F params_from_base_test_setupTearing down 'params_from_base_suite_setup' ...

TEARDOWN S params_from_base_suite_setup
 testsuites/syncgateway/functional/tests/test_overloaded_channel_cache.pytest_overloaded_channel_cache[sync_gateway_channel_cache-5000-*-True-50] ✓100% ██████████
----------------------------------- generated xml file: /Users/tleyden/Development/mobile-testkit/results/results.xml -----------------------------------
---------------------------------- generated html file: /Users/tleyden/Development/mobile-testkit/results/report.html -----------------------------------
================================================================= 127 tests deselected ==================================================================

Results (251.16s):
       1 passed
     127 deselected

tleyden commented 7 years ago

During testing, ran into https://github.com/couchbaselabs/mobile-testkit/issues/1142

tleyden commented 7 years ago

This test (and the rest of the bucket shadowing tests) is failing locally:

pytest -s -p no:sugar --mode=cc -k test_bucket_shadow_low_revs_limit --server-version=4.5.1 --sync-gateway-version=674d1a9482c61d606d77831cecd312ce1f77ed43 testsuites/syncgateway/functional/topology_specific_tests/

tleyden commented 7 years ago

Even after adding code to the functional test to wait until the db is back online, it's failing with this error: https://gist.github.com/tleyden/56dac8fbba6e8d273210582f59679d51#file-gistfile1-txt-L1221

In the sg logs I see the non-shadower SG getting a DCP error and then temporarily going offline.

https://gist.github.com/tleyden/824d1dba70c5e58f51a8edf8d2409af1#file-gistfile1-txt-L36-L38

tleyden commented 7 years ago

Find out more info about which operation is failing (some transient error?):

        for {
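            // (Quoted from the cbdatasource worker receive loop: each
            // iteration counts TotWorkerReceive, and any pkt.Receive error
            // bumps TotWorkerReceiveErr, fires receiver.OnError, and exits
            // the goroutine.)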
            // TODO: memory allocation here.
            atomic.AddUint64(&d.stats.TotWorkerReceive, 1)
            _, err := pkt.Receive(conn, hdr[:])
            if err != nil {
                atomic.AddUint64(&d.stats.TotWorkerReceiveErr, 1)
                d.receiver.OnError(fmt.Errorf("pkt.Receive, err: %v", err))
                return
            }

tleyden commented 7 years ago

I switched the shadow bucket tests to use tap feeds, and now they are passing reliably.

I'm seeing 503 errors in test_sg_replicate_basic_test now:

2017-04-26T22:35:32.009-04:00 WARNING: Feed%!(EXTRA string=Error processing DCP stream: %v, *errors.errorString=pkt.Receive, err: read tcp 192.168.33.12:59982->192.168.33.10:11210: read: connection reset by peer) -- base.(*DCPReceiver).OnError() at dcp_feed.go:123
2017-04-26T22:35:32.009-04:00 WARNING: unknown lost Mutation (TAP/DCP) feed due to error: pkt.Receive, err: read tcp 192.168.33.12:59982->192.168.33.10:11210: read: connection reset by peer, taking offline -- db.NewDatabaseContext.func2() at database.go:216
2017-04-26T22:35:32.015-04:00 CRUD: Connection to Mutation (TAP/DCP) feed for db2 re-established, bringing DB back online
2017-04-26T22:35:32.304-04:00 WARNING: No users have been defined in the 'db2' database, which means that you will not be able to get useful data out of the sync gateway over the standard port.  FIX: define users in the configuration json or via the REST API on the admin port, and grant users to channels via the admin_channels parameter. -- rest.emitAccessRelatedWarnings() at server_context.go:913
2017-04-26T22:35:32.304-04:00 Starting admin server on 0.0.0.0:4985
2017-04-26T22:35:32.307-04:00 Starting server on :4984 ...
2017-04-26T22:35:32.945-04:00 HTTP:  #001: PUT /db2/_user/sg2_user  (ADMIN)
2017-04-26T22:35:33.047-04:00 HTTP+: #001:     --> 201 Created  (101.4 ms)
2017-04-26T22:35:37.480-04:00 WARNING: Feed%!(EXTRA string=Error processing DCP stream: %v, *errors.errorString=pkt.Receive, err: EOF) -- base.(*DCPReceiver).OnError() at dcp_feed.go:123
2017-04-26T22:35:37.480-04:00 WARNING: unknown lost Mutation (TAP/DCP) feed due to error: pkt.Receive, err: EOF, taking offline -- db.NewDatabaseContext.func2() at database.go:216
2017-04-26T22:35:37.484-04:00 CRUD: Connection to Mutation (TAP/DCP) feed for db1 re-established, bringing DB back online

tleyden commented 7 years ago

As an experiment I disabled the code that takes the bucket offline in response to DCP errors (4909d849677fd5b7cce0b0954f7265ee1a33ca48), and now test_sg_replicate_basic_test is passing more reliably.

tleyden commented 7 years ago

Looks like ignoring the DCP error callbacks might not be safe. After re-running that test a few times, I saw this interesting failure:

The doc is pushed to sg1, and the test waits for it to show up on the changes feed, but it never does, so the test fails. The sg1 logs contain DCP error callbacks, which I suspect cause the update to be missed and prevent it from showing up on the changes feed.

Whatever the cause, the docs definitely don't show up on the changes feed (but they are in all_docs)

If I restart sync gateway, they do end up showing up in the changes feed:

[root@localhost sync_gateway]# systemctl restart sync_gateway
[root@localhost sync_gateway]# curl localhost:4985/db1/_changes
{"results":[
{"seq":3,"id":"e67cc07c6f713dc6186bd6b6514fef92","changes":[{"rev":"1-d1953847b126130d153fb7257d4aca48"}]}
,{"seq":4,"id":"204b6e92ca57c271b1a8622033c6c3db","changes":[{"rev":"1-cce809e4c2915e9a92073b7ea21ed3d7"}]}
],
"last_seq":"4"}

tleyden commented 7 years ago

Current test being investigated:

./redeploy_local_sg_build.sh -b /Users/tleyden/Development/sync_gateway/godeps/bin/linux_amd64/ && pytest -s -p no:sugar --mode=cc -k test_sg_replicate_basic_test --skip-provisioning  testsuites/syncgateway/functional/topology_specific_tests/

tleyden commented 7 years ago

In the Couchbase Server error.log, I'm seeing:

[ns_server:error,2017-04-28T14:13:05.920-04:00,ns_1@192.168.33.10:ns_memcached-data-bucket-1<0.9449.21>:ns_memcached:ensure_bucket:1335]Unable to get config for bucket "data-bucket-1": {error,function_clause,
                                                  [{proplists,get_value,
                                                    [ram_quota,undefined,
                                                     undefined],
                                                    [{file,"proplists.erl"},
                                                     {line,225}]},
                                                   {ns_bucket,config_string,
                                                    1,
                                                    [{file,
                                                      "src/ns_bucket.erl"},
                                                     {line,98}]},
                                                   {ns_memcached,
                                                    ensure_bucket,2,
                                                    [{file,
                                                      "src/ns_memcached.erl"},
                                                     {line,1314}]},
                                                   {ns_memcached,handle_info,
                                                    2,
                                                    [{file,
                                                      "src/ns_memcached.erl"},
                                                     {line,765}]},
                                                   {gen_server,handle_msg,5,
                                                    [{file,"gen_server.erl"},
                                                     {line,604}]},
                                                   {ns_memcached,init,1,
                                                    [{file,
                                                      "src/ns_memcached.erl"},
                                                     {line,177}]},
                                                   {gen_server,init_it,6,
                                                    [{file,"gen_server.erl"},
                                                     {line,304}]},
                                                   {proc_lib,init_p_do_apply,
                                                    3,
                                                    [{file,"proc_lib.erl"},
                                                     {line,239}]}]}
[stats:error,2017-04-28T14:13:10.432-04:00,ns_1@192.168.33.10:<0.31046.19>:stats_reader:log_bad_responses:233]Some nodes didn't respond: ['ns_1@192.168.33.10']
[stats:error,2017-04-28T14:13:47.926-04:00,ns_1@192.168.33.10:<0.31046.19>:stats_reader:log_bad_responses:233]Some nodes didn't respond: ['ns_1@192.168.33.10']

and it happens just before the "CBDATASOURCE: Receiver goroutine error: EOF. call OnError()" error at 2017-04-28 14:14:04.932535.

tleyden commented 7 years ago

Debugging ideas:

tleyden commented 7 years ago

Notes from the latest run (sent to @mcpaddy for help analyzing the packet capture):

This appears to be the dcp stream that gets closed unexpectedly:

[screenshot: 2017-05-02 at 2:18 PM]

Notes:

tleyden commented 7 years ago

According to @mcpaddy, tcpdump suggests the connection is being closed on the memcached side. To get memcached to output more verbose logs:

tleyden commented 7 years ago

cb collect info with ./mcctl set verbosity debug

cb collect info with ./mcctl set verbosity detail

tleyden commented 7 years ago

I tried bumping to the latest go-couchbase revision, but ran into:

2017-05-02T18:47:48.217-04:00 DCP: OnError: MCResponse status=EACCESS, opcode=0x89, opaque=0, msg: No access.
2017-05-02T18:47:48.217-04:00 WARNING: Feed%!(EXTRA string=Error processing DCP stream: %v, *gomemcached.MCResponse=MCResponse status=EACCESS, opcode=0x89, opaque=0, msg: No access) -- base.(*DCPReceiver).OnError() at dcp_feed.go:125

which looks like a potential auth issue.

tleyden commented 7 years ago

Repro for the 2-bucket cbdatasource issue:

tleyden commented 7 years ago

Study why we apparently lose changes if we ignore OnError().

tleyden commented 7 years ago

re: https://github.com/couchbase/sync_gateway/issues/2423#issuecomment-297596487 -- I filed this in https://github.com/couchbase/sync_gateway/issues/2514

tleyden commented 7 years ago

After the workaround for #2514, the following functional tests are still failing locally:

tleyden commented 7 years ago

Investigated why the bucket shadowing functional tests are failing locally; it looks to be due to:

2017-05-04T06:33:52.322-04:00 WARNING: Database "db": unable to connect to external bucket for shadowing: Bucket Shadowing only works with the TAP feed type. Please add feed_type=TAP and retry.  See https://github.com/couchbase/sync_gateway/issues/2502 -- rest.(*ServerContext)._getOrAddDatabaseFromConfig() at server_context.go:547
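
Per that error message, the config-side fix is to pin the feed type. A sketch of the relevant database config stanza, assuming the usual Sync Gateway config shape (bucket names and URLs are illustrative):

{
  "databases": {
    "db": {
      "server": "http://192.168.33.10:8091",
      "bucket": "data-bucket",
      "feed_type": "TAP",
      "shadow": {
        "server": "http://192.168.33.10:8091",
        "bucket": "shadow-bucket"
      }
    }
  }
}
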
tleyden commented 7 years ago

Investigating why the test_db_offline_tap_loss_sanity[bucket_online_offline/bucket_online_offline_default_dcp-100] test is failing (Couchbase Server 4.6.1).

In the SG accel logs I see these auth errors:

_time=2017-05-05T16:08:53.403-04:00 _level=INFO _msg=Using plain authentication for user data-bucket
2017/05/05 16:08:53 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: worker auth, server: 192.168.33.10:11210, user: data-bucket, err: MCResponse status=AUTH_ERROR, opcode=SASL_AUTH, opaque=0, msg: Auth failure
2017/05/05 16:08:55 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: HTTP error 401 Unauthorized getting "http://192.168.33.10:8091/pools":
2017/05/05 16:08:55 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: could not connect to any serverURL: []string{"http://192.168.33.10:8091"}
_time=2017-05-05T16:08:55.405-04:00 _level=INFO _msg=Using plain authentication for user data-bucket
2017/05/05 16:08:55 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: worker auth, server: 192.168.33.10:11210, user: data-bucket, err: MCResponse status=AUTH_ERROR, opcode=SASL_AUTH, opaque=0, msg: Auth failure
2017/05/05 16:08:57 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: HTTP error 401 Unauthorized getting "http://192.168.33.10:8091/pools":
2017/05/05 16:08:57 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: could not connect to any serverURL: []string{"http://192.168.33.10:8091"}
_time=2017-05-05T16:08:57.407-04:00 _level=INFO _msg=Using plain authentication for user data-bucket
2017/05/05 16:08:57 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: worker auth, server: 192.168.33.10:11210, user: data-bucket, err: MCResponse status=AUTH_ERROR, opcode=SASL_AUTH, opaque=0, msg: Auth failure
2017/05/05 16:08:59 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: HTTP error 401 Unauthorized getting "http://192.168.33.10:8091/pools":
2017/05/05 16:08:59 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: could not connect to any serverURL: []string{"http://192.168.33.10:8091"}
_time=2017-05-05T16:08:59.410-04:00 _level=INFO _msg=Using plain authentication for user data-bucket
2017/05/05 16:08:59 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: worker auth, server: 192.168.33.10:11210, user: data-bucket, err: MCResponse status=AUTH_ERROR, opcode=SASL_AUTH, opaque=0, msg: Auth failure
2017/05/05 16:09:01 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: HTTP error 401 Unauthorized getting "http://192.168.33.10:8091/pools":
2017/05/05 16:09:01 feed_dcp: OnError, name: data-bucketd5c7e61834375af11203256e50a83aaa_60ed5456e66ad70a: bucketName: data-bucket, bucketUUID: d5c7e61834375af11203256e50a83aaa, err: could not connect to any serverURL: []string{"http://192.168.33.10:8091"}

Full SGA logs: https://gist.github.com/tleyden/3b9c04d560a49a41cbb98bf7eef676db

In the SG logs I see these errors:

2017-05-05T16:03:02.798-04:00 WARNING: RetryLoop for Get _sync:role:radio_stations giving up after 11 attempts -- base.RetryLoop() at util.go:296
2017-05-05T16:03:02.798-04:00 HTTP: #815:     --> 503 Database timeout error (gocb.ErrTimeout)  (32632.8 ms)
2017-05-05T16:03:02.802-04:00 HTTP:  #816: GET /db/_role/  (ADMIN)
2017-05-05T16:03:02.803-04:00 HTTP: #816:     --> 500 Internal error: not_found - missing  (1.2 ms)
2017-05-05T16:03:02.806-04:00 HTTP:  #817: GET /db/_role/radio_stations  (ADMIN)

More logs: https://gist.github.com/tleyden/0d639989d943ae3550731f0076f442dc

tleyden commented 7 years ago

I think I found the issue in https://github.com/couchbase/sync_gateway/issues/2423#issuecomment-299980951:

2017-05-05T17:20:06.967-04:00 Enabling logging: [CRUD+ Cache+ HTTP+ Changes+]
2017-05-05T17:20:06.967-04:00 ==== Couchbase Sync Gateway/feature/remove_gocb_hybrid(85d5add+CHANGES) ====
2017-05-05T17:20:06.967-04:00 requestedSoftFDLimit >= currentHardFdLimit (90000 >= 65535) capping at 65535
2017-05-05T17:20:06.967-04:00 Configured process to allow 65535 open file descriptors
2017-05-05T17:20:06.967-04:00 Opening db /db as bucket "data-bucket", pool "default", server <http://192.168.33.10:8091>
2017-05-05T17:20:06.967-04:00 GoCBCustomSGTranscoder Opening Couchbase database data-bucket on <http://192.168.33.10:8091> as user "data-bucket"
2017-05-05T17:20:07.024-04:00 GoCB Opening Couchbase database index-bucket on <http://192.168.33.10:8091> as user "index-bucket"
2017-05-05T17:20:07.035-04:00 GoCB Opening Couchbase database index-bucket on <http://192.168.33.10:8091> as user "index-bucket"
2017-05-05T17:20:07.041-04:00 WARNING: Error type asserting base.Bucket -> CouchbaseBucket: baseBucket &{0xc420072f30 {http://192.168.33.10:8091 default data-bucket dcpshard 0xc4200b03c0 2 10 5 false} 0xc4200f7c20 0xc4200f7c80 0xc4200f7ce0} is not a CouchbaseBucket.  Was type: *base.CouchbaseBucketGoCB.  Skipping VerifyBucketSequenceParity(), which checks for invalid index bucket state -- base.VerifyBucketSequenceParity() at util.go:504
goroutine 1 [running]:
runtime/debug.Stack(0x7, 0xc420244c60, 0x15d)
    /usr/local/go/src/runtime/debug/stack.go:24 +0x79
runtime/debug.PrintStack()
    /usr/local/go/src/runtime/debug/stack.go:16 +0x22
github.com/couchbase/sync_gateway/base.VerifyBucketSequenceParity(0x1077be0, 0xc420296f00, 0x107a7c0, 0xc420072fc0, 0x0, 0x0)
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/base/util.go:505 +0x14e
github.com/couchbase/sync_gateway/db.(*kvChangeIndexReader).verifyBucketSequenceParity(0xc42024aa00, 0xc42010e900, 0x0, 0x0)
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/kv_change_index_reader.go:152 +0x6d
github.com/couchbase/sync_gateway/db.(*kvChangeIndexReader).Init(0xc42024aa00, 0xc420198300, 0xc4201796c0, 0xc4202b2930, 0xc4202b2940, 0xc42010e900, 0x10, 0xbac5e0)
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/kv_change_index_reader.go:86 +0x343
github.com/couchbase/sync_gateway/db.(*kvChangeIndex).Init(0xc420296570, 0xc42010e900, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/kv_change_index.go:53 +0x114

sg-accel was using GoCB (with CustomSGTranscoder) to open the data-bucket, even though the feed type was set to DCPSHARD.
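
A hedged sketch of the implied fix: select the client based on the configured feed type, since the sharded DCP (CBGT) path still depends on go-couchbase. All names below are illustrative, not the actual code:

package base

import "strings"

// CouchbaseDriver and its values are illustrative stand-ins; the real
// codebase chooses between the go-couchbase and gocb clients when a
// bucket is opened.
type CouchbaseDriver int

const (
    GoCouchbase CouchbaseDriver = iota // legacy go-couchbase client
    GoCB                               // gocb client
)

// chooseBucketDriver keys the client off the feed type: TAP and sharded
// DCP (DCPSHARD) feeds still depend on go-couchbase.
func chooseBucketDriver(feedType string) CouchbaseDriver {
    switch strings.ToLower(feedType) {
    case "tap", "dcpshard":
        return GoCouchbase
    default:
        return GoCB
    }
}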

tleyden commented 7 years ago

https://github.com/couchbase/sync_gateway/issues/2423#issuecomment-299980951 is fixed now.

New failing test:

    # {"mode": "di", "suite":DEFAULT_SUITE, "testname":"test_db_online_offline_webhooks_offline_two[sync_gateway_webhook-5-1-1-2]"},

fails with logs:

I kicked off a functional test on Jenkins to see if it's just a local setup issue:

http://uberjenkins.sc.couchbase.com/job/cen7-sync-gateway-functional-tests-base-di/491/

tleyden commented 7 years ago

Failing sg-accel unit test:

./test.sh -run TestInterruptedBackfillWithWritesAndHistory

error:

fine users in the configuration json or via the REST API on the admin port, and grant users to channels via the admin_channels parameter. -- rest.emitAccessRelatedWarnings() at server_context.go:919
2017-05-05T18:58:59.428-04:00 Opening Walrus database sg_index_bucket on <walrus:>
2017-05-05 18:58:59.428216 I | it: {restTester:{_bucket:<nil> _sc:0xc420117400 noAdminParty:false distributedIndex:false syncFn:function(doc) {channel(doc.channel); access(doc.accessUser, doc.accessChannel);} cacheConfig:<nil>} mockSequenceGenerator:0xc420115c80 _indexBucket:<nil> accelContext:0xc420115d20 dbName:db serverName:walrus: numVbuckets:1024 numShards:16 incomingMockSgOps:0xc4200fd260 indexWriter:<nil>}
2017-05-05 18:58:59.428238 I | it serverContext: &{config:0xc42011cd80 databases_:map[db:0xc42011e900] lock:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0} statsTicker:<nil> HTTPClient:0x137e020 replicator:0xc420179d10}
2017-05-05 18:58:59.432874 I | updateAndFetchUser calling endpoint: /db/_user/david
2017-05-05 18:58:59.606414 I | getUser calling endpoint: /db/_user/david
2017-05-05T18:58:59.609-04:00 Initializing index partitions from bucket, found 16 partitions
2017-05-05T18:58:59.751-04:00 Changes+: Vector multi changes feed...
2017-05-05T18:58:59.751-04:00 Changes+: Vector MultiChangesFeed({*}, {Since:0 Limit:0 Conflicts:true IncludeDocs:false Wait:false Continuous:false Terminator:0xc420c38720 HeartbeatMs:0 TimeoutMs:5000 ActiveOnly:false}) ...   (to david)
2017-05-05T18:58:59.752-04:00 Changes+: MultiChangesFeed: channels expand to "!:747.1" ...   (to david)
2017-05-05T18:58:59.752-04:00 Changes+: GotChannelSince... !:747.1
2017-05-05T18:58:59.752-04:00 Changes+: Starting for channel... !, 747.1
2017-05-05T18:58:59.752-04:00 Changes+: Starting backfill for channel... !, Since:[0:0], TriggeredBy:[747:1]
2017-05-05T18:58:59.752-04:00 Changes+: [changesFeed] Found 0 non-backfill changes for channel !
2017-05-05T18:58:59.752-04:00 Changes+: MultiChangesFeed done   (to david)
2017-05-05 18:58:59.752441 I | getChanges response: {"results":[
],
"last_seq":"0"}

    index_writer_changes_test.go:2309
    assert.Equals(t, len(results), 2)

    expected: 2
         got: 0
--- FAIL: TestInterruptedBackfillWithWritesAndHistory (0.36s)
    assert.go:31:
panic: runtime error: index out of range [recovered]
    panic: runtime error: index out of range