TestChannelAccessChanges fails with: https://gist.github.com/tleyden/1ffa96ff109df5c4dd30b53f45da2198
TestContinuousChangesBackfill fails with: https://gist.github.com/tleyden/e45c37520501ea23afc6ed8ab55aeb04
vs successful logs:
https://gist.github.com/tleyden/83f73db80f02441d3742accbc510951d
I repro'd this by running a single test against couchbase server:
https://gist.github.com/tleyden/5a0c69da17e4641927fc4d81f2a7d431
This was couchbase 5 under docker. I ended up switching to couchbase 4 running locally.
TestAttachmentForRejectedDocument failing locally against couchbase 4.1.
https://gist.github.com/tleyden/b1636309ae542500beefa72b9c6a287c
Is DCP close() async? Then when we flush the bucket, we get a DCP rollback.
Maybe there's no defer close for the bucket?
Latest failure logs: https://gist.github.com/tleyden/d5e24af38169f11e41ace05a0d7d1d0f
=== RUN TestFilterToAvailableSince/RoleAndRoleChannelGrantAfterSince_ChannelGrantFirst
2017-10-03 13:58:56.802800 I | NOT Using xattrs: strings.ToLower(useXattrs) != strings.ToLower(TestEnvSyncGatewayTrue). |false| != |true|
2017-10-03 13:58:56.804545 I | Flushing test bucket: test_data_bucket
2017-10-03 13:58:57.410726 I | workerReadOwnWrites completed without errors
2017-10-03T13:58:57.410-07:00 GoCBCustomSGTranscoder Opening Couchbase database test_data_bucket on <http://localhost:8091> as user "test_data_bucket"
auth_test.go:549
assert.Equals(t, err, nil)
expected: <nil>
got: &gocbcore.KvError{Code:0x86, Name:"", Description:"temporary failure occurred, try again later", Context:"", Ref:""}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x98 pc=0x8cb62]
goroutine 795 [running]:
panic(0x492700, 0xc4200160f0)
/usr/local/go/src/runtime/panic.go:500 +0x1a1
testing.tRunner.func1(0xc4200a4540)
/usr/local/go/src/testing/testing.go:579 +0x25d
panic(0x492700, 0xc4200160f0)
/usr/local/go/src/runtime/panic.go:458 +0x243
github.com/couchbase/sync_gateway/auth.TestFilterToAvailableSince.func1(0xc4200a4540)
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/auth/auth_test.go:554 +0x492
testing.tRunner(0xc4200a4540, 0xc420536f60)
/usr/local/go/src/testing/testing.go:610 +0x81
created by testing.(*T).Run
/usr/local/go/src/testing/testing.go:646 +0x2ec
FAIL github.com/couchbase/sync_gateway/auth 46.532s
Created separate ticket for adding additional retry handling: https://github.com/couchbase/sync_gateway/issues/2949
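For reference, a minimal sketch of the kind of retry handling meant here: retry operations that fail with a transient KV error such as the "temporary failure" above, with doubling backoff. This is illustrative Go, not the actual base.RetryLoop, and errTempFail just stands in for the real gocbcore error check:

package main

import (
    "errors"
    "fmt"
    "time"
)

// errTempFail stands in for gocbcore's "temporary failure" error.
var errTempFail = errors.New("temporary failure occurred, try again later")

// retryWithBackoff retries op with doubling backoff until it succeeds,
// returns a non-transient error, or maxAttempts is exhausted.
func retryWithBackoff(op func() error, maxAttempts int, initialSleep time.Duration) error {
    sleep := initialSleep
    var err error
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        if err = op(); err == nil {
            return nil
        }
        if err != errTempFail {
            return err // non-transient error: give up immediately
        }
        time.Sleep(sleep)
        sleep *= 2
    }
    return fmt.Errorf("giving up after %d attempts: %v", maxAttempts, err)
}

func main() {
    calls := 0
    err := retryWithBackoff(func() error {
        calls++
        if calls < 3 {
            return errTempFail
        }
        return nil
    }, 10, 10*time.Millisecond)
    fmt.Printf("err=%v after %d calls\n", err, calls)
}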
Another failure from https://gist.github.com/tleyden/d5e24af38169f11e41ace05a0d7d1d0f
2017-10-03T14:06:28.760-07:00 Changes+: Int sequence multi changes feed...
2017-10-03T14:06:28.760-07:00 Changes: MultiChangesFeed(channels: {*}, options: {Since:0 Limit:0 Conflicts:false IncludeDocs:false Wait:false Continuous:false Terminator:0xc42268f3e0 HeartbeatMs:0 TimeoutMs:300000 ActiveOnly:false}) ... (to zegpold)
2017-10-03T14:06:28.760-07:00 Changes: simple changes cannot get Close Notifier from ResponseWriter
2017-10-03T14:06:28.760-07:00 Changes+: MultiChangesFeed: channels expand to "!:1,alpha:9,gamma:4,zero:1" ... (to zegpold)
2017-10-03T14:06:28.760-07:00 Cache: getCachedChanges("gamma", 4:0) --> 1 changes valid from #1
2017-10-03T14:06:28.760-07:00 Changes+: [changesFeed] Found 1 changes for channel gamma
2017-10-03T14:06:28.760-07:00 Cache: getCachedChanges("zero", 0) --> 0 changes valid from #1
2017-10-03T14:06:28.760-07:00 Changes+: [changesFeed] Found 0 changes for channel zero
2017-10-03T14:06:28.760-07:00 Cache: getCachedChanges("!", 0) --> 0 changes valid from #1
2017-10-03T14:06:28.760-07:00 Changes+: [changesFeed] Found 0 changes for channel !
2017-10-03T14:06:28.760-07:00 Changes+: Grant for channel [alpha] is after the current sequence - skipped for this iteration. Grant:[9] Current:[8] (to zegpold)
2017-10-03T14:06:28.760-07:00 Changes+: Channel feed processing seq:8 in channel gamma (to zegpold)
2017-10-03T14:06:28.760-07:00 Changes+: MultiChangesFeed sending {Seq:8, ID:g1, Changes:[map[rev:1-97023f8d0f6031c1587c483ff7f5a2ac]]} (to zegpold)
2017-10-03T14:06:28.760-07:00 Changes: MultiChangesFeed done (to zegpold)
2017-10-03 14:06:28.760771 I | //////// _changes for zegpold looks like: {"results":[
{"seq":8,"id":"g1","changes":[{"rev":"1-97023f8d0f6031c1587c483ff7f5a2ac"}]}
],
"last_seq":"8"}
api_test.go:1682
assert.Equals(t, len(changes.Results), 2)
expected: 2
got: 1
2017-10-03 14:06:28.763531 I | DatabaseContext.Close() called
It looks like there need to be additional "wait for changes" retry loops added to the test.
Regarding the previous comment, I've added more "wait for changes" retries to that test: 5c46825d7bc0904da7fd8616622b00bdcf9b4715
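For reference, the "wait for changes" additions boil down to a polling loop along these lines; waitForChangesCount and its fetch callback are illustrative stand-ins for however the test actually reads the _changes results, not a real API:

package changesutil

import (
    "fmt"
    "time"
)

// waitForChangesCount polls until the changes feed reports at least the
// expected number of results, or the timeout expires.
func waitForChangesCount(fetch func() (int, error), expected int, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for {
        n, err := fetch()
        if err != nil {
            return err
        }
        if n >= expected {
            return nil
        }
        if time.Now().After(deadline) {
            return fmt.Errorf("timed out waiting for %d changes, last saw %d", expected, n)
        }
        time.Sleep(100 * time.Millisecond)
    }
}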
Another test failure: https://gist.github.com/tleyden/814fcad420a21afcc5d26a6a5f5a134c
--- FAIL: TestContinuousChangesBackfill (4.34s)
panic: changeCache: Sequence 13 never showed up! [recovered]
panic: changeCache: Sequence 13 never showed up!
goroutine 805 [running]:
panic(0x55ee40, 0xc420204df0)
/usr/local/go/src/runtime/panic.go:500 +0x1a1
testing.tRunner.func1(0xc42027e840)
/usr/local/go/src/testing/testing.go:579 +0x25d
panic(0x55ee40, 0xc420204df0)
/usr/local/go/src/runtime/panic.go:458 +0x243
github.com/couchbase/sync_gateway/db.(*changeCache).waitForSequence(0xc4202e8e10, 0xd)
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/change_cache.go:289 +0x192
github.com/couchbase/sync_gateway/db.(*changeCache).waitForSequenceID(0xc4202e8e10, 0x0, 0x0, 0x0, 0xd, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/change_cache.go:274 +0x35
github.com/couchbase/sync_gateway/db.TestContinuousChangesBackfill(0xc42027e840)
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/change_cache_test.go:430 +0x8d6
testing.tRunner(0xc42027e840, 0x6d9770)
/usr/local/go/src/testing/testing.go:610 +0x81
created by testing.(*T).Run
/usr/local/go/src/testing/testing.go:646 +0x2ec
FAIL github.com/couchbase/sync_gateway/db 10.108s
re:
Is DCP close() async? Then when we flush the bucket, we get a DCP rollback.
I saw this in the cbdatasource.go code:
// TODO: By this point, worker goroutines may still be going, but
// should end soon. Instead Close() should be 100% synchronous.
return nil
Looks like in the current master branch it hasn't changed:
https://github.com/couchbase/go-couchbase/blob/master/cbdatasource/cbdatasource.go#L1677-L1679
I filed https://issues.couchbase.com/browse/MB-26262 to see if the cbdatasource maintainers think that's an issue.
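To illustrate what a fully synchronous Close() would look like in general terms, here's a generic sketch (not cbdatasource's actual internals): track the worker goroutines with a WaitGroup and have Close() block until they have all returned, so a test that's about to flush the bucket knows nothing is still streaming from the old feed.

package feedsketch

import "sync"

// feed is a stand-in for a component that owns worker goroutines, such as a
// DCP data source.
type feed struct {
    quit chan struct{}
    wg   sync.WaitGroup
}

func newFeed() *feed {
    return &feed{quit: make(chan struct{})}
}

// startWorker launches a worker that is expected to return promptly once the
// quit channel is closed.
func (f *feed) startWorker(work func(quit <-chan struct{})) {
    f.wg.Add(1)
    go func() {
        defer f.wg.Done()
        work(f.quit)
    }()
}

// Close signals all workers and blocks until every one of them has exited.
func (f *feed) Close() error {
    close(f.quit)
    f.wg.Wait()
    return nil
}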
Latest failure: https://gist.github.com/tleyden/daa6bdb9c4320dc9157cb0bfa2fdf4e0
This failure demonstrates how the DCP errors from the previous test happen during the test cleanup of the following test:
https://gist.github.com/tleyden/b046d9a6780adce6e574cfa9680db273
In 1c2088e84ab344ee946b5dd7bd431d81282487f0 I added a dcp receiver id to try to pinpoint which dcp receiver was getting the callbacks.
From https://gist.github.com/tleyden/c4077efc194ab67fa0a86a30f89b49d3 it's clear that it's the dcp receiver from the previous test getting the dcp rollback callbacks.
Also, I'm able to repro it quickly + reliably using TestReproduceTestResidue()
These results demonstrate the interleaving of the dcp receivers: https://gist.github.com/tleyden/1006df094374f6d218f61ff859f11667#file-gistfile1-txt-L355-L358
It looks like cbdatasource.Close() is never even getting called during TestReproduceTestResidue(). Working backwards up the stack:
- In dcp_feed.go#StartDCPFeed(), args.Terminator is nil, so it does not start the goroutine that listens for the channel to be closed and calls cbdatasource.Close().
- In change_listener.go#Start(), it creates a sgbucket.FeedArguments struct with no Terminator, which causes it to be nil in dcp_feed.go#StartDCPFeed().
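Here's a minimal sketch of that Terminator pattern (illustrative types, not the real sgbucket/cbdatasource signatures), showing why a nil Terminator means Close() never runs:

package dcpsketch

// dataSource is a stand-in for cbdatasource.BucketDataSource.
type dataSource interface {
    Close() error
}

// feedArgs is a stand-in for sgbucket.FeedArguments.
type feedArgs struct {
    Terminator chan bool // closed by the caller to shut the feed down
}

// startDCPFeed mirrors the structure described above: the shutdown goroutine
// is only started when a Terminator channel was actually provided.
func startDCPFeed(args feedArgs, ds dataSource) {
    // ... start streaming from ds ...
    if args.Terminator != nil {
        go func() {
            <-args.Terminator // blocks until the caller closes the channel
            ds.Close()
        }()
    }
    // With a nil Terminator this branch is skipped, so nothing ever calls
    // ds.Close(), and the receiver keeps getting callbacks into the next test.
}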
Still seeing failures even after the fix for #2952: https://gist.github.com/tleyden/4789cdac626589f3c7b3ccdeea6714ba
Still seeing dcp receiver callback residue. I believe some of the tests are missing cleanup.
I ran the full suite on the Jenkins slave from the CLI, and ran into a failure with TestLowSequenceHandling:
https://gist.github.com/tleyden/9c48a73971f2a66abef125797414b784#file-gistfile1-txt-L95-L101
Looks like this also needs the "changes feed retry loop"
The last test run failed with logs:
2017-10-05T11:02:54.006-07:00 WARNING: Error flushing bucket: {"_":"Flush failed with unexpected error. Check server logs for details."} Will retry. -- base.(*TestBucketManager).EmptyTestBucket.func1() at util_testing.go:287
2017-10-05T11:02:57.910-07:00 WARNING: Error inserting key to recently flushed bucket: temporary failure occurred, try again later Retrying. -- base.(*TestBucketManager).EmptyTestBucket.func3() at util_testing.go:370
2017-10-05T11:02:58.010-07:00 WARNING: Error inserting key to recently flushed bucket: temporary failure occurred, try again later Retrying. -- base.(*TestBucketManager).EmptyTestBucket.func3() at util_testing.go:370
2017-10-05T11:02:58.211-07:00 WARNING: Error inserting key to recently flushed bucket: temporary failure occurred, try again later Retrying. -- base.(*TestBucketManager).EmptyTestBucket.func3() at util_testing.go:370
2017-10-05T11:02:58.612-07:00 WARNING: Error inserting key to recently flushed bucket: temporary failure occurred, try again later Retrying. -- base.(*TestBucketManager).EmptyTestBucket.func3() at util_testing.go:370
2017-10-05 11:02:59.413392 I | workerReadOwnWrites completed without errors
2017-10-05T11:02:59.415-07:00 GoCBCustomSGTranscoder Opening Couchbase database test_data_bucket on <http://integration-cbs.sc.couchbase.com:8091> as user "test_data_bucket"
2017-10-05T11:03:04.561-07:00 WARNING: RetryLoop for CouchbaseBucketGoCB Set() giving up after 11 attempts -- base.RetryLoop() at util.go:298
I'm running cbcollect_info to try to find more information on why these temporary failures are happening.
From @jamesmauss in support:
looks like you were hitting https://issues.couchbase.com/browse/MB-25803 fixed in releases after 3505
I'm upgrading to a more recent version of Couchbase Server.
I installed couchbase server locally on the jenkins slave where the integration tests were running, to see if it made any difference.
Here are the latest failures:
https://gist.github.com/tleyden/898275de7a946e986de249435aac373d#file-gistfile1-txt-L247-L292
I also ran cbcollect_info to see if it reveals anything.
From these logs it looks like sg-replicate replicators are started in some tests and never shut down:
https://gist.github.com/tleyden/1de5f235f5163acba06fab7fb58a8de4
While running locally I saw the following failure:
https://gist.github.com/tleyden/f65bd202f4e79503da3308e02e552d19
From the dump of goroutines, it looks like gocb connections aren't being cleaned up, since the dump contains thousands of stacks that contain gocb. That might explain some of the temporary failures.
goroutine 3075 [select, 277 minutes]:
net/http.(*persistConn).writeLoop(0xc4205e8400)
/usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
/usr/local/go/src/net/http/transport.go:1063 +0x50e
goroutine 3512 [chan receive, 277 minutes]:
gopkg.in/couchbase/gocbcore%2ev7.(*memdPipelineClient).ioLoop.func1(0xc42034aca0, 0xf, 0xc4204b4500, 0xc420b40190, 0xc420d80000)
/Users/tleyden/Development/sync_gateway/godeps/src/gopkg.in/couchbase/gocbcore.v7/memdpipeline.go:192 +0x129
created by gopkg.in/couchbase/gocbcore%2ev7.(*memdPipelineClient).ioLoop
/Users/tleyden/Development/sync_gateway/godeps/src/gopkg.in/couchbase/gocbcore.v7/memdpipeline.go:205 +0xfb
goroutine 3069 [GC assist wait]:
gopkg.in/couchbase/gocbcore%2ev7.(*memdTcpConn).ReadPacket(0xc420688370, 0xc420ba5030, 0xf, 0x2)
/Users/tleyden/Development/sync_gateway/godeps/src/gopkg.in/couchbase/gocbcore.v7/memdconn.go:153 +0xd5
gopkg.in/couchbase/gocbcore%2ev7.(*memdClient).run.func2(0xc4206883c0, 0xc4203b42a0, 0xc4203ba540)
/Users/tleyden/Development/sync_gateway/godeps/src/gopkg.in/couchbase/gocbcore.v7/memdclient.go:147 +0xdf
created by gopkg.in/couchbase/gocbcore%2ev7.(*memdClient).run
/Users/tleyden/Development/sync_gateway/godeps/src/gopkg.in/couchbase/gocbcore.v7/memdclient.go:198 +0xce
I'm going to rerun and try to keep my eye on the total number of server connections reported by couchbase.
Result of running TestReproduceTestResidue and dumping goroutines:
https://gist.github.com/tleyden/d5b92053b98f12dbe0153959d20061d4
It looks like a lot of the resource leakage from TestReproduceTestResidue is due to the fact that this goroutine never seems to exit:
func (db *Database) SimpleMultiChangesFeed(chans base.Set, options ChangesOptions) (<-chan *ChangeEntry, error) {
    to := ""
    if db.user != nil && db.user.Name() != "" {
        to = fmt.Sprintf(" (to %s)", db.user.Name())
    }
    base.LogTo("Changes", "MultiChangesFeed(channels: %s, options: %+v) ... %s", chans, options, to)
    output := make(chan *ChangeEntry, 50)
    go func() {
        defer func() {
            base.LogTo("Changes", "MultiChangesFeed done %s", to)
            close(output)
        }()
I'm not seeing "MultiChangesFeed done" in the logs anywhere.
After c00b11ac4fffca1af01b8a4e7916ac6e118dafdd, the change_listener.go:203 lines are no longer accumulating in the goroutine stack dumps, and "MultiChangesFeed done" is in the logs. Consider this change a "proof of concept" to open a discussion, not necessarily the best approach to fix this.
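For context, the shape of that change is roughly the following (a simplified sketch, not the actual commit): the feed goroutine's select on a terminator channel is what lets closing the listener unblock it, run the deferred "MultiChangesFeed done" log, and exit instead of leaking.

package changessketch

// changesFeed is a stripped-down illustration of the goroutine above: the
// select on terminator is what allows it to exit when the listener/database
// is closed, instead of blocking forever on its input.
func changesFeed(terminator <-chan bool, input <-chan int) <-chan int {
    output := make(chan int, 50)
    go func() {
        // Corresponds to the "MultiChangesFeed done" defer in the real code.
        defer close(output)
        for {
            select {
            case entry, ok := <-input:
                if !ok {
                    return // input closed: normal completion
                }
                output <- entry
            case <-terminator:
                return // listener closed: exit rather than leak
            }
        }
    }()
    return output
}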
New logs: https://gist.github.com/tleyden/d4825dcb31d55421307a031f286bf1a2
I'm noticing that change_cache.go background task goroutines are hanging around longer than necessary. From https://gist.github.com/tleyden/d4825dcb31d55421307a031f286bf1a2:
goroutine 181 [sleep]:
time.Sleep(0xdf8475800)
/usr/local/go/src/runtime/time.go:59 +0xe1
github.com/couchbase/sync_gateway/db.(*changeCache).Init.func2(0xc4200dc2d0)
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/change_cache.go:126 +0x43
created by github.com/couchbase/sync_gateway/db.(*changeCache).Init
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/change_cache.go:130 +0x285
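One way to make that kind of periodic background task stoppable (an illustrative sketch; the real change_cache code differs) is to replace the bare time.Sleep loop with a ticker plus a terminator channel, so the task exits as soon as the cache is closed:

package cachesketch

import "time"

// startBackgroundTask runs cleanup on a fixed interval until terminator is
// closed, at which point the goroutine returns immediately instead of
// lingering in a sleep.
func startBackgroundTask(interval time.Duration, terminator <-chan struct{}, cleanup func()) {
    go func() {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                cleanup()
            case <-terminator:
                return
            }
        }
    }()
}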
I cherry picked the commits from #2962 onto the branch for this issue. After those changes, plus the changes mentioned in https://github.com/couchbase/sync_gateway/issues/2938#issuecomment-334879394, all of the goroutines hanging around after running SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false ./test.sh -run TestReproduceTestResidue are basically gocb related:
https://gist.github.com/tleyden/ef4ce9ed67dee5c12a745de8b565842d#file-gistfile1-txt-L3416-L4663
After the fix for https://github.com/couchbase/sync_gateway/issues/2965, the full suite runs for me locally when I run it package by package:
echo "test-rest"; SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false go test -timeout 180m -v github.com/couchbase/sync_gateway/rest && echo "test-auth"; SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false go test -timeout 180m -v github.com/couchbase/sync_gateway/auth && echo "test-base"; SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false go test -timeout 180m -v github.com/couchbase/sync_gateway/base && echo "test-channels"; SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false go test -timeout 180m -v github.com/couchbase/sync_gateway/channels && echo "test-db"; SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false go test -timeout 180m -v github.com/couchbase/sync_gateway/db
However, when I run via SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false ./test.sh, I'm seeing failures: https://gist.github.com/4b76950411f0f554a03452d00b4ec43d. Still investigating.
After cleaning up quite a few places where buckets were left open, I'm still seeing dcp rollback errors and some failing tests:
https://gist.github.com/tleyden/f4ff3a3c2e95a8708559be7de5c174d2
Another rollback-related failure:
SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false ./test.sh
https://gist.github.com/tleyden/82b3c5184d6724dd4210bd9c598a0b71
On a Jenkins slave, it failed trying to flush the bucket and read its own writes:
https://gist.github.com/tleyden/58df24fd97b387db63f4bbad3b92d0d3
Logs with more dcp logging enabled: https://gist.github.com/tleyden/d0038d3103b95bb8bba4a791819a09d3
Log snippet:
2017-10-11T15:23:22.811-07:00 WARNING: Using DCP Logging Receiver. Receiver: 11 -- base.NewDCPReceiver() at dcp_feed.go:112
2017-10-11 15:23:22.811425 I | Starting dcp feed. args.Backfill: 18446744073709551615
2017-10-11 15:23:22.811430 I | For non-backfill, use vbucket uuids, high sequence numbers
2017-10-11 15:23:22.812057 I | Seeding seqnos: map[59:2 32:1]
....
2017-10-11T15:23:23.143-07:00 WARNING: Got unrecoverable error <nil> trying to get key _sync:user:naomi. Not going to retry. -- base.CouchbaseBucketGoCB.Get.func2() at bucket_gocb.go:303
2017-10-11T15:23:23.143-07:00 WARNING: Got unrecoverable error <nil> trying to get key _sync:user:naomi. Not going to retry. -- base.CouchbaseBucketGoCB.Get.func2() at bucket_gocb.go:303
2017-10-11T15:23:23.737-07:00 WARNING: DCP RollbackEx request - rolling back DCP feed for: vbucketId: 59, rollbackSeq: 0. Receiver: 11 -- base.(*DCPReceiver).RollbackEx() at dcp_feed.go:254
2017-10-11T15:23:23.742-07:00 WARNING: DCP RollbackEx request - rolling back DCP feed for: vbucketId: 49, rollbackSeq: 0. Receiver: 11 -- base.(*DCPReceiver).RollbackEx() at dcp_feed.go:254
2017-10-11T15:23:23.747-07:00 WARNING: DCP RollbackEx request - rolling back DCP feed for: vbucketId: 32, rollbackSeq: 0. Receiver: 11 -- base.(*DCPReceiver).RollbackEx() at dcp_feed.go:254
2017-10-11T15:23:23.766-07:00 WARNING: DCP RollbackEx request - rolling back DCP feed for: vbucketId: 8, rollbackSeq: 0. Receiver: 11 -- base.(*DCPReceiver).RollbackEx() at dcp_feed.go:254
2017-10-11T15:23:23.768-07:00 WARNING: DCP RollbackEx request - rolling back DCP feed for: vbucketId: 5, rollbackSeq: 0. Receiver: 11 -- base.(*DCPReceiver).RollbackEx() at dcp_feed.go:254
It's interesting that it gets a "rolling back DCP feed for: vbucketId: 49, rollbackSeq: 0" message, even though the seqnos map didn't include vbucket 49.
I ran a wireshark capture during this run: https://gist.github.com/tleyden/54b5028c57ff22d873a182a1d14fdbe2
The main thing that jumped out at me was that the response for the vbucket seq numbers is showing some very high numbers, whereas I was expecting these to be 0, or to see a retry loop until they are 0.
Also, I found some code that I'd added to flush the bucket inside the loop that waits until the vb seq numbers are 0, and I think it's better to remove that:
_, highSeqnos, err := GetStatsVbSeqno(stats, maxVbno, false)
if err != nil {
    return false, err, nil
}
for vbNo := uint16(0); vbNo < maxVbno; vbNo++ {
    maxSeqForVb := highSeqnos[vbNo]
    if maxSeqForVb > 0 {
        log.Printf("max seq for vb %d > 0 (%d). Flushing and retrying", vbNo, maxSeqForVb)
        err = tbm.BucketManager.Flush()
        if err != nil {
            Warn("Error flushing bucket: %v", err)
        }
        return true, nil, nil
    }
}
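For comparison, with the flush removed the loop body would just report "retry" and leave the waiting to the enclosing retry loop. A sketch of that shape (same fragment style as the snippet above, not a drop-in patch):

for vbNo := uint16(0); vbNo < maxVbno; vbNo++ {
    if maxSeqForVb := highSeqnos[vbNo]; maxSeqForVb > 0 {
        log.Printf("max seq for vb %d > 0 (%d). Retrying", vbNo, maxSeqForVb)
        return true, nil, nil // shouldRetry, no error
    }
}
return false, nil, nil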
@adamcfraser when calling GetStatsVbSeqno(), it's currently passing useAbsHighSeqNo as false. Should it be passing true? I'm having trouble remembering what that means, and can't find any doc comments on it.
That flag determines whether we use the vb_nn:abs_high_seqno or vb_nn:high_seqno property from the stats response. Originally added w/ https://github.com/couchbase/sync_gateway/commit/3c3a0739afe402565ce17893bc5dddd7ea4a5a32
You could dig into the history there, but if you were using the same stats call to validate the flush (as mentioned above), why is it returning zeros in one case and not the other?
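Concretely, the flag amounts to choosing which stats key gets parsed per vbucket. A rough sketch, assuming the vb_<n>:high_seqno / vb_<n>:abs_high_seqno key layout named above (illustrative parsing, not the real GetStatsVbSeqno):

package statssketch

import (
    "fmt"
    "strconv"
)

// highSeqnos extracts the per-vbucket sequence numbers from a flat stats map,
// choosing between the high_seqno and abs_high_seqno variants based on the
// flag discussed above.
func highSeqnos(stats map[string]string, maxVbno uint16, useAbsHighSeqNo bool) (map[uint16]uint64, error) {
    key := "high_seqno"
    if useAbsHighSeqNo {
        key = "abs_high_seqno"
    }
    result := make(map[uint16]uint64, maxVbno)
    for vb := uint16(0); vb < maxVbno; vb++ {
        raw, ok := stats[fmt.Sprintf("vb_%d:%s", vb, key)]
        if !ok {
            continue // vbucket not reported in this stats response
        }
        seq, err := strconv.ParseUint(raw, 10, 64)
        if err != nil {
            return nil, err
        }
        result[vb] = seq
    }
    return result, nil
}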
Another run after making some fixes, one of which made it stop reusing the (stateful) sleeper function, which looks like it was preventing certain retries from happening when they should have.
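As background, the "stateful sleeper" problem looks roughly like this (an illustrative doubling sleeper, not the actual base helper): the closure keeps its accumulated backoff between calls, so reusing one instance across retry loops carries an inflated sleep interval into later loops; creating a fresh sleeper per retry loop avoids that.

package retrysketch

// newDoublingSleeper returns a sleeper in a RetryLoop-style shape: given the
// attempt number, it reports whether to keep retrying and how long to sleep.
// The captured sleep variable is the state that leaks if the same sleeper
// instance is shared across retry loops.
func newDoublingSleeper(maxAttempts, initialSleepMs int) func(attempt int) (shouldRetry bool, sleepMs int) {
    sleep := initialSleepMs
    return func(attempt int) (bool, int) {
        if attempt > maxAttempts {
            return false, -1
        }
        if attempt > 1 {
            sleep *= 2 // mutates captured state on every call, across every loop that reuses it
        }
        return true, sleep
    }
}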
The most salient aspect of this run is that I'm seeing an unexpected bucket flush happening during the run of the TestUserDeleteDuringChangesWithAccess test, which starts at 2017-10-11 17:19:56.160616.
In the last run there are signs that functionality from other tests is bleeding into the failing test.
In frame 132415 of the capture, there is a doc mutation on doc-5, which overlaps with this in the logs:
2017-10-11 18:11:20.745273 I | max seq for vb 59 > 0 (2). Retrying
2017-10-11 18:11:20.756700 I | max seq for vb 59 > 0 (2). Retrying
2017-10-11 18:11:20.782237 I | max seq for vb 59 > 0 (2). Retrying
2017-10-11 18:11:20.825898 I | max seq for vb 59 > 0 (2). Retrying
2017-10-11 18:11:20.909486 I | max seq for vb 20 > 0 (1). Retrying
2017-10-11 18:11:21.073835 I | max seq for vb 20 > 0 (1). Retrying
2017-10-11 18:11:21.399066 I | max seq for vb 20 > 0 (1). Retrying
2017-10-11 18:11:22.043657 I | max seq for vb 48 > 0 (1). Retrying
2017-10-11 18:11:23.327073 I | max seq for vb 48 > 0 (1). Retrying
2017-10-11 18:11:25.891282 I | max seq for vb 5 > 0 (1). Retrying
2017-10-11 18:11:31.016160 I | max seq for vb 59 > 0 (2). Retrying
2017-10-11 18:11:41.260437 I | max seq for vb 59 > 0 (2). Retrying
2017-10-11 18:12:01.746042 I | max seq for vb 5 > 0 (3). Retrying
2017-10-11T18:12:01.746-07:00 WARNING: RetryLoop for Wait for vb sequence numbers to be 0 giving up after 13 attempts -- base.RetryLoop() at util.go:298
TestStopChangeCache appears to be leaking changeListeners. If run in a loop, it consistently fails the second time.
Changing:
//// Use leaky bucket to have the tap feed 'lose' document 3
leakyConfig := base.LeakyBucketConfig{
    TapFeedMissingDocs: []string{"doc-3"},
}
db := setupTestLeakyDBWithCacheOptions(t, cacheOptions, leakyConfig)
to:
db := setupTestDBWithCacheOptions(t, cacheOptions)
fixes the issue.
Failing stacktrace:
goroutine 33 [chan receive]:
github.com/couchbase/sync_gateway/base.(*LeakyBucket).wrapFeed.func1(0x905380, 0xc42038d550, 0xc42038d540, 0xc420386220)
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/base/leaky_bucket.go:211 +0xab
created by github.com/couchbase/sync_gateway/base.(*LeakyBucket).wrapFeed
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/base/leaky_bucket.go:217 +0x126
goroutine 114 [chan receive]:
github.com/couchbase/sync_gateway/db.(*changeListener).StartMutationFeed.func1(0xc4204620b0)
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/change_listener.go:82 +0xc7
created by github.com/couchbase/sync_gateway/db.(*changeListener).StartMutationFeed
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/change_listener.go:85 +0x2f8
Logs from last run: https://gist.github.com/tleyden/5f4904220189e49de6a344a516988e16
In several places it's detecting residual changeListener goroutines:
--- PASS: TestPurgeWithGreaterThanOneRevision (0.57s)
=== RUN TestPurgeWithNonStarRevision
2017-10-12 21:21:45.454392 I | bucketDataSource.Close() called
2017-10-12T21:21:55.736-07:00 WARNING: RetryLoop for Retry StackTraceContainsPatterns giving up after 11 attempts -- base.RetryLoop() at util.go:298
2017-10-12 21:21:55.741769 I | Stacktrace with unexpected pattern: changeListener. Stacktrace: goroutine 4380 [running]:
github.com/couchbase/sync_gateway/base.AssertStackTraceDoesntContainPatterns(0xc4200a2540, 0xc42015acd0, 0x1, 0x1)
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/base/util_testing.go:566 +0xd5
github.com/couchbase/sync_gateway/base.AssertStackTraceDoesntContainProblematicPatterns(0xc4200a2540)
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/base/util_testing.go:556 +0x4f
github.com/couchbase/sync_gateway/rest.TestPurgeWithNonStarRevision(0xc4200a2540)
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/rest/admin_api_test.go:1138 +0x40
testing.tRunner(0xc4200a2540, 0x956988)
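For reference, the core of that check is just dumping every goroutine stack and scanning it for patterns; a simplified sketch of what AssertStackTraceDoesntContainPatterns amounts to (the real helper also wraps this in a retry loop, as the log above shows):

package stacksketch

import (
    "runtime"
    "strings"
)

// stackTraceContainsPatterns dumps the stacks of all goroutines and reports
// whether any of the given patterns (e.g. "changeListener") still appear,
// returning the dump for logging when one does.
func stackTraceContainsPatterns(patterns ...string) (bool, string) {
    buf := make([]byte, 1<<20)
    n := runtime.Stack(buf, true) // true: include every goroutine, not just the caller
    stack := string(buf[:n])
    for _, p := range patterns {
        if strings.Contains(stack, p) {
            return true, stack
        }
    }
    return false, stack
}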
I just noticed this timestamp anomaly when running:
$ SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false ./test.sh -run ^TestShadowerPatternRepeat$
Logs (note that the timestamp goes backwards from 2017-10-16T14:57:54.351-07:00 -> 2017-10-16T14:55:00.824-07:00):
--- PASS: TestShadowerPattern (1.31s)
2017-10-16 14:57:54.351853 I | bucketDataSource.Close() called
2017-10-16 14:57:54.351890 I | bucketDataSource.Close() called
2017-10-16T14:57:54.351-07:00 WARNING: Error processing DCP stream - will attempt to restart/reconnect: pkt.Receive, err: read tcp 127.0.0.1:63612->127.0.0.1:11210: use of closed network connection. Receiver ID: 60 -- base.(*DCPReceiver).OnError() at dcp_feed.go:128
2017-10-16T14:57:54.351-07:00 WARNING: Error processing DCP stream - will attempt to restart/reconnect: pkt.Receive, err: read tcp 127.0.0.1:63611->127.0.0.1:11210: use of closed network connection. Receiver ID: 59 -- base.(*DCPReceiver).OnError() at dcp_feed.go:128
PASS
ok github.com/couchbase/sync_gateway/db 174.584s
=== RUN TestUserAPI
2017-10-16T14:55:00.824-07:00 NOT Using xattrs: strings.ToLower(useXattrs) != strings.ToLower(TestEnvSyncGatewayTrue). |false| != |true|
2017-10-16 14:55:00.828210 I | Flushing test bucket: test_data_bucket
2017-10-16 14:55:00.828235 I | BucketManager.Flush() called. bm: 0xc42033c900
2017-10-16 14:55:01.260284 I | workerReadOwnWrites completed without errors
2017-10-16T14:55:01.260-07:00 GoCBCustomSGTranscoder Opening Couchbase database test_data_bucket on <http://localhost:8091> as user "test_data_bucket"
2017-10-16T14:55:01.262-07:00 NOT Using xattrs: strings.ToLower(useXattrs) != strings.ToLower(TestEnvSyncGatewayTrue). |false| != |true|
2017-10-16T14:55:01.262-07:00 NOT Using xattrs: strings.ToLower(useXattrs) != strings.ToLower(TestEnvSyncGatewayTrue). |false| != |true|
2017-10-16T14:55:01.262-07:00 Opening db /db as bucket "test_data_bucket", pool "default", server <http://localhost:8091>
I posted to golang-nuts asking for help. But if go test is running the test binaries for different packages in parallel, that would explain a lot of the mysterious failures when trying to run the full test suite, as well as the fact that it appears to work when run on a package-by-package basis.
The fix: when running tests in integration mode, it will need to pass the -p 1 flag.
Latest failure, even with the -p 1 flag:
=== RUN TestChangeIndexGetChanges
2017-10-16 16:45:11.330209 I | Flushing test bucket: test_data_bucket
2017-10-16 16:45:11.330236 I | BucketManager.Flush() called. bm: 0xc420b82420
2017-10-16 16:45:11.796675 I | workerReadOwnWrites completed without errors
2017-10-16 16:45:11.799693 I | Flushing test bucket: test_indexbucket
2017-10-16 16:45:11.799709 I | BucketManager.Flush() called. bm: 0xc420b839b0
2017-10-16 16:45:12.390583 I | workerReadOwnWrites completed without errors
panic: runtime error: index out of range
goroutine 6128 [running]:
panic(0x860d20, 0xc420016130)
/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/couchbaselabs/sync-gateway-accel/indexwriter.ReorderPending(0xc4200115c0, 0xc4200117a0, 0xc420390040)
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbaselabs/sync-gateway-accel/indexwriter/kv_change_index_writer.go:535 +0x387
created by github.com/couchbaselabs/sync-gateway-accel/indexwriter.(*KvChangeIndexWriter).Init
/Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbaselabs/sync-gateway-accel/indexwriter/kv_change_index_writer.go:114 +0x174
exit status 2
FAIL github.com/couchbaselabs/sync-gateway-accel/indexwriter 108.005s
? github.com/couchbaselabs/sync-gateway-accel/sg_accel [no test files]
Unexpected vbnos: https://gist.github.com/tleyden/16f050051ddcff7b0f5795aee41e4a47. Not sure what's causing this.
Currently the integration tests in integration_tests.sh are run one at a time, to work around mysterious sporadic failures. However, this is causing issues with code coverage (https://github.com/couchbase/sync_gateway/issues/2937#issuecomment-332666641), so it should be possible to run the integration tests as a full suite rather than one test at a time.