couchbase / sync_gateway

Manages access and synchronization between Couchbase Lite and Couchbase Server
https://www.couchbase.com/products/sync-gateway

Run integration tests as full suite #2938

Closed tleyden closed 7 years ago

tleyden commented 7 years ago

Currently the integration tests in integration_tests.sh are run one at a time, to work around mysterious sporadic failures.

However, this is causing issues with code coverage (https://github.com/couchbase/sync_gateway/issues/2937#issuecomment-332666641), so it should be possible to run the integration tests as a full suite rather than one test at a time.

tleyden commented 7 years ago

TestChannelAccessChanges fails with: https://gist.github.com/tleyden/1ffa96ff109df5c4dd30b53f45da2198

tleyden commented 7 years ago

TestContinuousChangesBackfill fails with: https://gist.github.com/tleyden/e45c37520501ea23afc6ed8ab55aeb04

vs successful logs:

https://gist.github.com/tleyden/83f73db80f02441d3742accbc510951d

I repro'd this by running a single test against couchbase server:

https://gist.github.com/tleyden/5a0c69da17e4641927fc4d81f2a7d431

This was Couchbase Server 5 running under Docker. I ended up switching to Couchbase Server 4 running locally.

tleyden commented 7 years ago

TestAttachmentForRejectedDocument is failing locally against Couchbase Server 4.1.

https://gist.github.com/tleyden/b1636309ae542500beefa72b9c6a287c

tleyden commented 7 years ago

Is DCP Close() async? If so, then when we flush the bucket we get a DCP rollback.

tleyden commented 7 years ago

Maybe there's no deferred close for the bucket?

tleyden commented 7 years ago

Latest failure logs: https://gist.github.com/tleyden/d5e24af38169f11e41ace05a0d7d1d0f

=== RUN   TestFilterToAvailableSince/RoleAndRoleChannelGrantAfterSince_ChannelGrantFirst
2017-10-03 13:58:56.802800 I | NOT Using xattrs: strings.ToLower(useXattrs) != strings.ToLower(TestEnvSyncGatewayTrue).  |false| != |true|
2017-10-03 13:58:56.804545 I | Flushing test bucket: test_data_bucket
2017-10-03 13:58:57.410726 I | workerReadOwnWrites completed without errors
2017-10-03T13:58:57.410-07:00 GoCBCustomSGTranscoder Opening Couchbase database test_data_bucket on <http://localhost:8091> as user "test_data_bucket"

    auth_test.go:549
    assert.Equals(t, err, nil)

    expected: <nil>
         got: &gocbcore.KvError{Code:0x86, Name:"", Description:"temporary failure occurred, try again later", Context:"", Ref:""}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x98 pc=0x8cb62]

goroutine 795 [running]:
panic(0x492700, 0xc4200160f0)
    /usr/local/go/src/runtime/panic.go:500 +0x1a1
testing.tRunner.func1(0xc4200a4540)
    /usr/local/go/src/testing/testing.go:579 +0x25d
panic(0x492700, 0xc4200160f0)
    /usr/local/go/src/runtime/panic.go:458 +0x243
github.com/couchbase/sync_gateway/auth.TestFilterToAvailableSince.func1(0xc4200a4540)
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/auth/auth_test.go:554 +0x492
testing.tRunner(0xc4200a4540, 0xc420536f60)
    /usr/local/go/src/testing/testing.go:610 +0x81
created by testing.(*T).Run
    /usr/local/go/src/testing/testing.go:646 +0x2ec
FAIL    github.com/couchbase/sync_gateway/auth  46.532s
tleyden commented 7 years ago

Created separate ticket for adding additional retry handling: https://github.com/couchbase/sync_gateway/issues/2949

tleyden commented 7 years ago

Another failure from https://gist.github.com/tleyden/d5e24af38169f11e41ace05a0d7d1d0f

2017-10-03T14:06:28.760-07:00 Changes+: Int sequence multi changes feed...
2017-10-03T14:06:28.760-07:00 Changes: MultiChangesFeed(channels: {*}, options: {Since:0 Limit:0 Conflicts:false IncludeDocs:false Wait:false Continuous:false Terminator:0xc42268f3e0 HeartbeatMs:0 TimeoutMs:300000 ActiveOnly:false}) ...   (to zegpold)
2017-10-03T14:06:28.760-07:00 Changes: simple changes cannot get Close Notifier from ResponseWriter
2017-10-03T14:06:28.760-07:00 Changes+: MultiChangesFeed: channels expand to "!:1,alpha:9,gamma:4,zero:1" ...   (to zegpold)
2017-10-03T14:06:28.760-07:00 Cache: getCachedChanges("gamma", 4:0) --> 1 changes valid from #1
2017-10-03T14:06:28.760-07:00 Changes+: [changesFeed] Found 1 changes for channel gamma
2017-10-03T14:06:28.760-07:00 Cache: getCachedChanges("zero", 0) --> 0 changes valid from #1
2017-10-03T14:06:28.760-07:00 Changes+: [changesFeed] Found 0 changes for channel zero
2017-10-03T14:06:28.760-07:00 Cache: getCachedChanges("!", 0) --> 0 changes valid from #1
2017-10-03T14:06:28.760-07:00 Changes+: [changesFeed] Found 0 changes for channel !
2017-10-03T14:06:28.760-07:00 Changes+: Grant for channel [alpha] is after the current sequence - skipped for this iteration.  Grant:[9] Current:[8]   (to zegpold)
2017-10-03T14:06:28.760-07:00 Changes+: Channel feed processing seq:8 in channel gamma   (to zegpold)
2017-10-03T14:06:28.760-07:00 Changes+: MultiChangesFeed sending {Seq:8, ID:g1, Changes:[map[rev:1-97023f8d0f6031c1587c483ff7f5a2ac]]}   (to zegpold)
2017-10-03T14:06:28.760-07:00 Changes: MultiChangesFeed done   (to zegpold)
2017-10-03 14:06:28.760771 I | //////// _changes for zegpold looks like: {"results":[
{"seq":8,"id":"g1","changes":[{"rev":"1-97023f8d0f6031c1587c483ff7f5a2ac"}]}
],
"last_seq":"8"}

    api_test.go:1682
    assert.Equals(t, len(changes.Results), 2)

    expected: 2
         got: 1
2017-10-03 14:06:28.763531 I | DatabaseContext.Close() called

It looks like additional "wait for changes" retry loops need to be added to the test.
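
For reference, a minimal sketch of the kind of "wait for changes" retry loop meant here; changesForUser and the counts are hypothetical stand-ins, not the actual test code:

    package main

    import (
        "fmt"
        "time"
    )

    type ChangeEntry struct {
        Seq int
        ID  string
    }

    // changesForUser is a placeholder; in the real test this would issue the
    // _changes request for the given user.
    func changesForUser(user string) []ChangeEntry {
        return []ChangeEntry{{Seq: 8, ID: "g1"}}
    }

    // waitForChanges polls the changes feed until at least `expected` results
    // show up, instead of asserting on the very first response.
    func waitForChanges(user string, expected, maxRetries int) ([]ChangeEntry, error) {
        var results []ChangeEntry
        for i := 0; i < maxRetries; i++ {
            results = changesForUser(user)
            if len(results) >= expected {
                return results, nil
            }
            time.Sleep(100 * time.Millisecond)
        }
        return results, fmt.Errorf("expected %d changes, got %d after %d retries",
            expected, len(results), maxRetries)
    }

    func main() {
        changes, err := waitForChanges("zegpold", 2, 10)
        fmt.Println(changes, err)
    }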

tleyden commented 7 years ago

Regarding the previous comment, I've added more "wait for changes" retry loops to that test: 5c46825d7bc0904da7fd8616622b00bdcf9b4715

Another test failure: https://gist.github.com/tleyden/814fcad420a21afcc5d26a6a5f5a134c

--- FAIL: TestContinuousChangesBackfill (4.34s)
panic: changeCache: Sequence 13 never showed up! [recovered]
    panic: changeCache: Sequence 13 never showed up!

goroutine 805 [running]:
panic(0x55ee40, 0xc420204df0)
    /usr/local/go/src/runtime/panic.go:500 +0x1a1
testing.tRunner.func1(0xc42027e840)
    /usr/local/go/src/testing/testing.go:579 +0x25d
panic(0x55ee40, 0xc420204df0)
    /usr/local/go/src/runtime/panic.go:458 +0x243
github.com/couchbase/sync_gateway/db.(*changeCache).waitForSequence(0xc4202e8e10, 0xd)
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/change_cache.go:289 +0x192
github.com/couchbase/sync_gateway/db.(*changeCache).waitForSequenceID(0xc4202e8e10, 0x0, 0x0, 0x0, 0xd, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/change_cache.go:274 +0x35
github.com/couchbase/sync_gateway/db.TestContinuousChangesBackfill(0xc42027e840)
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/change_cache_test.go:430 +0x8d6
testing.tRunner(0xc42027e840, 0x6d9770)
    /usr/local/go/src/testing/testing.go:610 +0x81
created by testing.(*T).Run
    /usr/local/go/src/testing/testing.go:646 +0x2ec
FAIL    github.com/couchbase/sync_gateway/db    10.108s
tleyden commented 7 years ago

re:

Is DCP Close() async? If so, then when we flush the bucket we get a DCP rollback.

I saw this in the cbdatasource.go code:

    // TODO: By this point, worker goroutines may still be going, but
    // should end soon.  Instead Close() should be 100% synchronous.
    return nil

Looks like in the current master branch it hasn't changed:

https://github.com/couchbase/go-couchbase/blob/master/cbdatasource/cbdatasource.go#L1677-L1679

tleyden commented 7 years ago

I filed https://issues.couchbase.com/browse/MB-26262 to see if the cbdatasource maintainers think that's an issue.

tleyden commented 7 years ago

Latest failure: https://gist.github.com/tleyden/daa6bdb9c4320dc9157cb0bfa2fdf4e0

tleyden commented 7 years ago

This failure demonstrates how the DCP errors from the previous test show up during the test cleanup of the following test:

https://gist.github.com/tleyden/b046d9a6780adce6e574cfa9680db273

tleyden commented 7 years ago

In 1c2088e84ab344ee946b5dd7bd431d81282487f0 I added a DCP receiver ID to try to pinpoint which DCP receiver was getting the callbacks.

From https://gist.github.com/tleyden/c4077efc194ab67fa0a86a30f89b49d3 it's clear that the DCP receiver from the previous test is the one getting the DCP rollback callbacks.

Also, I'm able to repro it quickly and reliably using TestReproduceTestResidue().

tleyden commented 7 years ago

These results demonstrate the interleaving of the DCP receivers: https://gist.github.com/tleyden/1006df094374f6d218f61ff859f11667#file-gistfile1-txt-L355-L358

tleyden commented 7 years ago

It looks like cbdatasource.Close() is never even getting called during TestReproduceTestResidue().

Working backwards up the stack:

  1. In dcp_feed.go#StartDCPFeed(), args.Terminator is nil, so it does not start the goroutine that listens for the channel to be closed and calls cbdatasource.Close().
  2. In change_listener.go#Start(), it creates an sgbucket.FeedArguments struct with no Terminator, which is why Terminator is nil in dcp_feed.go#StartDCPFeed() (see the sketch below).
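
A minimal sketch of the terminator wiring described above; the types are simplified stand-ins for sgbucket.FeedArguments and the cbdatasource feed, not the actual Sync Gateway code:

    package main

    import "fmt"

    // feedArguments mirrors the idea behind sgbucket.FeedArguments: if Terminator
    // is nil, nothing ever triggers a Close() on the data source.
    type feedArguments struct {
        Terminator chan bool
    }

    // dataSource stands in for the cbdatasource bucket data source.
    type dataSource struct{}

    func (ds *dataSource) Close() error {
        fmt.Println("bucketDataSource.Close() called")
        return nil
    }

    // startDCPFeed only spawns the shutdown goroutine when a Terminator is
    // supplied; with a nil Terminator nothing ever calls ds.Close(), which is the
    // leak described above.  The returned channel is closed once Close() has run.
    func startDCPFeed(args feedArguments, ds *dataSource) <-chan struct{} {
        closed := make(chan struct{})
        if args.Terminator == nil {
            return closed // never closed: the feed outlives the test
        }
        go func() {
            <-args.Terminator // closed by the change listener on shutdown
            ds.Close()
            close(closed)
        }()
        return closed
    }

    func main() {
        terminator := make(chan bool)
        closed := startDCPFeed(feedArguments{Terminator: terminator}, &dataSource{})
        close(terminator) // test teardown
        <-closed          // Close() has been called
    }
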
tleyden commented 7 years ago

Still seeing failures even after the fix for #2952: https://gist.github.com/tleyden/4789cdac626589f3c7b3ccdeea6714ba

tleyden commented 7 years ago

Still seeing DCP receiver callback residue. I believe some of the tests are missing cleanup.

[screenshot from 2017-10-04, 3:15 PM]

tleyden commented 7 years ago

I ran the full suite from the CLI on the Jenkins slave and ran into a failure in TestLowSequenceHandling:

https://gist.github.com/tleyden/9c48a73971f2a66abef125797414b784#file-gistfile1-txt-L95-L101

Looks like this test also needs the "changes feed retry loop".

tleyden commented 7 years ago

The last test run failed with the following logs:

2017-10-05T11:02:54.006-07:00 WARNING: Error flushing bucket: {"_":"Flush failed with unexpected error. Check server logs for details."}  Will retry. -- base.(*TestBucketManager).EmptyTestBucket.func1() at util_testing.go:287
2017-10-05T11:02:57.910-07:00 WARNING: Error inserting key to recently flushed bucket: temporary failure occurred, try again later  Retrying. -- base.(*TestBucketManager).EmptyTestBucket.func3() at util_testing.go:370
2017-10-05T11:02:58.010-07:00 WARNING: Error inserting key to recently flushed bucket: temporary failure occurred, try again later  Retrying. -- base.(*TestBucketManager).EmptyTestBucket.func3() at util_testing.go:370
2017-10-05T11:02:58.211-07:00 WARNING: Error inserting key to recently flushed bucket: temporary failure occurred, try again later  Retrying. -- base.(*TestBucketManager).EmptyTestBucket.func3() at util_testing.go:370
2017-10-05T11:02:58.612-07:00 WARNING: Error inserting key to recently flushed bucket: temporary failure occurred, try again later  Retrying. -- base.(*TestBucketManager).EmptyTestBucket.func3() at util_testing.go:370
2017-10-05 11:02:59.413392 I | workerReadOwnWrites completed without errors
2017-10-05T11:02:59.415-07:00 GoCBCustomSGTranscoder Opening Couchbase database test_data_bucket on <http://integration-cbs.sc.couchbase.com:8091> as user "test_data_bucket"
2017-10-05T11:03:04.561-07:00 WARNING: RetryLoop for CouchbaseBucketGoCB Set() giving up after 11 attempts -- base.RetryLoop() at util.go:298

I'm running cbcollect_info to try to find out why these temporary failures are happening.
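
For context, the post-flush writes above are being retried with a backoff loop along these general lines; this is a sketch only, not the actual base.RetryLoop / EmptyTestBucket code:

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    var errTempFail = errors.New("temporary failure occurred, try again later")

    // setKey stands in for a KV write against a just-flushed bucket; the first
    // few attempts fail with a temporary error while the bucket warms back up.
    func setKey(attempt int) error {
        if attempt < 4 {
            return errTempFail
        }
        return nil
    }

    // retryWithBackoff retries temporary failures with exponentially increasing
    // sleeps, the same shape as the "Retrying" warnings in the log above.
    func retryWithBackoff(maxAttempts int, initialSleep time.Duration) error {
        sleep := initialSleep
        for attempt := 0; attempt < maxAttempts; attempt++ {
            err := setKey(attempt)
            if err == nil {
                return nil
            }
            if !errors.Is(err, errTempFail) {
                return err // not retryable
            }
            fmt.Printf("attempt %d failed: %v -- retrying in %v\n", attempt, err, sleep)
            time.Sleep(sleep)
            sleep *= 2
        }
        return fmt.Errorf("giving up after %d attempts", maxAttempts)
    }

    func main() {
        fmt.Println(retryWithBackoff(11, 100*time.Millisecond))
    }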

tleyden commented 7 years ago

From @jamesmauss in support:

looks like you were hitting https://issues.couchbase.com/browse/MB-25803 fixed in releases after 3505

I'm upgrading to a more recent version of Couchbase Server.

tleyden commented 7 years ago

I installed Couchbase Server locally on the Jenkins slave where the integration tests were running, to see if it made any difference.

Here are the latest failures:

https://gist.github.com/tleyden/898275de7a946e986de249435aac373d#file-gistfile1-txt-L247-L292

I also ran cbcollect_info to see if it reveals anything.

tleyden commented 7 years ago

From these logs it looks like sg-replicate replicators are started in some tests and never shut down:

https://gist.github.com/tleyden/1de5f235f5163acba06fab7fb58a8de4

tleyden commented 7 years ago

While running locally I saw the following failure:

https://gist.github.com/tleyden/f65bd202f4e79503da3308e02e552d19

From the goroutine dump, it looks like gocb connections aren't being cleaned up, since the dump contains thousands of gocb-related stacks. That might explain some of the temporary failures.

goroutine 3075 [select, 277 minutes]:
net/http.(*persistConn).writeLoop(0xc4205e8400)
    /usr/local/go/src/net/http/transport.go:1646 +0x3bd
created by net/http.(*Transport).dialConn
    /usr/local/go/src/net/http/transport.go:1063 +0x50e

goroutine 3512 [chan receive, 277 minutes]:
gopkg.in/couchbase/gocbcore%2ev7.(*memdPipelineClient).ioLoop.func1(0xc42034aca0, 0xf, 0xc4204b4500, 0xc420b40190, 0xc420d80000)
    /Users/tleyden/Development/sync_gateway/godeps/src/gopkg.in/couchbase/gocbcore.v7/memdpipeline.go:192 +0x129
created by gopkg.in/couchbase/gocbcore%2ev7.(*memdPipelineClient).ioLoop
    /Users/tleyden/Development/sync_gateway/godeps/src/gopkg.in/couchbase/gocbcore.v7/memdpipeline.go:205 +0xfb

goroutine 3069 [GC assist wait]:
gopkg.in/couchbase/gocbcore%2ev7.(*memdTcpConn).ReadPacket(0xc420688370, 0xc420ba5030, 0xf, 0x2)
    /Users/tleyden/Development/sync_gateway/godeps/src/gopkg.in/couchbase/gocbcore.v7/memdconn.go:153 +0xd5
gopkg.in/couchbase/gocbcore%2ev7.(*memdClient).run.func2(0xc4206883c0, 0xc4203b42a0, 0xc4203ba540)
    /Users/tleyden/Development/sync_gateway/godeps/src/gopkg.in/couchbase/gocbcore.v7/memdclient.go:147 +0xdf
created by gopkg.in/couchbase/gocbcore%2ev7.(*memdClient).run
    /Users/tleyden/Development/sync_gateway/godeps/src/gopkg.in/couchbase/gocbcore.v7/memdclient.go:198 +0xce

I'm going to rerun and keep an eye on the total number of server connections reported by Couchbase Server.

tleyden commented 7 years ago

Result of running TestReproduceTestResidue and dumping goroutines:

https://gist.github.com/tleyden/d5b92053b98f12dbe0153959d20061d4

tleyden commented 7 years ago

It looks like a lot of the resource leakage from TestReproduceTestResidue is due to the fact that this goroutine never seems to exit:

func (db *Database) SimpleMultiChangesFeed(chans base.Set, options ChangesOptions) (<-chan *ChangeEntry, error) {
    to := ""
    if db.user != nil && db.user.Name() != "" {
        to = fmt.Sprintf("  (to %s)", db.user.Name())
    }

    base.LogTo("Changes", "MultiChangesFeed(channels: %s, options: %+v) ... %s", chans, options, to)
    output := make(chan *ChangeEntry, 50)

    go func() {
        defer func() {
            base.LogTo("Changes", "MultiChangesFeed done %s", to)
            close(output)
        }()

I'm not seeing "MultiChangesFeed done" in the logs anywhere.

tleyden commented 7 years ago

After c00b11ac4fffca1af01b8a4e7916ac6e118dafdd, the change_listener.go:203 lines no longer accumulate in the goroutine stack dumps, and "MultiChangesFeed done" shows up in the logs. Consider this change a proof of concept to open a discussion, not necessarily the best approach to the fix.

New logs: https://gist.github.com/tleyden/d4825dcb31d55421307a031f286bf1a2

tleyden commented 7 years ago

I'm noticing that background goroutines started by change_cache.go are hanging around longer than necessary. From https://gist.github.com/tleyden/d4825dcb31d55421307a031f286bf1a2:

goroutine 181 [sleep]:
time.Sleep(0xdf8475800)
    /usr/local/go/src/runtime/time.go:59 +0xe1
github.com/couchbase/sync_gateway/db.(*changeCache).Init.func2(0xc4200dc2d0)
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/change_cache.go:126 +0x43
created by github.com/couchbase/sync_gateway/db.(*changeCache).Init
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/change_cache.go:130 +0x285
tleyden commented 7 years ago

I cherry-picked the commits from #2962 onto the branch for this issue. After those changes, plus the changes mentioned in https://github.com/couchbase/sync_gateway/issues/2938#issuecomment-334879394, all of the goroutines left hanging around after running SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false ./test.sh -run TestReproduceTestResidue are basically gocb-related:

https://gist.github.com/tleyden/ef4ce9ed67dee5c12a745de8b565842d#file-gistfile1-txt-L3416-L4663

tleyden commented 7 years ago

Filed https://issues.couchbase.com/browse/GOCBC-236

tleyden commented 7 years ago

After the fix for https://github.com/couchbase/sync_gateway/issues/2965, the full suite runs for me locally when I run it package by package:

echo "test-rest"; SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false go test -timeout 180m -v github.com/couchbase/sync_gateway/rest && echo "test-auth"; SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false go test -timeout 180m -v github.com/couchbase/sync_gateway/auth && echo "test-base"; SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false go test -timeout 180m -v github.com/couchbase/sync_gateway/base && echo "test-channels"; SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false go test -timeout 180m -v github.com/couchbase/sync_gateway/channels && echo "test-db"; SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false go test -timeout 180m -v github.com/couchbase/sync_gateway/db

However, when I run via SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false ./test.sh, I'm seeing failures: https://gist.github.com/4b76950411f0f554a03452d00b4ec43d. Still investigating.

tleyden commented 7 years ago

After cleaning up quite a few places where buckets were left open, I'm still seeing DCP rollback errors and some failing tests:

https://gist.github.com/tleyden/f4ff3a3c2e95a8708559be7de5c174d2

tleyden commented 7 years ago

Another rollback-related failure:

SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false ./test.sh

https://gist.github.com/tleyden/82b3c5184d6724dd4210bd9c598a0b71

tleyden commented 7 years ago

On a Jenkins slave, it failed trying to flush the bucket and read its own writes:

https://gist.github.com/tleyden/58df24fd97b387db63f4bbad3b92d0d3

tleyden commented 7 years ago

Logs with more DCP logging enabled: https://gist.github.com/tleyden/d0038d3103b95bb8bba4a791819a09d3

Log snippet:

2017-10-11T15:23:22.811-07:00 WARNING: Using DCP Logging Receiver.  Receiver: 11 -- base.NewDCPReceiver() at dcp_feed.go:112
2017-10-11 15:23:22.811425 I | Starting dcp feed.  args.Backfill: 18446744073709551615
2017-10-11 15:23:22.811430 I | For non-backfill, use vbucket uuids, high sequence numbers
2017-10-11 15:23:22.812057 I | Seeding seqnos: map[59:2 32:1]
....
2017-10-11T15:23:23.143-07:00 WARNING: Got unrecoverable error <nil> trying to get key _sync:user:naomi.  Not going to retry. -- base.CouchbaseBucketGoCB.Get.func2() at bucket_gocb.go:303
2017-10-11T15:23:23.143-07:00 WARNING: Got unrecoverable error <nil> trying to get key _sync:user:naomi.  Not going to retry. -- base.CouchbaseBucketGoCB.Get.func2() at bucket_gocb.go:303
2017-10-11T15:23:23.737-07:00 WARNING: DCP RollbackEx request - rolling back DCP feed for: vbucketId: 59, rollbackSeq: 0.  Receiver: 11 -- base.(*DCPReceiver).RollbackEx() at dcp_feed.go:254
2017-10-11T15:23:23.742-07:00 WARNING: DCP RollbackEx request - rolling back DCP feed for: vbucketId: 49, rollbackSeq: 0.  Receiver: 11 -- base.(*DCPReceiver).RollbackEx() at dcp_feed.go:254
2017-10-11T15:23:23.747-07:00 WARNING: DCP RollbackEx request - rolling back DCP feed for: vbucketId: 32, rollbackSeq: 0.  Receiver: 11 -- base.(*DCPReceiver).RollbackEx() at dcp_feed.go:254
2017-10-11T15:23:23.766-07:00 WARNING: DCP RollbackEx request - rolling back DCP feed for: vbucketId: 8, rollbackSeq: 0.  Receiver: 11 -- base.(*DCPReceiver).RollbackEx() at dcp_feed.go:254
2017-10-11T15:23:23.768-07:00 WARNING: DCP RollbackEx request - rolling back DCP feed for: vbucketId: 5, rollbackSeq: 0.  Receiver: 11 -- base.(*DCPReceiver).RollbackEx() at dcp_feed.go:254

It's interesting that it gets a "rolling back DCP feed for: vbucketId: 49, rollbackSeq: 0" message, even though the seeded seqnos map didn't include vbucket 49.

tleyden commented 7 years ago

I ran a wireshark capture during this run: https://gist.github.com/tleyden/54b5028c57ff22d873a182a1d14fdbe2

The main thing that jumped out at me is that the response for the vbucket seq numbers shows some very high numbers:

[screenshot from 2017-10-11, 5:06 PM]

and I was expecting these to be 0, or to see a retry loop until they reach 0.

Also, I found some code that I'd added to flush the bucket inside the loop that waits for the vb seq numbers to reach 0, and I think it's better to remove that:

        _, highSeqnos, err := GetStatsVbSeqno(stats, maxVbno, false)

        if err != nil {
            return false, err, nil
        }

        for vbNo := uint16(0); vbNo < maxVbno; vbNo++ {
            maxSeqForVb := highSeqnos[vbNo]
            if maxSeqForVb > 0 {
                log.Printf("max seq for vb %d > 0 (%d).  Flusing and retrying", vbNo, maxSeqForVb)
                err = tbm.BucketManager.Flush()
                if err != nil {
                    Warn("Error flushing bucket: %v", err)
                }

                return true, nil, nil
            }
        }
tleyden commented 7 years ago

@adamcfraser When calling GetStatsVbSeqno(), it's currently passing useAbsHighSeqNo as false. Should it be passing true? I'm having trouble remembering what that flag means, and I can't find any doc comments on it.

adamcfraser commented 7 years ago

That flag determines whether we use the vb_nn:abs_high_seqno or the vb_nn:high_seqno property from the stats response. Originally added with https://github.com/couchbase/sync_gateway/commit/3c3a0739afe402565ce17893bc5dddd7ea4a5a32

You could dig into the history there, but if you were using the same stats call to validate the flush (as mentioned above), why is it returning zeros in one case and not the other?
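
To make the distinction concrete, here is a sketch of what such a flag selects; the "vb_<n>:high_seqno" / "vb_<n>:abs_high_seqno" key format is an assumption about the stats response shape, and this is not the actual GetStatsVbSeqno code:

    package main

    import (
        "fmt"
        "strconv"
    )

    // highSeqnos selects either the high_seqno or abs_high_seqno stat for each
    // vbucket, depending on the useAbsHighSeqNo flag.
    func highSeqnos(stats map[string]string, maxVbno uint16, useAbsHighSeqNo bool) (map[uint16]uint64, error) {
        statKey := "high_seqno"
        if useAbsHighSeqNo {
            statKey = "abs_high_seqno"
        }
        result := make(map[uint16]uint64, maxVbno)
        for vb := uint16(0); vb < maxVbno; vb++ {
            raw, ok := stats[fmt.Sprintf("vb_%d:%s", vb, statKey)]
            if !ok {
                continue // vbucket not reported by this node
            }
            seq, err := strconv.ParseUint(raw, 10, 64)
            if err != nil {
                return nil, err
            }
            result[vb] = seq
        }
        return result, nil
    }

    func main() {
        stats := map[string]string{"vb_0:high_seqno": "2", "vb_0:abs_high_seqno": "5"}
        fmt.Println(highSeqnos(stats, 1, false)) // map[0:2] <nil>
        fmt.Println(highSeqnos(stats, 1, true))  // map[0:5] <nil>
    }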

tleyden commented 7 years ago

Another run after making some fixes, one of which made it stop reusing the (stateful) sleeper function, which looks like it was preventing certain retries from happening when they should have.

The most salient aspect of this run is that I'm seeing an unexpected bucket flush happening during the run of the TestUserDeleteDuringChangesWithAccess test, which starts at 2017-10-11 17:19:56.160616:

[screenshot from 2017-10-11, 5:57 PM]
tleyden commented 7 years ago

In the last run there are signs that functionality from other tests is bleeding into the failing test.

In frame 132415 of the capture, there is a doc mutation on doc-5, which overlaps with this in the logs:

2017-10-11 18:11:20.745273 I | max seq for vb 59 > 0 (2).  Retrying
2017-10-11 18:11:20.756700 I | max seq for vb 59 > 0 (2).  Retrying
2017-10-11 18:11:20.782237 I | max seq for vb 59 > 0 (2).  Retrying
2017-10-11 18:11:20.825898 I | max seq for vb 59 > 0 (2).  Retrying
2017-10-11 18:11:20.909486 I | max seq for vb 20 > 0 (1).  Retrying
2017-10-11 18:11:21.073835 I | max seq for vb 20 > 0 (1).  Retrying
2017-10-11 18:11:21.399066 I | max seq for vb 20 > 0 (1).  Retrying
2017-10-11 18:11:22.043657 I | max seq for vb 48 > 0 (1).  Retrying
2017-10-11 18:11:23.327073 I | max seq for vb 48 > 0 (1).  Retrying
2017-10-11 18:11:25.891282 I | max seq for vb 5 > 0 (1).  Retrying
2017-10-11 18:11:31.016160 I | max seq for vb 59 > 0 (2).  Retrying
2017-10-11 18:11:41.260437 I | max seq for vb 59 > 0 (2).  Retrying
2017-10-11 18:12:01.746042 I | max seq for vb 5 > 0 (3).  Retrying
2017-10-11T18:12:01.746-07:00 WARNING: RetryLoop for Wait for vb sequence numbers to be 0 giving up after 13 attempts -- base.RetryLoop() at util.go:298
tleyden commented 7 years ago

TestStopChangeCache appears to be leaking changeListeners. If run in a loop, it consistently fails the second time.

Changing:

    //// Use leaky bucket to have the tap feed 'lose' document 3
    leakyConfig := base.LeakyBucketConfig{
        TapFeedMissingDocs: []string{"doc-3"},
    }
    db := setupTestLeakyDBWithCacheOptions(t, cacheOptions, leakyConfig)

to


    db := setupTestDBWithCacheOptions(t, cacheOptions)

fixes the issue.

Failing stacktrace:

goroutine 33 [chan receive]:
github.com/couchbase/sync_gateway/base.(*LeakyBucket).wrapFeed.func1(0x905380, 0xc42038d550, 0xc42038d540, 0xc420386220)
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/base/leaky_bucket.go:211 +0xab
created by github.com/couchbase/sync_gateway/base.(*LeakyBucket).wrapFeed
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/base/leaky_bucket.go:217 +0x126

goroutine 114 [chan receive]:
github.com/couchbase/sync_gateway/db.(*changeListener).StartMutationFeed.func1(0xc4204620b0)
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/change_listener.go:82 +0xc7
created by github.com/couchbase/sync_gateway/db.(*changeListener).StartMutationFeed
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/db/change_listener.go:85 +0x2f8
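
For reference, a feed-forwarding goroutine of the wrapFeed shape can avoid this kind of leak by also selecting on a terminator channel. This is a generic sketch of the pattern, not the actual leaky_bucket.go code:

    package main

    import "fmt"

    type feedEvent struct{ Key string }

    // wrapFeed forwards events from an inner feed channel to an outer one, the
    // same shape as LeakyBucket.wrapFeed.  Selecting on a terminator, and exiting
    // when the inner channel closes, keeps the forwarding goroutine from leaking
    // when the test never drains or closes the feed.
    func wrapFeed(inner <-chan feedEvent, terminator <-chan struct{}) <-chan feedEvent {
        outer := make(chan feedEvent)
        go func() {
            defer close(outer)
            for {
                select {
                case ev, ok := <-inner:
                    if !ok {
                        return // inner feed closed
                    }
                    outer <- ev
                case <-terminator:
                    return // test teardown
                }
            }
        }()
        return outer
    }

    func main() {
        inner := make(chan feedEvent, 1)
        terminator := make(chan struct{})
        outer := wrapFeed(inner, terminator)

        inner <- feedEvent{Key: "doc-3"}
        fmt.Println(<-outer)

        close(terminator) // without this, the forwarding goroutine blocks forever
    }
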
tleyden commented 7 years ago

Logs from the last run: https://gist.github.com/tleyden/5f4904220189e49de6a344a516988e16

In several places it's detecting residual changeListener goroutines:

--- PASS: TestPurgeWithGreaterThanOneRevision (0.57s)
=== RUN   TestPurgeWithNonStarRevision
2017-10-12 21:21:45.454392 I | bucketDataSource.Close() called
2017-10-12T21:21:55.736-07:00 WARNING: RetryLoop for Retry StackTraceContainsPatterns giving up after 11 attempts -- base.RetryLoop() at util.go:298
2017-10-12 21:21:55.741769 I | Stacktrace with unexpected pattern: changeListener.  Stacktrace: goroutine 4380 [running]:
github.com/couchbase/sync_gateway/base.AssertStackTraceDoesntContainPatterns(0xc4200a2540, 0xc42015acd0, 0x1, 0x1)
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/base/util_testing.go:566 +0xd5
github.com/couchbase/sync_gateway/base.AssertStackTraceDoesntContainProblematicPatterns(0xc4200a2540)
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/base/util_testing.go:556 +0x4f
github.com/couchbase/sync_gateway/rest.TestPurgeWithNonStarRevision(0xc4200a2540)
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbase/sync_gateway/rest/admin_api_test.go:1138 +0x40
testing.tRunner(0xc4200a2540, 0x956988)
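
The check itself can be sketched roughly like this; a simplified approximation of the AssertStackTraceDoesntContainPatterns helper, not the actual util_testing.go implementation:

    package main

    import (
        "fmt"
        "runtime"
        "strings"
    )

    // stackTracePatterns dumps all goroutine stacks and reports which of the
    // given patterns (e.g. "changeListener") still appear, i.e. which goroutines
    // a test has left behind.
    func stackTracePatterns(patterns ...string) []string {
        buf := make([]byte, 1<<20)
        n := runtime.Stack(buf, true) // true: include all goroutines
        stack := string(buf[:n])

        var found []string
        for _, p := range patterns {
            if strings.Contains(stack, p) {
                found = append(found, p)
            }
        }
        return found
    }

    func main() {
        // In the tests this kind of check runs (inside a retry loop) at the end
        // of each test, failing if residual goroutines are detected.
        if leftovers := stackTracePatterns("changeListener", "LeakyBucket"); len(leftovers) > 0 {
            fmt.Println("Stacktrace with unexpected patterns:", leftovers)
        }
    }
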
tleyden commented 7 years ago

Timestamp anomaly

I just noticed this timestamp anomaly when running:

$ SG_TEST_BACKING_STORE=Couchbase SG_TEST_COUCHBASE_SERVER_URL=http://localhost:8091 SG_TEST_USE_XATTRS=false ./test.sh -run ^TestShadowerPatternRepeat$

Logs: note that the timestamp goes backwards, from 2017-10-16T14:57:54.351-07:00 to 2017-10-16T14:55:00.824-07:00:

--- PASS: TestShadowerPattern (1.31s)
2017-10-16 14:57:54.351853 I | bucketDataSource.Close() called
2017-10-16 14:57:54.351890 I | bucketDataSource.Close() called
2017-10-16T14:57:54.351-07:00 WARNING: Error processing DCP stream - will attempt to restart/reconnect: pkt.Receive, err: read tcp 127.0.0.1:63612->127.0.0.1:11210: use of closed network connection.  Receiver ID: 60 -- base.(*DCPReceiver).OnError() at dcp_feed.go:128
2017-10-16T14:57:54.351-07:00 WARNING: Error processing DCP stream - will attempt to restart/reconnect: pkt.Receive, err: read tcp 127.0.0.1:63611->127.0.0.1:11210: use of closed network connection.  Receiver ID: 59 -- base.(*DCPReceiver).OnError() at dcp_feed.go:128
PASS
ok      github.com/couchbase/sync_gateway/db    174.584s

=== RUN   TestUserAPI
2017-10-16T14:55:00.824-07:00 NOT Using xattrs: strings.ToLower(useXattrs) != strings.ToLower(TestEnvSyncGatewayTrue).  |false| != |true|
2017-10-16 14:55:00.828210 I | Flushing test bucket: test_data_bucket
2017-10-16 14:55:00.828235 I | BucketManager.Flush() called.  bm: 0xc42033c900
2017-10-16 14:55:01.260284 I | workerReadOwnWrites completed without errors
2017-10-16T14:55:01.260-07:00 GoCBCustomSGTranscoder Opening Couchbase database test_data_bucket on <http://localhost:8091> as user "test_data_bucket"
2017-10-16T14:55:01.262-07:00 NOT Using xattrs: strings.ToLower(useXattrs) != strings.ToLower(TestEnvSyncGatewayTrue).  |false| != |true|
2017-10-16T14:55:01.262-07:00 NOT Using xattrs: strings.ToLower(useXattrs) != strings.ToLower(TestEnvSyncGatewayTrue).  |false| != |true|
2017-10-16T14:55:01.262-07:00 Opening db /db as bucket "test_data_bucket", pool "default", server <http://localhost:8091>
tleyden commented 7 years ago

I posted to golang-nuts asking for help. If go test is running the test binaries for different packages in parallel against the same bucket, that would explain a lot of the mysterious failures when running the full test suite, as well as the fact that it appears to work when run on a package-by-package basis.

tleyden commented 7 years ago

The fix: when running tests in integration mode, the test script will need to pass the -p 1 flag to go test, so that test binaries for different packages are not run in parallel against the shared bucket.

tleyden commented 7 years ago

Latest failure, even with the -p 1 flag:

=== RUN   TestChangeIndexGetChanges
2017-10-16 16:45:11.330209 I | Flushing test bucket: test_data_bucket
2017-10-16 16:45:11.330236 I | BucketManager.Flush() called.  bm: 0xc420b82420
2017-10-16 16:45:11.796675 I | workerReadOwnWrites completed without errors
2017-10-16 16:45:11.799693 I | Flushing test bucket: test_indexbucket
2017-10-16 16:45:11.799709 I | BucketManager.Flush() called.  bm: 0xc420b839b0
2017-10-16 16:45:12.390583 I | workerReadOwnWrites completed without errors
panic: runtime error: index out of range

goroutine 6128 [running]:
panic(0x860d20, 0xc420016130)
    /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/couchbaselabs/sync-gateway-accel/indexwriter.ReorderPending(0xc4200115c0, 0xc4200117a0, 0xc420390040)
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbaselabs/sync-gateway-accel/indexwriter/kv_change_index_writer.go:535 +0x387
created by github.com/couchbaselabs/sync-gateway-accel/indexwriter.(*KvChangeIndexWriter).Init
    /Users/tleyden/Development/sync_gateway/godeps/src/github.com/couchbaselabs/sync-gateway-accel/indexwriter/kv_change_index_writer.go:114 +0x174
exit status 2
FAIL    github.com/couchbaselabs/sync-gateway-accel/indexwriter 108.005s
?       github.com/couchbaselabs/sync-gateway-accel/sg_accel    [no test files]
tleyden commented 7 years ago

Unexpected vbnos: https://gist.github.com/tleyden/16f050051ddcff7b0f5795aee41e4a47. Not sure what's causing this.