couchbase / sync_gateway

Manages access and synchronization between Couchbase Lite and Couchbase Server
https://www.couchbase.com/products/sync-gateway
Other
448 stars 138 forks source link

sg_replicate data races #1763

Closed tleyden closed 8 years ago

tleyden commented 8 years ago

In this build there was a data race detected:

12:26:26 ==================
12:26:26 WARNING: DATA RACE
12:26:26 Write by goroutine 91:
12:26:26   github.com/couchbaselabs/sg-replicate.stateFnActiveFetchCheckpoint()
12:26:26       /home/couchbase/jenkins/workspace/sgw-unix-build/1.3.0/community/godeps/src/github.com/couchbaselabs/sg-replicate/replication_state.go:50 +0x949
12:26:26   github.com/couchbaselabs/sg-replicate.(*Replication).processEvents()
12:26:26       /home/couchbase/jenkins/workspace/sgw-unix-build/1.3.0/community/godeps/src/github.com/couchbaselabs/sg-replicate/synctube.go:130 +0xa3
12:26:26 
12:26:26 Previous read by goroutine 15:
12:26:26   github.com/couchbaselabs/sg-replicate.(*Replication).GetParameters()
12:26:26       <autogenerated>:24 +0xcc
12:26:26   github.com/couchbase/sync_gateway/base.populateActiveTaskFromReplication()
12:26:26       /home/couchbase/jenkins/workspace/sgw-unix-build/1.3.0/community/godeps/src/github.com/couchbase/sync_gateway/base/replicator.go:203 +0x4c
12:26:26   github.com/couchbase/sync_gateway/base.(*Replicator).Replicate()
12:26:26       /home/couchbase/jenkins/workspace/sgw-unix-build/1.3.0/community/godeps/src/github.com/couchbase/sync_gateway/base/replicator.go:54 +0x2ec
12:26:26   github.com/couchbase/sync_gateway/rest.(*handler).handleReplicate()

Also, I enabled -race on the sg-replicate repo and it found a different data race:

2016/05/09 18:44:20 TEST: Got CANCELLED
==================
WARNING: DATA RACE
Write by goroutine 15:
 github.com/tleyden/fakehttp.(*HTTPServer).ServeHTTP()
     /drone/src/github.com/tleyden/fakehttp/fakehttp.go:108 +0x470
 net/http.serverHandler.ServeHTTP()
     /usr/local/go/src/net/http/server.go:2081 +0x206
 net/http.(*conn).serve()
     /usr/local/go/src/net/http/server.go:1472 +0x1565

Previous read by goroutine 8:
 github.com/couchbaselabs/sg-replicate.TestNoOpContinuousReplication()
     /drone/src/github.com/couchbaselabs/sg-replicate/continuous_replication_test.go:123 +0x474
 testing.tRunner()
     /usr/local/go/src/testing/testing.go:473 +0xdc

Goroutine 15 (running) created at:
 net/http.(*Server).Serve()
     /usr/local/go/src/net/http/server.go:2137 +0x4d1
 net/http.Serve()
     /usr/local/go/src/net/http/server.go:1976 +0xce
tleyden commented 8 years ago

Here is a test failure capture:

https://gist.github.com/tleyden/07b586056db42746c528d5c6c78d8d75

I'm baffled as to why the longpoll changes request doesn't return a doc.

I noticed that it the longpoll changes feed was started at the same time as the doc was added

Could this be hitting a subtle race condition?

tleyden commented 8 years ago

I should mention that the test only fails sporadically, which might be indicative of a race condition.

tleyden commented 8 years ago

I added an attempted workaround to sleep for 5 seconds before pushing the document:

    # Kick off continuous replication
    sg1.start_push_replication(
        sg2.admin.admin_url,
        DB1,
        DB2,
        continuous=True,
        use_remote_source=True,
        use_admin_url=True
    )

    # Sleep for a while -- attempt to workaround https://github.com/couchbase/sync_gateway/issues/1763#issuecomment-219901067
    time.sleep(5)

    # Add docs
    doc_id = sg1_user.add_doc()
    logging.debug("Added doc {} to sg1, waiting til it syncs on sg2".format(doc_id))

    # Wait til all docs sync to target
    wait_until_docs_sync(sg2_user, [doc_id])
    logging.debug("doc {} sync'd to sg2".format(doc_id))

Here is the capture from a successful run:

https://gist.github.com/tleyden/78b3c24b39f9ea6067248a429b28b930

Longpoll changes response with doc

tleyden commented 8 years ago

The workaround above seems to be reliably working around the problem. I have run the test in a loop 20+ times and it hasn't produced any failures.

Also, I was able to reproduce this issue against the master branch of sync gateway:

http://uberjenkins.sc.couchbase.com:8080/view/QE%20Dev/job/syncgateway-functional-tests-dev/49/

meaning that it was not caused by the changes for this issue, and should not block the PR.

adamcfraser commented 8 years ago

Fixed by #1781