cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: interleavedpartitioned failed on release-2.1 #28921

Closed: cockroach-teamcity closed this issue 6 years ago

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/dab7a982c4aea0439df8cadaaa889c2e0db9609b

Parameters:

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=852802&tab=buildLog

    test.go:494,cluster.go:1104,interleavedpartitioned.go:130,interleavedpartitioned.go:138: /home/agent/work/.go/bin/roachprod run teamcity-852802-interleavedpartitioned:4 -- ./workload run interleavedpartitioned --east-zone-name us-east4-b --west-zone-name us-west1-b --central-zone-name us-central1-a --local=false --customers-per-session 2 --devices-per-session 2 --variants-per-session 5 --parameters-per-session 1 --queries-per-session 1 --insert-percent 80 --insert-local-percent 100 --retrieve-percent 10 --retrieve-local-percent 100 --update-percent 10 --update-local-percent 100 --duration 10m --histograms logs/stats.json {pgurl:4-6}: exit status 1
andreimatei commented 6 years ago

Jordan, are you inheriting this?

jordanlewis commented 6 years ago

The failure is:

east: Error: pq: duplicate key value (session_id)=('E-kTvwizJCWygdTgIgoQHHhgmBDYRAICofZDLypbZUEXrqXciRjavQuTgUWaosTxCYasrKwCpVZMlECYhRRJjSwtguPYQclnbdEJ') violates unique constraint "primary"

Hmm...

andreimatei commented 6 years ago

There was also a failure here, but I didn't look to see what it was. https://teamcity.cockroachdb.com/viewLog.html?buildId=852805&tab=buildResultsDiv&buildTypeId=Cockroach_Nightlies_NightlySuite

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/8a240fa6f241cc9c72da3de64b753e38951d548e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=interleavedpartitioned PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=896885&tab=buildLog

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/fd520679420038f598c97158672fb201d81ecc13

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=interleavedpartitioned PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=911908&tab=buildLog

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/dc0e73c728e533fdb3bec63e53eec174e920ff22

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=interleavedpartitioned PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=952220&tab=buildLog

The test failed on release-2.1:
    test.go:570,cluster.go:1318,interleavedpartitioned.go:130,interleavedpartitioned.go:138: /home/agent/work/.go/bin/roachprod run teamcity-952220-interleavedpartitioned:1 -- ./workload run interleavedpartitioned --east-zone-name us-east4-b --west-zone-name us-west1-b --central-zone-name us-central1-a --local=false --customers-per-session 2 --devices-per-session 2 --variants-per-session 5 --parameters-per-session 1 --queries-per-session 1 --insert-percent 80 --insert-local-percent 100 --retrieve-percent 10 --retrieve-local-percent 100 --update-percent 10 --update-local-percent 100 --duration 10m --histograms logs/stats.json {pgurl:1-3} returned:
        stderr:
        Error: pq: duplicate key value (session_id)=('E-CisZlFVUOYmKIFBcDEgPnFSPvXJAiykNLXFaBLGwAqMpjrgCwuCuVGNYvEwLmJPuaBbSjdVrxeCRBsjiMyjMXdfFqegkAVIRuG') violates unique constraint "primary"
        Error:  exit status 1

        stdout:
        0.4 insert
           4m24s        0            0.0            0.4      0.0      0.0      0.0      0.0 retrieve
           4m24s        0            0.0            0.4      0.0      0.0      0.0      0.0 updates
           4m25s        0            3.0            3.7   2147.5   2281.7   2281.7   2281.7 insert
           4m25s        0            0.0            0.4      0.0      0.0      0.0      0.0 retrieve
           4m25s        0            1.0            0.4    939.5    939.5    939.5    939.5 updates
           4m26s        0            4.0            3.7   2281.7   2550.1   2550.1   2550.1 insert
           4m26s        0            1.0            0.4    704.6    704.6    704.6    704.6 retrieve
           4m26s        0            0.0            0.4      0.0      0.0      0.0      0.0 updates
           4m27s        0            3.0            3.7   2080.4   2281.7   2281.7   2281.7 insert
           4m27s        0            0.0            0.4      0.0      0.0      0.0      0.0 retrieve
           4m27s        0            1.0            0.4    805.3    805.3    805.3    805.3 updates
        : exit status 1
tbg commented 6 years ago

@jordanlewis could you route this to someone to take a look?

jordanlewis commented 6 years ago

I feel like this could be an issue in the load generator. The workers all use separate rngs. Shouldn't they all share an rng?

@BramGruneir / @knz didn't one of you work on this with Emmanuel? Could you take a look?

BramGruneir commented 6 years ago

That shouldn't happen. I'll take a look.


cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/2215217e8ee38d28a14eb9fd2fe9af8b0b702e7d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=interleavedpartitioned PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=958181&tab=buildLog

The test failed on master:
    test.go:575,test.go:587: /home/agent/work/.go/bin/roachprod create teamcity-958181-interleavedpartitioned -n 9 --gce-machine-type=n1-standard-4 --gce-zones=us-west1-b,us-east4-b,us-central1-a --geo returned:
        stderr:

        stdout:
        d --boot-disk-size 10 --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script314600310 --project cockroach-ephemeral]
        Output: Created [https://www.googleapis.com/compute/v1/projects/cockroach-ephemeral/zones/us-east4-b/instances/teamcity-958181-interleavedpartitioned-0004].
        Created [https://www.googleapis.com/compute/v1/projects/cockroach-ephemeral/zones/us-east4-b/instances/teamcity-958181-interleavedpartitioned-0006].
        WARNING: Some requests generated warnings:
         - The resource 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20171002' is deprecated. A suggested replacement is 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20181004'.

        ERROR: (gcloud.compute.instances.create) Could not fetch resource:
         - Quota 'CPUS' exceeded.  Limit: 24.0 in region us-east4.

        : exit status 1
        Cleaning up...
        : exit status 1
tbg commented 6 years ago

@petermattis quota exceeded. Likely the 32-node tests I added have a part in this:

roachprod list
bram-1539209006-interleavedpartitioned:      [gce]  12  (27m40s)
drew-cluster:                                [gce]  1   (5h27m40s)
drew-n1-high-4:                              [gce]  1   (6h27m40s)
jesse-tuning:                                [gce]  12  (43h27m40s)
jordan-geo:                                  [gce]  9   (5h27m40s)
nathan-1539197323-tpcc-nodes-3-w-1000-init:  [gce]  4   (8h27m40s)
nathan-jaeger:                               [gce]  1   (8h27m40s)
teamcity-958181-hibernate:                   [gce]  1   (10h27m40s)
teamcity-958181-import-tpch-nodes-32:        [gce]  32  (10h27m40s)
teamcity-958181-import-tpch-nodes-4:         [gce]  4   (11h27m40s)
teamcity-958181-import-tpch-nodes-8:         [gce]  8   (11h27m40s)
teamcity-958181-jepsen-batch1:               [gce]  6   (12h27m40s)
tim-o-test:                                  [gce]  4   (17h27m40s)
petermattis commented 6 years ago

@tschottdorf Yeah, the 32-node test is likely problematic.

@andreimatei You're mucking with the parallelism logic in roachtest. My long term vision was that roachtest needs to have a more fine-grained notion of cluster resources. Rather than allowing X clusters to run concurrently, roachtest should know that there are Y CPU resources to use in a cluster and take that into consideration when allowing a test to run. The test specs already tell us how many nodes and how many CPUs per node a cluster will use. Hardcoding our GCP quota seems fine in the short term. The rest is a small matter of programming. What do you think?

petermattis commented 6 years ago

PS Note that the teamcity tests run in us-central while manually created clusters run in us-east.

andreimatei commented 6 years ago

Here's my current thinking: right now, the roachtest --parallelism flag controls two things: how many different clusters will exist at once, and how many tests will compete for local CPU resources at once. Mixing these two orthogonal concerns is not great, but completely decoupling them seems hard to me (i.e. to get more resources for the tests themselves, as opposed to the clusters, we'd have to start copying the roachtest binary around and schedule test runs remotely). With this in mind, what I'm trying to do is introduce a scheduler that:

a) can run multiple tests concurrently;
b) respects both limits specified by --parallelism (number of clusters and local resources);
c) takes a test's cluster type and other kinds of cluster affinity (e.g. Jepsen) into account, and reuses clusters to the greatest degree possible.

I'm currently trying a different design than the one you briefly looked at a few days ago: I'm trying a worker pool sized to --parallelism that does affinity-aware work stealing.

If I understand correctly, what you're saying is that we'd want --parallelism to only refer to local resources (maybe we'll call it --max-parallelism), and then have a --quota that caps the sum of resources of all concurrent clusters. I had considered that, but thought we don't quite need it because all the clusters I've seen seem to be roughly the same size. I guess I haven't looked hard enough. I will try to incorporate this. I would only take one type of resource into consideration - CPUs - since multi-dimensional schedulers are hard. And then the simplest thing to do, I guess, is to refuse to create a cluster that would take the resource usage over the limit, and always try to create clusters that fit. Which would mean that small tests will tend to run first and large ones later. I'll see what I can do. hashtag roachborg

petermattis commented 6 years ago

I haven't thought about this enough. Perhaps --parallelism isn't needed if there is quota-based scheduling. What local resources are you worried about limiting that you'd want to keep --parallelism around? I suppose starting 100 remote clusters concurrently might overly stress your local machine, but once the clusters are set up I think they can run tests fine concurrently. I guess what I'm getting at is that --quota would be used to limit the cluster resources and internally roachtest should make sure it isn't doing too much cluster manipulation concurrently. Does that require --parallelism? Maybe. Maybe not.

andreimatei commented 6 years ago

Well, locally, roachtest runs... the tests. Those tests do real work; they don't just perform "cluster manipulation" operations. You want some limit on how many tests run concurrently or they'll starve your CPU. Ideally, a test would declare how CPU-intensive it is. Short of that, the limit can be fairly high by default.

petermattis commented 6 years ago

We can't currently run roachtests locally concurrently because of limitations in roachprod (only 1 local cluster at a time).

andreimatei commented 6 years ago

No, I didn't mean "locally" as in --local. I mean roachtest runs the tests (the t.spec.Run() function) on your machine. You can't have an unbounded number of those running at a time or they'll thrash; there's got to be some limit.

BramGruneir commented 6 years ago

So I did a bit of reworking on the original test failure, but I don't want to derail this discussion. Perhaps we should move it out of this issue.

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/d562a0b4d9a4d640b81a5e1d564e6d4a680fd91a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=interleavedpartitioned PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=968121&tab=buildLog

The test failed on master:
    test.go:575,test.go:587: /home/agent/work/.go/bin/roachprod create teamcity-968121-interleavedpartitioned -n 9 --gce-machine-type=n1-standard-4 --gce-zones=us-west1-b,us-east4-b,us-central1-a --geo returned:
        stderr:

        stdout:
        Creating cluster teamcity-968121-interleavedpartitioned with 9 nodes
        Unable to create cluster:
        in provider: gce: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-1604-xenial-v20181004 --image-project ubuntu-os-cloud --boot-disk-size 10 --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script310566801 --project cockroach-ephemeral]
        Output: ERROR: (gcloud.compute.instances.create) Could not fetch resource:
         - Quota 'CPUS' exceeded. Limit: 24.0 in region us-east4.

        : exit status 1
        Cleaning up...
        : exit status 1
cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/5a373445c0674f060a4bfe369ad290a0cacccb6c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=interleavedpartitioned PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=982644&tab=buildLog

The test failed on master:
    test.go:639,test.go:651: /home/agent/work/.go/bin/roachprod create teamcity-982644-interleavedpartitioned -n 9 --gce-machine-type=n1-standard-4 --gce-zones=us-west1-b,us-east4-b,us-central1-a --geo returned:
        stderr:

        stdout:
        Creating cluster teamcity-982644-interleavedpartitioned with 9 nodes
        Unable to create cluster:
        in provider: gce: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-1604-xenial-v20181004 --image-project ubuntu-os-cloud --boot-disk-size 10 --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script858555705 --project cockroach-ephemeral]
        Output: ERROR: (gcloud.compute.instances.create) Could not fetch resource:
         - Quota 'CPUS' exceeded. Limit: 24.0 in region us-east4.

        : exit status 1
        Cleaning up...
        : exit status 1
BramGruneir commented 6 years ago

Closing this issue, as I've updated the test.
It looks like we're still running into the out-of-CPU issue (and my update to the test will only make that a little worse).

So feel free to reopen, but let's reassign appropriately.