Closed cockroach-teamcity closed 6 years ago
Jordan, are you inheriting this?
The failure is
east: Error: pq: duplicate key value (session_id)=('E-kTvwizJCWygdTgIgoQHHhgmBDYRAICofZDLypbZUEXrqXciRjavQuTgUWaosTxCYasrKwCpVZMlECYhRRJjSwtguPYQclnbdEJ') violates unique constraint "primary"
Hmm...
There was also a failure here, but I didn't look to see what it was. https://teamcity.cockroachdb.com/viewLog.html?buildId=852805&tab=buildResultsDiv&buildTypeId=Cockroach_Nightlies_NightlySuite
SHA: https://github.com/cockroachdb/cockroach/commits/8a240fa6f241cc9c72da3de64b753e38951d548e
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=interleavedpartitioned PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=896885&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/fd520679420038f598c97158672fb201d81ecc13
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=interleavedpartitioned PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=911908&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/dc0e73c728e533fdb3bec63e53eec174e920ff22
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=interleavedpartitioned PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=952220&tab=buildLog
The test failed on release-2.1:
test.go:570,cluster.go:1318,interleavedpartitioned.go:130,interleavedpartitioned.go:138: /home/agent/work/.go/bin/roachprod run teamcity-952220-interleavedpartitioned:1 -- ./workload run interleavedpartitioned --east-zone-name us-east4-b --west-zone-name us-west1-b --central-zone-name us-central1-a --local=false --customers-per-session 2 --devices-per-session 2 --variants-per-session 5 --parameters-per-session 1 --queries-per-session 1 --insert-percent 80 --insert-local-percent 100 --retrieve-percent 10 --retrieve-local-percent 100 --update-percent 10 --update-local-percent 100 --duration 10m --histograms logs/stats.json {pgurl:1-3} returned:
stderr:
Error: pq: duplicate key value (session_id)=('E-CisZlFVUOYmKIFBcDEgPnFSPvXJAiykNLXFaBLGwAqMpjrgCwuCuVGNYvEwLmJPuaBbSjdVrxeCRBsjiMyjMXdfFqegkAVIRuG') violates unique constraint "primary"
Error: exit status 1
stdout:
0.4 insert
4m24s 0 0.0 0.4 0.0 0.0 0.0 0.0 retrieve
4m24s 0 0.0 0.4 0.0 0.0 0.0 0.0 updates
4m25s 0 3.0 3.7 2147.5 2281.7 2281.7 2281.7 insert
4m25s 0 0.0 0.4 0.0 0.0 0.0 0.0 retrieve
4m25s 0 1.0 0.4 939.5 939.5 939.5 939.5 updates
4m26s 0 4.0 3.7 2281.7 2550.1 2550.1 2550.1 insert
4m26s 0 1.0 0.4 704.6 704.6 704.6 704.6 retrieve
4m26s 0 0.0 0.4 0.0 0.0 0.0 0.0 updates
4m27s 0 3.0 3.7 2080.4 2281.7 2281.7 2281.7 insert
4m27s 0 0.0 0.4 0.0 0.0 0.0 0.0 retrieve
4m27s 0 1.0 0.4 805.3 805.3 805.3 805.3 updates
: exit status 1
@jordanlewis could you route this to someone to take a look?
I feel like this could be an issue in the load generator. The workers all use separate rngs. Shouldn't they all share an rng?
@BramGruneir / @knz didn't one of you work on this with Emmanuel? Could you take a look?
That shouldn't happen. I'll take a look.
SHA: https://github.com/cockroachdb/cockroach/commits/2215217e8ee38d28a14eb9fd2fe9af8b0b702e7d
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=interleavedpartitioned PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=958181&tab=buildLog
The test failed on master:
test.go:575,test.go:587: /home/agent/work/.go/bin/roachprod create teamcity-958181-interleavedpartitioned -n 9 --gce-machine-type=n1-standard-4 --gce-zones=us-west1-b,us-east4-b,us-central1-a --geo returned:
stderr:
stdout:
d --boot-disk-size 10 --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script314600310 --project cockroach-ephemeral]
Output: Created [https://www.googleapis.com/compute/v1/projects/cockroach-ephemeral/zones/us-east4-b/instances/teamcity-958181-interleavedpartitioned-0004].
Created [https://www.googleapis.com/compute/v1/projects/cockroach-ephemeral/zones/us-east4-b/instances/teamcity-958181-interleavedpartitioned-0006].
WARNING: Some requests generated warnings:
- The resource 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20171002' is deprecated. A suggested replacement is 'projects/ubuntu-os-cloud/global/images/ubuntu-1604-xenial-v20181004'.
ERROR: (gcloud.compute.instances.create) Could not fetch resource:
- Quota 'CPUS' exceeded. Limit: 24.0 in region us-east4.
: exit status 1
Cleaning up...
: exit status 1
@petermattis quota exceeded. It's likely that the 32-node tests I added play a part in this:
roachprod list
bram-1539209006-interleavedpartitioned: [gce] 12 (27m40s)
drew-cluster: [gce] 1 (5h27m40s)
drew-n1-high-4: [gce] 1 (6h27m40s)
jesse-tuning: [gce] 12 (43h27m40s)
jordan-geo: [gce] 9 (5h27m40s)
nathan-1539197323-tpcc-nodes-3-w-1000-init: [gce] 4 (8h27m40s)
nathan-jaeger: [gce] 1 (8h27m40s)
teamcity-958181-hibernate: [gce] 1 (10h27m40s)
teamcity-958181-import-tpch-nodes-32: [gce] 32 (10h27m40s)
teamcity-958181-import-tpch-nodes-4: [gce] 4 (11h27m40s)
teamcity-958181-import-tpch-nodes-8: [gce] 8 (11h27m40s)
teamcity-958181-jepsen-batch1: [gce] 6 (12h27m40s)
tim-o-test: [gce] 4 (17h27m40s)
@tschottdorf Yeah, the 32-node test is likely problematic.
@andreimatei You're mucking with the parallelism logic in roachtest. My long-term vision was that roachtest needs to have a more fine-grained notion of cluster resources. Rather than allowing X clusters to run concurrently, roachtest should know that there are Y CPU resources to use in a cluster and take that into consideration when allowing a test to run. The test specs already tell us how many nodes and how many CPUs per node a cluster will use. Hardcoding our GCP quota seems fine in the short term. The rest is a small matter of programming. What do you think?
PS Note that the teamcity tests run in us-central while manually created clusters run in us-east.
Here's my current thinking:
Right now, the roachtest --parallelism flag controls two things: how many different clusters will exist at once, and how many tests will compete for local CPU resources at once. The fact that these two orthogonal things are mixed is not great, but completely decoupling them seems hard to me (i.e. to get more resources for the tests themselves, as opposed to the clusters, we'd have to start copying the roachtest binary around and schedule test runs remotely).
With this in mind, what I'm trying to do is introduce a scheduler that:
a) can run multiple tests concurrently
b) respects both limits specified by --parallelism: #clusters and local resources
c) takes each test's cluster type and other types of cluster affinity (e.g. Jepsen) into account and reuses clusters to the greatest degree possible
I'm currently trying a different design than the one you briefly looked at a few days ago: I'm trying a worker pool sized to --parallelism that does affinity-aware work stealing.
If I rationalize correctly, what you're saying is that we'd want --parallelism to only refer to local resources (maybe we'll call it --max-parallelism), and then have a --quota that caps the sum of resources of all concurrent clusters. I had considered that, but thought that we don't quite need it because all the clusters I've seen seem to be roughly the same size. I guess I haven't looked hard enough. I will try to incorporate this. I guess I would only take into consideration one type of resource, CPUs, since multi-dimensional schedulers are hard. And then the simplest thing to do, I guess, is to refuse to create a cluster that would take the resource usage over the limit and always try to create clusters that fit. Which would mean that small tests will tend to run first and large ones later.
I'll see what I can do. hashtag roachborg
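The "refuse to create a cluster that would go over the limit, otherwise create whatever fits" idea can be sketched in Go as follows. This is only an illustration under stated assumptions: `clusterSpec`, `quota`, and `admit` are hypothetical names and not roachtest code, the limit of 160 CPUs is invented, and 4 CPUs per node is assumed for every cluster (the thread only shows n1-standard-4 for interleavedpartitioned).

```go
package main

import "fmt"

// clusterSpec is a hypothetical stand-in for what a test spec declares.
type clusterSpec struct {
	name        string
	nodes       int
	cpusPerNode int
}

func (s clusterSpec) cpus() int { return s.nodes * s.cpusPerNode }

// quota tracks CPUs in use against a fixed limit (a single resource dimension).
type quota struct {
	limit, used int
}

// tryAcquire reserves the CPUs for s if doing so stays within the limit.
func (q *quota) tryAcquire(s clusterSpec) bool {
	if q.used+s.cpus() > q.limit {
		return false
	}
	q.used += s.cpus()
	return true
}

// release returns the CPUs of a destroyed cluster to the pool.
func (q *quota) release(s clusterSpec) { q.used -= s.cpus() }

// admit walks the pending list in order, starting every cluster that fits
// and leaving the rest waiting.
func admit(q *quota, pending []clusterSpec) (started, waiting []clusterSpec) {
	for _, s := range pending {
		if q.tryAcquire(s) {
			started = append(started, s)
		} else {
			waiting = append(waiting, s)
		}
	}
	return started, waiting
}

func main() {
	q := &quota{limit: 160} // invented limit, for illustration only
	pending := []clusterSpec{
		{"interleavedpartitioned", 9, 4},
		{"import-tpch-nodes-8", 8, 4},
		{"import-tpch-nodes-32", 32, 4},
		{"import-tpch-nodes-4", 4, 4},
	}
	started, waiting := admit(q, pending)
	for _, s := range started {
		fmt.Printf("start %s (%d CPUs)\n", s.name, s.cpus())
	}
	for _, s := range waiting {
		fmt.Printf("wait  %s (%d CPUs)\n", s.name, s.cpus())
	}
	fmt.Printf("used %d of %d CPUs\n", q.used, q.limit)
}
```

With these numbers the three smaller clusters start (36 + 32 + 16 = 84 CPUs) and the 32-node cluster (128 CPUs) waits, which is the "small tests first, large ones later" behavior described above. A real scheduler would also need to keep large clusters from waiting forever behind a steady stream of small ones.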
I haven't thought about this enough. Perhaps --parallelism isn't needed if there is quota-based scheduling. What local resources are you worried about limiting that you'd want to keep --parallelism around? I suppose starting 100 remote clusters concurrently might overly stress your local machine, but once the clusters are set up I think they can run tests fine concurrently. I guess what I'm getting at is that --quota would be used to limit the cluster resources and internally roachtest should make sure it isn't doing too much cluster manipulation concurrently. Does that require --parallelism? Maybe. Maybe not.
Well, locally, roachtest runs... the tests. Those tests do stuff, they don't just do "cluster manipulation" operations. You want to have some limit on how many tests run concurrently or they'll starve your CPU. Ideally, a test would declare how CPU-intensive it is. Short of that, the limit can be fairly high by default.
We can't currently run roachtests locally concurrently because of limitations in roachprod (only 1 local cluster at a time).
No, I didn't mean "locally" as in --local. I mean roachtest runs the tests (the t.spec.Run() function) on your machine. You can't have an unbounded number of those guys running at a time or they'll thrash; there's got to be some limit.
So I did a bit of a reworking on the original test failure but I don't want to derail this discussion. Perhaps we should move it out of this issue.
SHA: https://github.com/cockroachdb/cockroach/commits/d562a0b4d9a4d640b81a5e1d564e6d4a680fd91a
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=interleavedpartitioned PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=968121&tab=buildLog
The test failed on master:
test.go:575,test.go:587: /home/agent/work/.go/bin/roachprod create teamcity-968121-interleavedpartitioned -n 9 --gce-machine-type=n1-standard-4 --gce-zones=us-west1-b,us-east4-b,us-central1-a --geo returned:
stderr:
stdout:
Creating cluster teamcity-968121-interleavedpartitioned with 9 nodes
Unable to create cluster:
in provider: gce: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-1604-xenial-v20181004 --image-project ubuntu-os-cloud --boot-disk-size 10 --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script310566801 --project cockroach-ephemeral]
Output: ERROR: (gcloud.compute.instances.create) Could not fetch resource:
- Quota 'CPUS' exceeded. Limit: 24.0 in region us-east4.
: exit status 1
Cleaning up...
: exit status 1
SHA: https://github.com/cockroachdb/cockroach/commits/5a373445c0674f060a4bfe369ad290a0cacccb6c
Parameters:
To repro, try:
# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stressrace TESTS=interleavedpartitioned PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=982644&tab=buildLog
The test failed on master:
test.go:639,test.go:651: /home/agent/work/.go/bin/roachprod create teamcity-982644-interleavedpartitioned -n 9 --gce-machine-type=n1-standard-4 --gce-zones=us-west1-b,us-east4-b,us-central1-a --geo returned:
stderr:
stdout:
Creating cluster teamcity-982644-interleavedpartitioned with 9 nodes
Unable to create cluster:
in provider: gce: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-1604-xenial-v20181004 --image-project ubuntu-os-cloud --boot-disk-size 10 --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script858555705 --project cockroach-ephemeral]
Output: ERROR: (gcloud.compute.instances.create) Could not fetch resource:
- Quota 'CPUS' exceeded. Limit: 24.0 in region us-east4.
: exit status 1
Cleaning up...
: exit status 1
Closing this issue, as I've updated the test.
Looks like we're still running into the out-of-CPU issue (and my update to the test will only make that a little bit worse too).
So feel free to reopen, but let's reassign appropriately.
SHA: https://github.com/cockroachdb/cockroach/commits/dab7a982c4aea0439df8cadaaa889c2e0db9609b
Parameters:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=852802&tab=buildLog