sql: nightly benchmark broken

jordanlewis commented 8 years ago

The nightly benchmark seems to have stalled during the sql.test portion. The sql.test.stderr log is growing forever with the following log lines:

I160727 10:14:34.148590 server/status/runtime.go:189  runtime stats: 1.5 GiB RSS, 92 goroutines, 20 MiB/1.1 GiB/1.2 GiB GO alloc/idle/total, 81 MiB/147 MiB CGO alloc/total, 205.60c
go/sec, 0.03/0.01 %(u/s)time, 0.00 %gc (1x)
W160727 10:14:34.158639 ts/db.go:104  node unavailable; try another peer
W160727 10:14:34.158665 server/node.go:642  error recording status summaries: node unavailable; try another peer
W160727 10:14:42.055111 gossip/gossip.go:942  not connected to cluster; use --join to specify a connected node
I160727 10:14:44.028505 server/status/runtime.go:189  runtime stats: 1.5 GiB RSS, 94 goroutines, 21 MiB/1.1 GiB/1.2 GiB GO alloc/idle/total, 81 MiB/147 MiB CGO alloc/total, 215.81c
go/sec, 0.02/0.01 %(u/s)time, 0.00 %gc (0x)

The test invocation on TeamCity is here: https://teamcity.cockroachdb.com/viewLog.html?buildId=5068&buildTypeId=Cockroach_BenchmarkTests&tab=buildLog he nightly benchmark seems to have stalled during the sql.test portion. The sql.test.stderr log is growing forever with the following log lines:

I160727 10:14:34.148590 server/status/runtime.go:189  runtime stats: 1.5 GiB RSS, 92 goroutines, 20 MiB/1.1 GiB/1.2 GiB GO alloc/idle/total, 81 MiB/147 MiB CGO alloc/total, 205.60c
go/sec, 0.03/0.01 %(u/s)time, 0.00 %gc (1x)
W160727 10:14:34.158639 ts/db.go:104  node unavailable; try another peer
W160727 10:14:34.158665 server/node.go:642  error recording status summaries: node unavailable; try another peer
W160727 10:14:42.055111 gossip/gossip.go:942  not connected to cluster; use --join to specify a connected node
I160727 10:14:44.028505 server/status/runtime.go:189  runtime stats: 1.5 GiB RSS, 94 goroutines, 21 MiB/1.1 GiB/1.2 GiB GO alloc/idle/total, 81 MiB/147 MiB CGO alloc/total, 215.81c
go/sec, 0.02/0.01 %(u/s)time, 0.00 %gc (0x)

The test invocation on TeamCity is here: https://teamcity.cockroachdb.com/viewLog.html?buildId=5068&buildTypeId=Cockroach_BenchmarkTests&tab=buildLog

UPDATE: current issue is that sql.BenchmarkPgbenchExec_{Cockroach,Postgres} fail in the absence of pgbench, which we do not install anywhere.

jordanlewis commented 8 years ago

I stopped the build but left the instance running in case someone wants to do forensics.

https://console.cloud.google.com/compute/instancesDetail/zones/us-east1-b/instances/benchmark-static-tests-0?project=cockroach-shared&graph=GCE_CPU&duration=P4D

cuongdo commented 8 years ago

Seems like an issue shutting down the server after a test. Not much else to go on.

jordanlewis commented 8 years ago

@WillHaack seems to have more details on this. Could you chime in?

WillHaack commented 8 years ago

I believe @cuongdo is correct. Some tests may timeout, and when they do I'm not sure they close down the servers properly so the subsequent tests may have trouble binding to ports. (Also, I'm not even sure that they close their connections properly when they successfully run.) I'll try to work on fixing this today. If all else fails it may be worth making it so that multinode tests only run with the -multinode flag until further notice. (IMO it's not a big deal to keep multinode tests that are failing, but it is a big deal if they are causing other tests to fail.)

WillHaack commented 8 years ago

https://github.com/cockroachdb/cockroach/pull/8075 likely fixed the issue. The benchmark tests have been running for 7+ hours now. They were failing consistently in about 2.5hrs before.

jordanlewis commented 8 years ago

@WillHaack I think they're broken in a different way now. Can you please investigate? They've now been running for 30 hours -- seems like they might have stalled again. https://teamcity.cockroachdb.com/viewLog.html?buildId=7571&tab=buildResultsDiv&buildTypeId=Cockroach_BenchmarkTests

tbg commented 7 years ago

Do they now pass, with #10237 in?

tamird commented 7 years ago

Looks like they are broken by the move to azure builders? Tonight's run failed with "terraform is not in your path": https://teamcity.cockroachdb.com/viewLog.html?tab=buildLog&logTab=tree&filter=debug&expand=all&buildId=37243#_focus=724

Probably we should run this orchestration from inside the builder container instead of adding terraform to the TC builders. cc @jordanlewis