cockroachdb / cockroach


roachtest: fails to terminate cleanly when device full #32384

Open tbg opened 5 years ago

tbg commented 5 years ago

I was looking at the history of restore2TB/nodes=10 and wondered why it had a relatively fast passing result on release-2.1. Looking at the logs, I found that roachtest had panicked because it ran out of disk space, yet the run was still reported as passing.

For these runs, roachtest itself is in charge of posting issues. This is problematic because nobody watches the watchman: when roachtest itself crashes, no issue gets posted. Regarding the discussion about having roachtest post its own issues in more places, I think we may want to take the opposite route and not have it post anything anymore.

##teamcity[publishArtifacts '/home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20181114-1012425/scaledata/jobcoordinator/nodes=3/** => scaledata/jobcoordinator/nodes=3']
[04:59:09] schemachange/tpcc/warehouses=1000/nodes=5 (3m:58s)
[04:59:36] [ 545] rebalance/3to5: waiting for reblance (36m7s)
[04:59:36] [ 545] restore2TB/nodes=10: running restore (1h8m45s)
[04:59:36] [ 545] scaledata/jobcoordinator/nodes=6: ??? (0s)
[04:59:36] [ 545] schemachange/kv: loading fixture (10s)
[04:59:36] [ 545] schemachange/tpcc/warehouses=1000/nodes=5: ??? (0s)
[04:59:58] panic: write /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20181114-1012425/rebalance/3to5/test.log: no space left on device [recovered]
[04:59:58]     panic: write /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20181114-1012425/rebalance/3to5/test.log: no space left on device [recovered]
[04:59:58]     panic: write /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20181114-1012425/rebalance/3to5/test.log: no space left on device
[04:59:58]
[04:59:58] goroutine 399624 [running]:
[04:59:58] main.(*monitor).Go.func1.1(0xc420c27f67, 0xc420c27f88)
[04:59:58]     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1457 +0xed
[04:59:58] panic(0x25f4500, 0xc421bc50e0)
[04:59:58]     /usr/local/go/src/runtime/panic.go:502 +0x229
[04:59:58] main.(*monitor).Go.func1.2(0xc420c27f67)
[04:59:58]     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1476 +0x85
[04:59:58] panic(0x25f4500, 0xc421bc50e0)
[04:59:58]     /usr/local/go/src/runtime/panic.go:502 +0x229
[04:59:58] main.(*logger).Printf(0xc420c77b80, 0x290346d, 0x3, 0xc420c27d40, 0x1, 0x1)
[04:59:58]     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/log.go:219 +0xb3
[04:59:58] main.waitForRebalance(0x2e2ec40, 0xc421b04700, 0xc420c77b80, 0xc420790640, 0x4045000000000000, 0x0, 0x0)
[04:59:58]     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/allocator.go:232 +0x2de
[04:59:58] main.registerAllocator.func1.3(0x2e2ec40, 0xc421b04700, 0xc4218d2767, 0x0)
[04:59:58]     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/allocator.go:73 +0xbf
[04:59:58] main.(*monitor).Go.func1(0x0, 0x0)
[04:59:58]     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1481 +0xd8
[04:59:58] github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1(0xc421b04740, 0xc420a68460)
[04:59:58]     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:58 +0x57
[04:59:58] created by github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go
[04:59:58]     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:55 +0x66
[04:59:58] + exit_status=0
[04:59:58] ++ find artifacts/20181114-1012425 -name stats.json
[04:59:58] + for file in '$(find ${artifacts#${PWD}/} -name stats.json)'
[04:59:58] + gsutil cp artifacts/20181114-1012425/interleavedpartitioned/8.logs/stats.json gs://cockroach-nightly/artifacts/20181114-1012425/interleavedpartitioned/8.logs/stats.json
[05:00:01] Process SyncManager-1:
[05:00:01] Traceback (most recent call last):
[05:00:01]   File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
[05:00:01]     self.run()
[05:00:01]   File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
[05:00:01]     self._target(*self._args, **self._kwargs)
[05:00:01]   File "/usr/lib/python2.7/multiprocessing/managers.py", line 550, in _run_server
[05:00:01]     server = cls._Server(registry, address, authkey, serializer)
[05:00:01]   File "/usr/lib/python2.7/multiprocessing/managers.py", line 162, in __init__
[05:00:01]     self.listener = Listener(address=address, backlog=16)
[05:00:01]   File "/usr/lib/python2.7/multiprocessing/connection.py", line 127, in __init__
[05:00:01]     address = address or arbitrary_address(family)
[05:00:01]   File "/usr/lib/python2.7/multiprocessing/connection.py", line 90, in arbitrary_address
[05:00:01]     return tempfile.mktemp(prefix='listener-', dir=get_temp_dir())
[05:00:01]   File "/usr/lib/python2.7/multiprocessing/util.py", line 139, in get_temp_dir
[05:00:01]     tempdir = tempfile.mkdtemp(prefix='pymp-')
[05:00:01]   File "/usr/lib/python2.7/tempfile.py", line 331, in mkdtemp
[05:00:01]     dir = gettempdir()
[05:00:01]   File "/usr/lib/python2.7/tempfile.py", line 275, in gettempdir
[05:00:01]     tempdir = _get_default_tempdir()
[05:00:01]   File "/usr/lib/python2.7/tempfile.py", line 217, in _get_default_tempdir
[05:00:01]     ("No usable temporary directory found in %s" % dirlist))
[05:00:01] IOError: [Errno 2] No usable temporary directory found in ['/home/agent/temp/buildTmp', '/home/agent/temp/buildTmp', '/home/agent/temp/buildTmp', '/tmp', '/var/tmp', '/usr/tmp', '/home/agent/work/.go/src/github.com/cockroachdb/cockroach']
[05:00:02] OSError: No space left on device.

Epic CRDB-10428

Jira issue: CRDB-4752

tbg commented 5 years ago

In addition to not posting an issue, we're also not marking the tests that didn't complete as failures in TeamCity. Our Go test output parser, by contrast, does mark such tests as failed.
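To illustrate the kind of check a Go test output parser can make, here is a minimal sketch that reports tests which were started but never reached a terminal line. The function name `incompleteTests` and the sample log are hypothetical and only for illustration; this is not the parser used by TeamCity or roachtest, just an approximation of the idea under the assumption that the output follows the standard `=== RUN` / `--- PASS:` / `--- FAIL:` / `--- SKIP:` conventions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// incompleteTests scans Go-test-style output and returns, in sorted order,
// the names of tests that have a "=== RUN" line but no terminal
// "--- PASS:", "--- FAIL:" or "--- SKIP:" line.
// Hypothetical helper for illustration only.
func incompleteTests(output string) []string {
	open := map[string]bool{}
	for _, raw := range strings.Split(output, "\n") {
		line := strings.TrimSpace(raw)
		fields := strings.Fields(line)
		if len(fields) < 3 {
			continue
		}
		switch {
		case strings.HasPrefix(line, "=== RUN"):
			// "=== RUN   <name>": the test name is the third field.
			open[fields[2]] = true
		case strings.HasPrefix(line, "--- PASS:"),
			strings.HasPrefix(line, "--- FAIL:"),
			strings.HasPrefix(line, "--- SKIP:"):
			// "--- PASS: <name> (1.00s)": the test name is the third field.
			delete(open, fields[2])
		}
	}
	names := make([]string, 0, len(open))
	for name := range open {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	// Made-up sample output: one test finishes, one is cut off by a panic.
	log := "=== RUN   schemachange/kv\n" +
		"=== RUN   restore2TB/nodes=10\n" +
		"--- PASS: schemachange/kv (10.00s)\n" +
		"panic: write test.log: no space left on device\n"
	for _, name := range incompleteTests(log) {
		fmt.Printf("would mark as failed: %s\n", name)
	}
}
```

A check like this only covers tests that appear in the output at all; a test that never printed a `=== RUN` line is invisible to it.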

petermattis commented 5 years ago

Well, that's not good. Will an external poster be able to post if the disk is full?

tbg commented 5 years ago

Yes. The external poster would see the Go test output and fail all tests that weren't explicitly terminated. That's the "test ended in panic" message we see in such cases. It would not, however, fail tests that were never mentioned in the logs. To work around that, roachtest should emit a header for every test it's going to run and immediately pause each one, using the RUN/PAUSE/CONT markers of the Go test output format.
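A minimal sketch of that workaround, assuming the standard Go test output markers `=== RUN`, `=== PAUSE` and `=== CONT`. The function names `announce` and `resume` are hypothetical and are not part of roachtest; the test names are taken from the log above purely as examples.

```go
package main

import (
	"fmt"
	"strings"
)

// announce returns the header lines for every scheduled test: a "=== RUN"
// line immediately followed by a "=== PAUSE" line, so that each test is
// mentioned in the output before any of them starts.
// Hypothetical helper for illustration only.
func announce(names []string) string {
	var b strings.Builder
	for _, name := range names {
		fmt.Fprintf(&b, "=== RUN   %s\n", name)
		fmt.Fprintf(&b, "=== PAUSE %s\n", name)
	}
	return b.String()
}

// resume returns the "=== CONT" line to print when a paused test
// actually starts running.
func resume(name string) string {
	return fmt.Sprintf("=== CONT  %s\n", name)
}

func main() {
	tests := []string{"restore2TB/nodes=10", "schemachange/kv"}
	// Emit the headers up front, before running anything.
	fmt.Print(announce(tests))
	// Later, when the first test is actually scheduled:
	fmt.Print(resume(tests[0]))
}
```

With the headers written first, a parser that fails every test lacking a terminal line would also catch tests that never got to run before the crash.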

github-actions[bot] commented 3 years ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

knz commented 3 years ago

still relevant

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!