philip-stoev opened 1 year ago
Hello, I am Blathers. I am here to help you get the issue triaged.
Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.
I have CC'd a few people who may be able to assist you:
If we have not gotten back to your issue within a few business days, you can try the following:
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Hey @philip-stoev! How are you? I tried to reproduce this but couldn't; my docker logs don't have any errors. The last couple lines from the command you provided were:
> SELECT * FROM schema_strategy_test_id
rows didn't match; sleeping to see if dataflow catches up 50ms 75ms 113ms 169ms 253ms^Cmzcompose: test case workflow-default failed: running docker compose failed (exit status 130)
mzcompose: error: at least one test case failed
So maybe I didn't get far enough? What would really be helpful is the schema and the text of the failing query; if possible, running the query with an "EXPLAIN ANALYZE (DEBUG)" prefix would be ideal. This will produce a statement bundle that should give us everything we need in one place. Let me know if that works.
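Something like the following should produce that bundle from the test environment (just a sketch: the mzcompose exec invocation and the --insecure flag are assumptions about how the embedded CockroachDB is exposed, and the query is the one from the log above):
# Hypothetical: run the failing query with the DEBUG prefix inside the container
# that hosts CRDB; the output names a statement bundle zip that can be attached here.
$ ./mzcompose exec materialized cockroach sql --insecure -e "EXPLAIN ANALYZE (DEBUG) SELECT * FROM schema_strategy_test_id;"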
The errors don't appear in the output of the test run itself (./mzcompose run default --redpanda), but rather in the logs of the materialized container. You can view those via ./mzcompose logs materialized. For example, here's the output of ./mzcompose run default --redpanda in our CI last night, and here's the output of the materialized container from that run. In that second file you'll see lots of error messages like "stash error: postgres: db error: ERROR: internal error: the first bucket should have NumRange=0".
It's a bit hard to track this down to a single query, though we can try. Here's the code that's returning the error: https://github.com/MaterializeInc/materialize/blob/f631632edf57ea73e91f9b4faacb99dff8b2cf15/src/stash/src/postgres.rs#L1347-L1384
The definition of those prepared statements is here: https://github.com/MaterializeInc/materialize/blob/f631632edf57ea73e91f9b4faacb99dff8b2cf15/src/stash/src/postgres.rs#L1408-L1435
It occurs to me: this error didn't happen until ~45m into the CI run. That's consistent across all five CI jobs that failed. And we only saw this in our long-running nightly tests. This bug looks related to query statistics. Is it possible there's some buggy statistics job in Cockroach that only kicks in after 40m or so?
> It occurs to me: this error didn't happen until ~45m into the CI run. That's consistent across all five CI jobs that failed. And we only saw this in our long-running nightly tests. This bug looks related to query statistics. Is it possible there's some buggy statistics job in Cockroach that only kicks in after 40m or so?
@benesch yes, definitely possible that the statistics causing this issue are only collected after about 40m into the tests. With automatic statistics enabled, statistics are collected as tables are modified. Assuming the long-running test is constantly modifying this table it's possible that the statistics triggering this crash are collected at roughly 40m.
Before those prepared statements execute, assuming they always crash, it would be helpful to have the results of these statements:
SHOW CREATE TABLE data;
SHOW STATISTICS USING JSON FOR TABLE data;
We'll keep trying to reproduce it here, too.
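If it's easier to grab those from the test environment non-interactively, here's a sketch (the mzcompose exec wrapper and the --insecure flag are assumptions about how the embedded CockroachDB is exposed; the data table name comes from the statements above):
# Hypothetical: dump the schema and the JSON statistics for the data table into
# a file that can be pasted into this issue.
$ ./mzcompose exec materialized cockroach sql --insecure -e "SHOW CREATE TABLE data;" -e "SHOW STATISTICS USING JSON FOR TABLE data;" > data_stats.txt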
I may be jumping to conclusions, but one possibility is that this crash is related to the new statistics forecasting feature in v22.2. You could try turning that off to see if it is indeed the problem. Here are some ways to do that, from most general to most specific:
SET CLUSTER SETTING sql.stats.forecasts.enabled = false; -- for all tables in the whole cluster
ALTER TABLE data SET (sql_stats_forecasts_enabled = false); -- for just this table
SET optimizer_use_forecasts = false; -- for just the session executing these prepared statements

> Assuming the long-running test is constantly modifying this table
💡 yes, that sounds like it! That explains why I've seen it happen after just 15-20m on my local machine. My local machine is about twice as fast as the CI machines. It's always triggered around the same point in the test script, both on my local machine and in CI. So the evidence is totally consistent with a statistics job that is triggered based on write volume to the table. (We use CockroachDB as basically a key-value store, so there's one table that gets written to on basically every transaction.)
I'll try disabling stats forecasting and report back.
I pulled 231879a3635a9b41eac2fd1bb03a56f6c6dc0a3c and docker logs shows I'm using 22.2. Maybe my machine is too fast...
Hrm. So you're seeing ./mzcompose run default --redpanda run to completion, without experiencing the issue? How long does it take from start to end?
I didn't time the first one but it seemed like 10m or so, and then I got the error above. Round 2 is still running and seems like it's taking longer ... okay, I just hit the error. I'll see if I can figure out what's going on.
Ah, fantastic! Looking at this a bit deeper, it seems that the error doesn't cause a crash; instead it's returned to the client with a backtrace.
If it’s helpful to be able to connect to the Cockroach instance after it starts producing this error, so that you can interactively run queries against it, you should be able to do something like the following:
./mzcompose exec materialized cockroach sql
I haven't actually tested this, but the idea is that the cockroach process is running inside the materialized container, and you should be able to exec in and connect to it.
What would be helpful is if I could replace the binary with a modified one and re-run the tests, not sure how to do that.
You can edit the Dockerfile in misc/images/materialized/Dockerfile to install whatever custom version of Cockroach you'd like! https://github.com/MaterializeInc/materialize/blob/8b4f685a6da9c2468a6aea3e405b997e077aef6f/misc/images/materialized/Dockerfile#L10-L39
As soon as you make that change, the next execution of ./mzcompose run will automatically rebuild. Note this will take a while to complete the first time, as it'll have to recompile Materialize from scratch instead of just downloading the cached materialized image from Docker Hub.
I made this change:
diff --git a/misc/images/materialized/Dockerfile b/misc/images/materialized/Dockerfile
index bb07a75af..92dfa6baa 100644
--- a/misc/images/materialized/Dockerfile
+++ b/misc/images/materialized/Dockerfile
@@ -24,7 +24,8 @@ RUN apt-get update \
&& mkdir /cockroach-data \
&& chown materialize /mzdata /cockroach-data
-COPY --from=crdb /cockroach/cockroach /usr/local/bin/cockroach
+#COPY --from=crdb /cockroach/cockroach /usr/local/bin/cockroach
+COPY cockroach-materialize /usr/local/bin/cockroach
COPY storaged computed environmentd entrypoint.sh /usr/local/bin/
Doesn't seem to work:
❯ ./mzcompose run default --redpanda
==> Collecting mzbuild images
materialize/ubuntu-base:mzbuild-HK3XE35BRSUNJPVZJFNR5ZIBS57NGZAL
materialize/materialized:mzbuild-3AHXVLDWH73UMFQG457IBMPMA2FJONUE
materialize/testdrive:mzbuild-O6JF7GBJ2Q23A732N2QVHY2PPVIPI5LX
warning: Docker only has 7.7 GiB of memory available. We recommend at least 8.0 GiB of memory. See https://materialize.com/docs/third-party/docker/.
==> Acquiring materialize/materialized:mzbuild-3AHXVLDWH73UMFQG457IBMPMA2FJONUE
$ docker pull materialize/materialized:mzbuild-3AHXVLDWH73UMFQG457IBMPMA2FJONUE
Error response from daemon: manifest for materialize/materialized:mzbuild-3AHXVLDWH73UMFQG457IBMPMA2FJONUE not found: manifest unknown: manifest unknown
$ git clean -ffdX /Users/treilly/src/materialize/misc/images/materialized
$ brew install materializeinc/crosstools/aarch64-unknown-linux-gnu
Warning: materializeinc/crosstools/aarch64-unknown-linux-gnu 0.1.0 is already installed and up-to-date.
To reinstall 0.1.0, run:
brew reinstall aarch64-unknown-linux-gnu
$ rustup target add aarch64-unknown-linux-gnu
Traceback (most recent call last):
File "/Users/treilly/src/materialize/misc/python/materialize/mzbuild.py", line 490, in acquire
spawn.runv(
File "/Users/treilly/src/materialize/misc/python/materialize/spawn.py", line 71, in runv
return subprocess.run(
File "/opt/homebrew/Cellar/python@3.10/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['docker', 'pull', 'materialize/materialized:mzbuild-3AHXVLDWH73UMFQG457IBMPMA2FJONUE']' returned non-zero exit status 1.
It's failing doing this:
❯ docker pull materialize/materialized:mzbuild-3AHXVLDWH73UMFQG457IBMPMA2FJONUE
Error response from daemon: manifest for materialize/materialized:mzbuild-3AHXVLDWH73UMFQG457IBMPMA2FJONUE not found: manifest unknown: manifest unknown
I don't know what's going on but if I revert my change it works fine, maybe my machine isn't equipped to do a full image rebuild.
Can I just docker copy a new file in place and just re-run the tests with the existing containers?
I'm struggling to see what went wrong in the snippet you pasted above. It's meant to be ok for that docker pull to fail, as that's what triggers the local build. We specifically catch that CalledProcessError:
https://github.com/MaterializeInc/materialize/blob/main/misc/python/materialize/mzbuild.py#L494
Was there more error output after that? Something like “while handling the above exception, another exception occurred”, followed by a different exception?
> Can I just docker copy a new file in place and just re-run the tests with the existing containers?
In theory, for sure, though you’d be treading new ground. I think you could do something like:
$ docker create materialize/materialized:mzbuild-MZBUILDHASH
CID
$ docker cp cockroach CID:/usr/local/bin/cockroach
$ docker commit CID materialize/materialized:mzbuild-MZBUILDHASH
That would locally overwrite your copy of the materialized image at the relevant hash with your patched version, including your custom build of Cockroach. Then the next ./mzcompose run would use that cached image. To go back to the upstream image:
$ docker rmi materialize/materialized:mzbuild-HASH
Yes, there was another error but I didn't look at it closely; apparently I need to brew install rustup and put .cargo/bin in my PATH to get further. But I didn't get much further: now I get a git authentication error:
❯ ./mzcompose run default --redpanda
==> Collecting mzbuild images
materialize/ubuntu-base:mzbuild-HK3XE35BRSUNJPVZJFNR5ZIBS57NGZAL
materialize/materialized:mzbuild-2LBPQ73M3CBUOH7X4SISECRVIWWDFIIY
materialize/testdrive:mzbuild-O6JF7GBJ2Q23A732N2QVHY2PPVIPI5LX
warning: Docker only has 7.7 GiB of memory available. We recommend at least 8.0 GiB of memory. See https://materialize.com/docs/third-party/docker/.
==> Acquiring materialize/materialized:mzbuild-2LBPQ73M3CBUOH7X4SISECRVIWWDFIIY
$ docker pull materialize/materialized:mzbuild-2LBPQ73M3CBUOH7X4SISECRVIWWDFIIY
Error response from daemon: manifest for materialize/materialized:mzbuild-2LBPQ73M3CBUOH7X4SISECRVIWWDFIIY not found: manifest unknown: manifest unknown
$ git clean -ffdX /Users/treilly/src/materialize/misc/images/materialized
$ brew install materializeinc/crosstools/aarch64-unknown-linux-gnu
Warning: materializeinc/crosstools/aarch64-unknown-linux-gnu 0.1.0 is already installed and up-to-date.
To reinstall 0.1.0, run:
brew reinstall aarch64-unknown-linux-gnu
$ rustup target add aarch64-unknown-linux-gnu
info: downloading component 'rust-std' for 'aarch64-unknown-linux-gnu'
info: installing component 'rust-std' for 'aarch64-unknown-linux-gnu'
39.8 MiB / 39.8 MiB (100 %) 19.3 MiB/s in 2s ETA: 0s
$ env CMAKE_SYSTEM_NAME=Linux CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_LINKER=aarch64-unknown-linux-gnu-cc CARGO_TARGET_DIR=/Users/treilly/src/materialize/target-xcompile TARGET_AR=aarch64-unknown-linux-gnu-ar TARGET_CPP=aarch64-unknown-linux-gnu-cpp TARGET_CC=aarch64-unknown-linux-gnu-cc TARGET_CXX=aarch64-unknown-linux-gnu-c++ TARGET_CXXSTDLIB=static=stdc++ TARGET_LD=aarch64-unknown-linux-gnu-ld TARGET_RANLIB=aarch64-unknown-linux-gnu-ranlib 'RUSTFLAGS=-Clink-arg=-Wl,--compress-debug-sections=zlib -L/opt/homebrew/Cellar/aarch64-unknown-linux-gnu/0.1.0/bin/../aarch64-unknown-linux-gnu/sysroot/lib' cargo build --target aarch64-unknown-linux-gnu --bin storaged --bin computed --bin environmentd --release
warning: /Users/treilly/src/materialize/src/workspace-hack/Cargo.toml: version requirement `0.4.2+5.2.1-patched.2` for dependency `tikv-jemalloc-sys` includes semver metadata which will be ignored, removing the metadata is recommended to avoid confusion
warning: /Users/treilly/src/materialize/src/workspace-hack/Cargo.toml: version requirement `0.4.2+5.2.1-patched.2` for dependency `tikv-jemalloc-sys` includes semver metadata which will be ignored, removing the metadata is recommended to avoid confusion
Updating git repository `https://github.com/tokio-rs/axum.git`
error: failed to load source for dependency `axum`
Caused by:
Unable to update https://github.com/tokio-rs/axum.git#71e83291
Caused by:
failed to clone into: /Users/treilly/.cargo/git/db/axum-3a6345d9aff97fa3
Caused by:
failed to authenticate when downloading repository: git@github.com:/tokio-rs/axum.git
* attempted ssh-agent authentication, but no usernames succeeded: `git`
if the git CLI succeeds then `net.git-fetch-with-cli` may help here
https://doc.rust-lang.org/cargo/reference/config.html#netgit-fetch-with-cli
Caused by:
no authentication available
Traceback (most recent call last):
File "/Users/treilly/src/materialize/misc/python/materialize/mzbuild.py", line 490, in acquire
spawn.runv(
File "/Users/treilly/src/materialize/misc/python/materialize/spawn.py", line 71, in runv
return subprocess.run(
File "/opt/homebrew/Cellar/python@3.10/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['docker', 'pull', 'materialize/materialized:mzbuild-2LBPQ73M3CBUOH7X4SISECRVIWWDFIIY']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/homebrew/Cellar/python@3.10/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/homebrew/Cellar/python@3.10/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/treilly/src/materialize/misc/python/materialize/cli/mzcompose.py", line 681, in <module>
main(sys.argv[1:])
File "/Users/treilly/src/materialize/misc/python/materialize/cli/mzcompose.py", line 132, in main
args.command.invoke(args)
File "/Users/treilly/src/materialize/misc/python/materialize/cli/mzcompose.py", line 200, in invoke
self.run(args)
File "/Users/treilly/src/materialize/misc/python/materialize/cli/mzcompose.py", line 537, in run
super().run(args)
File "/Users/treilly/src/materialize/misc/python/materialize/cli/mzcompose.py", line 428, in run
composition.dependencies.acquire()
File "/Users/treilly/src/materialize/misc/python/materialize/mzbuild.py", line 639, in acquire
dep.acquire()
File "/Users/treilly/src/materialize/misc/python/materialize/mzbuild.py", line 495, in acquire
self.build()
File "/Users/treilly/src/materialize/misc/python/materialize/mzbuild.py", line 463, in build
pre_image.run()
File "/Users/treilly/src/materialize/misc/python/materialize/mzbuild.py", line 326, in run
self.build()
File "/Users/treilly/src/materialize/misc/python/materialize/mzbuild.py", line 257, in build
spawn.runv(cargo_build, cwd=self.rd.root)
File "/Users/treilly/src/materialize/misc/python/materialize/spawn.py", line 71, in runv
return subprocess.run(
File "/opt/homebrew/Cellar/python@3.10/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['env', 'CMAKE_SYSTEM_NAME=Linux', 'CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_LINKER=aarch64-unknown-linux-gnu-cc', 'CARGO_TARGET_DIR=/Users/treilly/src/materialize/target-xcompile', 'TARGET_AR=aarch64-unknown-linux-gnu-ar', 'TARGET_CPP=aarch64-unknown-linux-gnu-cpp', 'TARGET_CC=aarch64-unknown-linux-gnu-cc', 'TARGET_CXX=aarch64-unknown-linux-gnu-c++', 'TARGET_CXXSTDLIB=static=stdc++', 'TARGET_LD=aarch64-unknown-linux-gnu-ld', 'TARGET_RANLIB=aarch64-unknown-linux-gnu-ranlib', 'RUSTFLAGS=-Clink-arg=-Wl,--compress-debug-sections=zlib -L/opt/homebrew/Cellar/aarch64-unknown-linux-gnu/0.1.0/bin/../aarch64-unknown-linux-gnu/sysroot/lib', 'cargo', 'build', '--target', 'aarch64-unknown-linux-gnu', '--bin', 'storaged', '--bin', 'computed', '--bin', 'environmentd', '--release']' returned non-zero exit status 101.
I'll try the untread path next...
For whatever reason the Rust Git impl can’t fetch over SSH. If you follow the link to the config there to use the Git CLI instead, that should work:
> if the git CLI succeeds then net.git-fetch-with-cli may help here
> https://doc.rust-lang.org/cargo/reference/config.html#netgit-fetch-with-cli
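For reference, a minimal sketch of that workaround (assuming a user-level Cargo config; a project-local .cargo/config.toml works the same way):
# Tell Cargo to shell out to the git CLI for fetches, so your existing SSH
# agent and credentials are used instead of the built-in git implementation.
$ mkdir -p ~/.cargo
$ printf '[net]\ngit-fetch-with-cli = true\n' >> ~/.cargo/config.toml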
That fixed that issue, thanks!
Sweet! So you were able to get the tests running with your custom build of Cockroach?
I was, yes. I haven't hit paydirt yet; it will probably require a couple rounds of spelunking to get to the bottom of it.
Fantastic! Thanks for fighting the fight to get set up with our test infrastructure.
I narrowed this down to here in the stats forecasting code.
ForcecastTableSt(sid:0, tid: 109, ‹0x400b38dd28›) the first bucket should have NumRange=0 not -2.466595e-30
+‹goroutine 72060 [running]:›
+‹runtime/debug.Stack()›
+‹ GOROOT/src/runtime/debug/stack.go:24 +0x64›
+‹github.com/cockroachdb/cockroach/pkg/sql/stats.ForecastTableStatistics({0x59af470, 0x401418a420}, {0x4005135000?, 0x23, 0x40})›
+‹ github.com/cockroachdb/cockroach/pkg/sql/stats/pkg/sql/stats/forecast.go:120 +0x4a0›
+‹github.com/cockroachdb/cockroach/pkg/sql/stats.(*TableStatisticsCache).getTableStatsFromDB(0x4001846dc0, {0x59af470, 0x401418a420}, 0x1?, 0x1)›
+‹ github.com/cockroachdb/cockroach/pkg/sql/stats/pkg/sql/stats/stats_cache.go:749 +0x3d8›
+‹github.com/cockroachdb/cockroach/pkg/sql/stats.(*TableStatisticsCache).refreshCacheEntry.func1(0x4001846dc0, {0x59af470, 0x401418a420}, 0x0?, 0x0?, 0x400f135f48, 0x400f135f28)›
+‹ github.com/cockroachdb/cockroach/pkg/sql/stats/pkg/sql/stats/stats_cache.go:457 +0x12c›
+‹github.com/cockroachdb/cockroach/pkg/sql/stats.(*TableStatisticsCache).refreshCacheEntry(0x4001846dc0, {0x59af470, 0x401418a420}, 0x6a31480?, {0x401498c6e0?, 0xb9e850?, 0x40?})›
Disabling stats forecasts like @michae2 suggested should be a viable workaround; we'll work on a repro and fix next.
@cucaroach if you manage to get the SHOW STATISTICS USING JSON output for the table, and the CREATE TABLE statement, that might be enough to reproduce the error.
I was able to get a repro by copying the CRDB data off the docker instance and then running EXPLAIN ANALYZE (DEBUG) on the failing DELETE query. Statement bundle:
Using "debug sb recreate" and then running the query reproduces the problem.
@michae2 do you want to take it from here?
Several small problems in stats.(*histogram).adjustCounts and related code combined to cause this error. Here was the sequence of events:
1. data.time had a minimum observed value of -9223372036854775807. While creating the histogram for this column, we called adjustCounts, which in turn called addOuterBuckets, which created a new first bucket with upper bound -9223372036854775808 (int64 min). addOuterBuckets added a small amount to NumRange and DistinctRange of the original first bucket, even though there is zero actual range between int64s -9223372036854775808 and -9223372036854775807. This is the first problem.
2. When converting the histogram to a HistogramData we rounded NumRange but not DistinctRange, so we ended up with a second bucket with zero NumRange but positive DistinctRange. This is the second problem:
"histo_buckets": [
{
"distinct_range": 0,
"num_eq": 0,
"num_range": 0,
"upper_bound": "-9223372036854775808"
},
{
"distinct_range": 2.2737367544323206E-13,
"num_eq": 146,
"num_range": 0,
"upper_bound": "-9223372036854775807"
},
3. When forecasting, we called adjustCounts on this histogram again to create the forecasted histogram. adjustCounts did not expect to find a countable bucket with zero range, NumRange = 0, and DistinctRange > 0, and ended up giving the second bucket a negative NumRange. This is the third problem.
4. At the end of adjustCounts we called removeZeroBuckets, which interpreted the negative NumRange as a reason to remove the first bucket, leaving us with a first bucket with negative NumRange. This is the fourth problem.
Fixing any of these would have prevented the error.
It looks like these fixes went into 22.2.3, but I am still experiencing this crash on 22.2.3, with this backtrace:
2023-02-08T13:04:53.662581Z ERROR mz_stash::postgres: tokio-postgres stash consolidation error, retry attempt 768: stash error: postgres: db error: ERROR: internal error: the first bucket should have NumRange=0
DETAIL: stack trace:
github.com/cockroachdb/cockroach/pkg/sql/opt/props/histogram.go:244: filter()
github.com/cockroachdb/cockroach/pkg/sql/opt/props/histogram.go:387: Filter()
github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:3623: updateHistogram()
github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:3440: applyConstraintSet()
github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:3178: applyFiltersItem()
github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:3094: applyFilters()
github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:3046: filterRelExpr()
github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:1013: buildSelect()
github.com/cockroachdb/cockroach/pkg/sql/opt/memo/logical_props_builder.go:288: buildSelectProps()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/memo/expr.og.go:19827: MemoizeSelect()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:1451: ConstructSelect()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:23753: CopyAndReplaceDefault()
github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:353: func2()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:25068: invokeReplace()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:23805: CopyAndReplaceDefault()
github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:353: func2()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:25068: invokeReplace()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:23717: CopyAndReplaceDefault()
github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:353: func2()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:25068: invokeReplace()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:23760: CopyAndReplaceDefault()
github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:353: func2()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:25068: invokeReplace()
github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:312: CopyAndReplace()
github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:355: AssignPlaceholders()
github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:489: reuseMemo()
github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:522: buildExecMemo()
github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:232: makeOptimizerPlan()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1474: makeExecPlan()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1082: dispatchToExecutionEngine()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:697: execStmtInOpenState()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:131: func1()
HINT: You have encountered an unexpected error.
Please check the public issue tracker to check whether this problem is
already tracked. If you cannot find it there, please report the error
with details by creating a new issue.
If you would rather not post publicly, please contact us directly
using the support form.
We appreciate your feedback.
Is there another version I should be using?
~~Note that this seems to happen even with SET CLUSTER SETTING sql.stats.forecasts.enabled = false;~~ (No, setting that does seem to fix the problem -- I just didn't realize our scripts didn't reset cockroach if it was already running.)
@tokenrove thank you for letting us know. Sounds like my fix wasn't quite good enough.
> Is there another version I should be using?
Alas, no, v22.2.3 is the correct version containing those fixes.
I will try reproducing the error using the steps above later today or tomorrow.
@tokenrove letting you know that I haven't had a chance to look yet, things got busy. I'll try to look soon.
Closing as fixed by #94866 @michae2 Please re-open this if you have more work to do on this one. #94866 indicates that it's fixed.
> Closing as fixed by #94866 @michae2 Please re-open this if you have more work to do on this one. #94866 indicates that it's fixed.
Based on customer and sentry reports from v22.2.3 - v23.1.3 it appears that #94866 was not the fix. #104857 will ship in v22.2.12 and v23.1.5 and should reduce the number of errors and provide more debugging information.
FYI we have seen some sentry events that occurred on binaries that included #105584, but it looks like all debugging info is redacted :/
The only sentry reports we still see are from 22.2.3, 22.2.12, and 23.1.3. Assuming that https://github.com/cockroachdb/cockroach/pull/113712 fixed this in 22.2.17, 23.1.12, and 23.2.0, I will go ahead and close this.
@michae2 should we revert (perhaps partially) some changes we added in #105584 to aid in debugging this issue?
> @michae2 should we revert (perhaps partially) some changes we added in #105584 to aid in debugging this issue?
Yes, I'll open a PR.
Ah, drat. I closed this too early. There are more recent sentry reports from 22.2.14 and 23.1.13 using the new report created in https://github.com/cockroachdb/cockroach/pull/105584. So looks like https://github.com/cockroachdb/cockroach/pull/113712 did not fix this. Next step is to figure out why the debugging is still redacted in sentry reports.
@michae2 heads up that the sentry event from #129209 https://cockroach-labs.sentry.io/issues/5729023027/?project=164528&referrer=webhooks_plugin appears to have an unredacted full histogram that might shed some light on this issue.
Describe the problem
CRDB is used as a back-end for Materialize, and is repeatedly crashing in our CI system:
To Reproduce
clone the MaterializeInc/materialize repository
cd test/testdrive
./mzcompose down -v ; ./mzcompose run default --redpanda
Use docker logs on the testdrive-materialized-1 container, which runs CRDB internally, to see repeated CRDB stack traces.
Expected behavior
Do not crash
Additional data / screenshots
Environment:
Jira issue: CRDB-22586
gz#17809