cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

crash: internal error: the first bucket should have NumRange=0 #93892

Open philip-stoev opened 1 year ago

philip-stoev commented 1 year ago

Describe the problem

CRDB is used as a back-end for Materialize, and is repeatedly crashing in our CI system:

internal error: the first bucket should have NumRange=0

To Reproduce

  1. clone the MaterializeInc/materialize repository

  2. cd test/testdrive

  3. ./mzcompose down -v ; ./mzcompose run default --redpanda

  4. Use docker logs on the testdrive-materialized-1 container, which runs CRDB internally, to see repeated CRDB stack traces.

Expected behavior

Do not crash

Additional data / screenshots

testdrive-materialized-1  | DETAIL: stack trace:
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/opt/props/histogram.go:211: filter()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/opt/props/histogram.go:354: Filter()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:3583: updateHistogram()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:3400: applyConstraintSet()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:3163: applyFiltersItem()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:3079: applyFilters()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:3031: filterRelExpr()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:1013: buildSelect()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/opt/memo/logical_props_builder.go:288: buildSelectProps()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/memo/expr.og.go:19827: MemoizeSelect()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:1451: ConstructSelect()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:23720: CopyAndReplaceDefault()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:326: func2()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:25035: invokeReplace()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:23772: CopyAndReplaceDefault()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:326: func2()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:25035: invokeReplace()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:23684: CopyAndReplaceDefault()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:326: func2()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:25035: invokeReplace()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:23727: CopyAndReplaceDefault()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:326: func2()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:25035: invokeReplace()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:285: CopyAndReplace()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:328: AssignPlaceholders()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:488: reuseMemo()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:521: buildExecMemo()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:231: makeOptimizerPlan()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1432: makeExecPlan()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1058: dispatchToExecutionEngine()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:687: execStmtInOpenState()
testdrive-materialized-1  | github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:129: func1()
testdrive-materialized-1  | 
testdrive-materialized-1  | HINT: You have encountered an unexpected error.

Environment:

Jira issue: CRDB-22586

gz#17809

blathers-crl[bot] commented 1 year ago

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

cucaroach commented 1 year ago

Hey @philip-stoev! How are you! I tried to reproduce this but couldn't; my docker logs don't have any errors. The last couple of lines from the command you provided were:

> SELECT * FROM schema_strategy_test_id
rows didn't match; sleeping to see if dataflow catches up 50ms 75ms 113ms 169ms 253ms^Cmzcompose: test case workflow-default failed: running docker compose failed (exit status 130)
mzcompose: error: at least one test case failed

So maybe I didn't get far enough? What would really be helpful is the query and schema for the failing query; if possible, running the query with an "EXPLAIN ANALYZE (DEBUG)" prefix would be ideal. This will produce a statement bundle that should give us everything we need in one place. Let me know if that works.
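For example (the table and predicate here are placeholders; substitute the actual failing statement):

-- Produces a downloadable statement bundle containing the schema,
-- statistics, and trace for the statement in a single zip.
EXPLAIN ANALYZE (DEBUG) SELECT * FROM data WHERE time > 0;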

benesch commented 1 year ago

It occurs to me: this error didn't happen until ~45m into the CI run. That's consistent across all five CI jobs that failed. And we only saw this in our long-running nightly tests. This bug looks related to query statistics. Is it possible there's some buggy statistics job in Cockroach that only kicks in after 40m or so?

michae2 commented 1 year ago

> It occurs to me: this error didn't happen until ~45m into the CI run. That's consistent across all five CI jobs that failed. And we only saw this in our long-running nightly tests. This bug looks related to query statistics. Is it possible there's some buggy statistics job in Cockroach that only kicks in after 40m or so?

@benesch yes, it's definitely possible that the statistics causing this issue are only collected about 40m into the tests. With automatic statistics enabled, statistics are collected as tables are modified. Assuming the long-running test is constantly modifying this table, it's possible that the statistics triggering this crash are collected at roughly 40m.
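For reference, automatic collection is governed by cluster settings along these lines (a sketch; defaults vary by version):

-- Automatic stats refresh after roughly
--   min_stale_rows + fraction_stale_rows * row_count
-- rows have been inserted, updated, or deleted.
SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled;
SHOW CLUSTER SETTING sql.stats.automatic_collection.fraction_stale_rows;
SHOW CLUSTER SETTING sql.stats.automatic_collection.min_stale_rows;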

Before those prepared statements execute, assuming they always crash, it would be helpful to have the results of these statements:

SHOW CREATE TABLE data;
SHOW STATISTICS USING JSON FOR TABLE data;

We'll keep trying to reproduce it here, too.

I may be jumping to conclusions, but one possibility is that this crash is related to the new statistics forecasting feature in v22.2. You could try turning that off to see if it is indeed the problem. Here are some ways to do that, from most general to most specific:
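For example, the cluster-wide switch:

-- Disable statistics forecasting for the whole cluster (new in v22.2).
SET CLUSTER SETTING sql.stats.forecasts.enabled = false;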

benesch commented 1 year ago

> Assuming the long-running test is constantly modifying this table

💡 Yes, that sounds like it! That explains why I've seen it happen after just 15-20m on my local machine, which is about twice as fast as the CI machines. It's always triggered around the same point in the test script, both on my local machine and in CI. So the evidence is totally consistent with a statistics job that is triggered based on write volume to the table. (We use CockroachDB as basically a key-value store, so there's one table that gets written to on basically every transaction.)

I'll try disabling stats forecasting and report back.

cucaroach commented 1 year ago

I pulled 231879a3635a9b41eac2fd1bb03a56f6c6dc0a3c and docker logs show I'm using 22.2. Maybe my machine is too fast...

benesch commented 1 year ago

Hrm. So you're seeing ./mzcompose run default --redpanda run to completion, without experiencing the issue? How long does it take from start to end?

cucaroach commented 1 year ago

I didn't time the first one, but it seemed like 10m or so, and then I got the error above. Round 2 is still running and seems like it's taking longer ... okay, I just hit the error. I'll see if I can figure out what's going on.

benesch commented 1 year ago

Ah fantastic! Looking at this a bit deeper, it seems that the error doesn’t cause a crash, but instead returns the error to the client with a backtrace.

If it’s helpful to be able to connect to the Cockroach instance after it starts producing this error, so that you can interactively run queries against it, you should be able to do something like the following:

./mzcompose exec materialized cockroach sql

I haven’t actually tested this, but the idea is that the cockroach process is running inside the materialized container, and you should be able to exec in and connect to it.

cucaroach commented 1 year ago

What would be helpful is if I could replace the binary with a modified one and re-run the tests; I'm not sure how to do that.

benesch commented 1 year ago

You can edit the Dockerfile in misc/images/materialized/Dockerfile to install whatever custom version of Cockroach you'd like! https://github.com/MaterializeInc/materialize/blob/8b4f685a6da9c2468a6aea3e405b997e077aef6f/misc/images/materialized/Dockerfile#L10-L39

As soon as you make that change, the next execution of ./mzcompose run will automatically rebuild. Note this will take a while to complete the first time, as it'll have to recompile Materialize from scratch instead of just downloading the cached materialized image from Docker Hub.

cucaroach commented 1 year ago

I made this change:

diff --git a/misc/images/materialized/Dockerfile b/misc/images/materialized/Dockerfile
index bb07a75af..92dfa6baa 100644
--- a/misc/images/materialized/Dockerfile
+++ b/misc/images/materialized/Dockerfile
@@ -24,7 +24,8 @@ RUN apt-get update \
     && mkdir /cockroach-data \
     && chown materialize /mzdata /cockroach-data

-COPY --from=crdb /cockroach/cockroach /usr/local/bin/cockroach
+#COPY --from=crdb /cockroach/cockroach /usr/local/bin/cockroach
+COPY cockroach-materialize /usr/local/bin/cockroach

 COPY storaged computed environmentd entrypoint.sh /usr/local/bin/

Doesn't seem to work:

❯ ./mzcompose run default --redpanda 
==> Collecting mzbuild images
materialize/ubuntu-base:mzbuild-HK3XE35BRSUNJPVZJFNR5ZIBS57NGZAL
materialize/materialized:mzbuild-3AHXVLDWH73UMFQG457IBMPMA2FJONUE
materialize/testdrive:mzbuild-O6JF7GBJ2Q23A732N2QVHY2PPVIPI5LX
warning: Docker only has 7.7 GiB of memory available. We recommend at least 8.0 GiB of memory. See https://materialize.com/docs/third-party/docker/.
==> Acquiring materialize/materialized:mzbuild-3AHXVLDWH73UMFQG457IBMPMA2FJONUE
$ docker pull materialize/materialized:mzbuild-3AHXVLDWH73UMFQG457IBMPMA2FJONUE
Error response from daemon: manifest for materialize/materialized:mzbuild-3AHXVLDWH73UMFQG457IBMPMA2FJONUE not found: manifest unknown: manifest unknown
$ git clean -ffdX /Users/treilly/src/materialize/misc/images/materialized
$ brew install materializeinc/crosstools/aarch64-unknown-linux-gnu
Warning: materializeinc/crosstools/aarch64-unknown-linux-gnu 0.1.0 is already installed and up-to-date.
To reinstall 0.1.0, run:
  brew reinstall aarch64-unknown-linux-gnu
$ rustup target add aarch64-unknown-linux-gnu
Traceback (most recent call last):
  File "/Users/treilly/src/materialize/misc/python/materialize/mzbuild.py", line 490, in acquire
    spawn.runv(
  File "/Users/treilly/src/materialize/misc/python/materialize/spawn.py", line 71, in runv
    return subprocess.run(
  File "/opt/homebrew/Cellar/python@3.10/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['docker', 'pull', 'materialize/materialized:mzbuild-3AHXVLDWH73UMFQG457IBMPMA2FJONUE']' returned non-zero exit status 1.

It's failing doing this:

❯ docker pull materialize/materialized:mzbuild-3AHXVLDWH73UMFQG457IBMPMA2FJONUE
Error response from daemon: manifest for materialize/materialized:mzbuild-3AHXVLDWH73UMFQG457IBMPMA2FJONUE not found: manifest unknown: manifest unknown

I don't know what's going on, but if I revert my change it works fine; maybe my machine isn't equipped to do a full image rebuild.

cucaroach commented 1 year ago

Can I just docker cp a new file into place and re-run the tests with the existing containers?

benesch commented 1 year ago

I’m struggling to see what went wrong in the snippet you pasted above. It’s meant to be OK for that docker pull to fail, as that’s what triggers the local build. We specifically catch that CalledProcessError:

https://github.com/MaterializeInc/materialize/blob/main/misc/python/materialize/mzbuild.py#L494

Was there more error output after that? Something like “During handling of the above exception, another exception occurred”, followed by a different exception?

benesch commented 1 year ago

> Can I just docker cp a new file into place and re-run the tests with the existing containers?

In theory, for sure, though you’d be treading new ground. I think you could do something like:

$ docker create materialize/materialized:mzbuild-MZBUILDHASH
CID
$ docker cp cockroach CID:/usr/local/bin/cockroach
$ docker commit CID materialize/materialized:mzbuild-MZBUILDHASH

That would locally overwrite your copy of the materialized image at the relevant hash with a patched version that includes your custom build of Cockroach. Then the next ./mzcompose run would use that cached image. To go back to the upstream image:

$ docker rmi materialize/materialized:mzbuild-HASH

cucaroach commented 1 year ago

Yes, there was another error but I didn't look at it closely; apparently I need to brew install rustup and put .cargo/bin in my PATH to get further. But I didn't get much further: now I get a git authentication error:

❯ ./mzcompose run default --redpanda 
==> Collecting mzbuild images
materialize/ubuntu-base:mzbuild-HK3XE35BRSUNJPVZJFNR5ZIBS57NGZAL
materialize/materialized:mzbuild-2LBPQ73M3CBUOH7X4SISECRVIWWDFIIY
materialize/testdrive:mzbuild-O6JF7GBJ2Q23A732N2QVHY2PPVIPI5LX
warning: Docker only has 7.7 GiB of memory available. We recommend at least 8.0 GiB of memory. See https://materialize.com/docs/third-party/docker/.
==> Acquiring materialize/materialized:mzbuild-2LBPQ73M3CBUOH7X4SISECRVIWWDFIIY
$ docker pull materialize/materialized:mzbuild-2LBPQ73M3CBUOH7X4SISECRVIWWDFIIY
Error response from daemon: manifest for materialize/materialized:mzbuild-2LBPQ73M3CBUOH7X4SISECRVIWWDFIIY not found: manifest unknown: manifest unknown
$ git clean -ffdX /Users/treilly/src/materialize/misc/images/materialized
$ brew install materializeinc/crosstools/aarch64-unknown-linux-gnu
Warning: materializeinc/crosstools/aarch64-unknown-linux-gnu 0.1.0 is already installed and up-to-date.
To reinstall 0.1.0, run:
  brew reinstall aarch64-unknown-linux-gnu
$ rustup target add aarch64-unknown-linux-gnu
info: downloading component 'rust-std' for 'aarch64-unknown-linux-gnu'
info: installing component 'rust-std' for 'aarch64-unknown-linux-gnu'
 39.8 MiB /  39.8 MiB (100 %)  19.3 MiB/s in  2s ETA:  0s
$ env CMAKE_SYSTEM_NAME=Linux CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_LINKER=aarch64-unknown-linux-gnu-cc CARGO_TARGET_DIR=/Users/treilly/src/materialize/target-xcompile TARGET_AR=aarch64-unknown-linux-gnu-ar TARGET_CPP=aarch64-unknown-linux-gnu-cpp TARGET_CC=aarch64-unknown-linux-gnu-cc TARGET_CXX=aarch64-unknown-linux-gnu-c++ TARGET_CXXSTDLIB=static=stdc++ TARGET_LD=aarch64-unknown-linux-gnu-ld TARGET_RANLIB=aarch64-unknown-linux-gnu-ranlib 'RUSTFLAGS=-Clink-arg=-Wl,--compress-debug-sections=zlib -L/opt/homebrew/Cellar/aarch64-unknown-linux-gnu/0.1.0/bin/../aarch64-unknown-linux-gnu/sysroot/lib' cargo build --target aarch64-unknown-linux-gnu --bin storaged --bin computed --bin environmentd --release
warning: /Users/treilly/src/materialize/src/workspace-hack/Cargo.toml: version requirement `0.4.2+5.2.1-patched.2` for dependency `tikv-jemalloc-sys` includes semver metadata which will be ignored, removing the metadata is recommended to avoid confusion
warning: /Users/treilly/src/materialize/src/workspace-hack/Cargo.toml: version requirement `0.4.2+5.2.1-patched.2` for dependency `tikv-jemalloc-sys` includes semver metadata which will be ignored, removing the metadata is recommended to avoid confusion
    Updating git repository `https://github.com/tokio-rs/axum.git`
error: failed to load source for dependency `axum`

Caused by:
  Unable to update https://github.com/tokio-rs/axum.git#71e83291

Caused by:
  failed to clone into: /Users/treilly/.cargo/git/db/axum-3a6345d9aff97fa3

Caused by:
  failed to authenticate when downloading repository: git@github.com:/tokio-rs/axum.git

  * attempted ssh-agent authentication, but no usernames succeeded: `git`

  if the git CLI succeeds then `net.git-fetch-with-cli` may help here
  https://doc.rust-lang.org/cargo/reference/config.html#netgit-fetch-with-cli

Caused by:
  no authentication available
Traceback (most recent call last):
  File "/Users/treilly/src/materialize/misc/python/materialize/mzbuild.py", line 490, in acquire
    spawn.runv(
  File "/Users/treilly/src/materialize/misc/python/materialize/spawn.py", line 71, in runv
    return subprocess.run(
  File "/opt/homebrew/Cellar/python@3.10/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['docker', 'pull', 'materialize/materialized:mzbuild-2LBPQ73M3CBUOH7X4SISECRVIWWDFIIY']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.10/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/homebrew/Cellar/python@3.10/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/treilly/src/materialize/misc/python/materialize/cli/mzcompose.py", line 681, in <module>
    main(sys.argv[1:])
  File "/Users/treilly/src/materialize/misc/python/materialize/cli/mzcompose.py", line 132, in main
    args.command.invoke(args)
  File "/Users/treilly/src/materialize/misc/python/materialize/cli/mzcompose.py", line 200, in invoke
    self.run(args)
  File "/Users/treilly/src/materialize/misc/python/materialize/cli/mzcompose.py", line 537, in run
    super().run(args)
  File "/Users/treilly/src/materialize/misc/python/materialize/cli/mzcompose.py", line 428, in run
    composition.dependencies.acquire()
  File "/Users/treilly/src/materialize/misc/python/materialize/mzbuild.py", line 639, in acquire
    dep.acquire()
  File "/Users/treilly/src/materialize/misc/python/materialize/mzbuild.py", line 495, in acquire
    self.build()
  File "/Users/treilly/src/materialize/misc/python/materialize/mzbuild.py", line 463, in build
    pre_image.run()
  File "/Users/treilly/src/materialize/misc/python/materialize/mzbuild.py", line 326, in run
    self.build()
  File "/Users/treilly/src/materialize/misc/python/materialize/mzbuild.py", line 257, in build
    spawn.runv(cargo_build, cwd=self.rd.root)
  File "/Users/treilly/src/materialize/misc/python/materialize/spawn.py", line 71, in runv
    return subprocess.run(
  File "/opt/homebrew/Cellar/python@3.10/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['env', 'CMAKE_SYSTEM_NAME=Linux', 'CARGO_TARGET_AARCH64_UNKNOWN_LINUX_GNU_LINKER=aarch64-unknown-linux-gnu-cc', 'CARGO_TARGET_DIR=/Users/treilly/src/materialize/target-xcompile', 'TARGET_AR=aarch64-unknown-linux-gnu-ar', 'TARGET_CPP=aarch64-unknown-linux-gnu-cpp', 'TARGET_CC=aarch64-unknown-linux-gnu-cc', 'TARGET_CXX=aarch64-unknown-linux-gnu-c++', 'TARGET_CXXSTDLIB=static=stdc++', 'TARGET_LD=aarch64-unknown-linux-gnu-ld', 'TARGET_RANLIB=aarch64-unknown-linux-gnu-ranlib', 'RUSTFLAGS=-Clink-arg=-Wl,--compress-debug-sections=zlib -L/opt/homebrew/Cellar/aarch64-unknown-linux-gnu/0.1.0/bin/../aarch64-unknown-linux-gnu/sysroot/lib', 'cargo', 'build', '--target', 'aarch64-unknown-linux-gnu', '--bin', 'storaged', '--bin', 'computed', '--bin', 'environmentd', '--release']' returned non-zero exit status 101.

I'll try the untrodden path next...

benesch commented 1 year ago

For whatever reason the Rust Git implementation can’t fetch over SSH. If you follow the link below and configure Cargo to use the Git CLI instead, that should work:

> if the git CLI succeeds then `net.git-fetch-with-cli` may help here

https://doc.rust-lang.org/cargo/reference/config.html#netgit-fetch-with-cli
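Concretely, that means adding this to ~/.cargo/config.toml (or exporting CARGO_NET_GIT_FETCH_WITH_CLI=true in the environment):

# Shell out to the git CLI for fetches so Cargo can use your SSH agent.
[net]
git-fetch-with-cli = true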


cucaroach commented 1 year ago

That fixed that issue, thanks!

benesch commented 1 year ago

Sweet! So you were able to get the tests running with your custom build of Cockroach?

cucaroach commented 1 year ago

I was, yes. Haven't hit paydirt yet; it will probably require a couple of rounds of spelunking to get to the bottom of it.

benesch commented 1 year ago

Fantastic! Thanks for fighting the fight to get set up with our test infrastructure.

cucaroach commented 1 year ago

I narrowed this down to this spot in the stats forecasting code:

 ForcecastTableSt(sid:0, tid: 109, ‹0x400b38dd28›) the first bucket should have NumRange=0 not -2.466595e-30
+‹goroutine 72060 [running]:›
+‹runtime/debug.Stack()›
+‹   GOROOT/src/runtime/debug/stack.go:24 +0x64›
+‹github.com/cockroachdb/cockroach/pkg/sql/stats.ForecastTableStatistics({0x59af470, 0x401418a420}, {0x4005135000?, 0x23, 0x40})›
+‹   github.com/cockroachdb/cockroach/pkg/sql/stats/pkg/sql/stats/forecast.go:120 +0x4a0›
+‹github.com/cockroachdb/cockroach/pkg/sql/stats.(*TableStatisticsCache).getTableStatsFromDB(0x4001846dc0, {0x59af470, 0x401418a420}, 0x1?, 0x1)›
+‹   github.com/cockroachdb/cockroach/pkg/sql/stats/pkg/sql/stats/stats_cache.go:749 +0x3d8›
+‹github.com/cockroachdb/cockroach/pkg/sql/stats.(*TableStatisticsCache).refreshCacheEntry.func1(0x4001846dc0, {0x59af470, 0x401418a420}, 0x0?, 0x0?, 0x400f135f48, 0x400f135f28)›
+‹   github.com/cockroachdb/cockroach/pkg/sql/stats/pkg/sql/stats/stats_cache.go:457 +0x12c›
+‹github.com/cockroachdb/cockroach/pkg/sql/stats.(*TableStatisticsCache).refreshCacheEntry(0x4001846dc0, {0x59af470, 0x401418a420}, 0x6a31480?, {0x401498c6e0?, 0xb9e850?, 0x40?})›

Disabling stats forecasts like @michae2 suggested should be a viable workaround; we'll work on a repro and fix next.

michae2 commented 1 year ago

@cucaroach if you manage to get SHOW STATISTICS USING JSON output for the table, and the CREATE TABLE statement, that might be enough to reproduce the error.

cucaroach commented 1 year ago

I was able to get a repro by copying the CRDB data off the docker instance and then running EXPLAIN ANALYZE (DEBUG) on the failing DELETE query. Statement bundle:

stmt-bundle-mzbug.zip

Using "debug sb recreate" and then running the query reproduces the problem.

@michae2 do you want to take it from here?

michae2 commented 1 year ago

Several small problems in stats.(*histogram).adjustCounts and related code combined to cause this error. Here was the sequence of events:

  1. In the most recent statistics collection, integer column data.time had a minimum observed value of -9223372036854775807. While creating the histogram for this column, we called adjustCounts which in turn called addOuterBuckets, which created a new first bucket with upper bound -9223372036854775808 (int64 min). addOuterBuckets added a small amount to NumRange and DistinctRange of the original first bucket, even though there is zero actual range between int64s -9223372036854775808 and -9223372036854775807. This is the first problem.
  2. When converting this histogram to a HistogramData we rounded NumRange but not DistinctRange, so we ended up with a second bucket with zero NumRange but positive DistinctRange. This is the second problem.
      "histo_buckets": [
          {
              "distinct_range": 0,
              "num_eq": 0,
              "num_range": 0,
              "upper_bound": "-9223372036854775808"
          },
          {
              "distinct_range": 2.2737367544323206E-13,
              "num_eq": 146,
              "num_range": 0,
              "upper_bound": "-9223372036854775807"
          },
  3. When forecasting statistics, we then called adjustCounts on this histogram again to create the forecasted histogram. adjustCounts did not expect to find a countable bucket with zero range, NumRange = 0, and DistinctRange > 0, and ended up giving the second bucket a negative NumRange. This is the third problem.
  4. At the end of adjustCounts we called removeZeroBuckets which interpreted the negative NumRange as a reason to remove the first bucket, leaving us with a first bucket with negative NumRange. This is the fourth problem.
  5. Unlike observed histograms, forecasted histograms do not have their NumRanges rounded before they are given to the optimizer. This is the fifth problem.

Fixing any of these would have prevented the error.
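To make the invariant concrete, here is a minimal Go sketch (not the actual CockroachDB code; the type and field names just mirror the HistogramData shown above). The first bucket's upper bound is the histogram's minimum value, so nothing can fall strictly below it, and its NumRange must be exactly zero:

package main

import "fmt"

// bucket mirrors a histogram bucket: NumEq counts rows equal to
// UpperBound, while NumRange and DistinctRange count rows and distinct
// values strictly between the previous upper bound and this one.
type bucket struct {
	NumEq, NumRange, DistinctRange float64
	UpperBound                     int64
}

// checkFirstBucket enforces the invariant that tripped here: no values
// can precede the first upper bound, so even a tiny nonzero NumRange
// (like the -2.466595e-30 in the log above) is an internal error.
func checkFirstBucket(buckets []bucket) error {
	if len(buckets) > 0 && buckets[0].NumRange != 0 {
		return fmt.Errorf("the first bucket should have NumRange=0 not %e",
			buckets[0].NumRange)
	}
	return nil
}

func main() {
	h := []bucket{
		{NumRange: -2.466595e-30, UpperBound: -9223372036854775808},
		{NumEq: 146, UpperBound: -9223372036854775807},
	}
	fmt.Println(checkFirstBucket(h)) // reproduces the error message seen here
}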

tokenrove commented 1 year ago

It looks like these fixes went into 22.2.3, but I am still experiencing this crash on 22.2.3, with this backtrace:

2023-02-08T13:04:53.662581Z ERROR mz_stash::postgres: tokio-postgres stash consolidation error, retry attempt 768: stash error: postgres: db error: ERROR: internal error: the first bucket should have NumRange=0
DETAIL: stack trace:
github.com/cockroachdb/cockroach/pkg/sql/opt/props/histogram.go:244: filter()
github.com/cockroachdb/cockroach/pkg/sql/opt/props/histogram.go:387: Filter()
github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:3623: updateHistogram()
github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:3440: applyConstraintSet()
github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:3178: applyFiltersItem()
github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:3094: applyFilters()
github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:3046: filterRelExpr()
github.com/cockroachdb/cockroach/pkg/sql/opt/memo/statistics_builder.go:1013: buildSelect()
github.com/cockroachdb/cockroach/pkg/sql/opt/memo/logical_props_builder.go:288: buildSelectProps()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/memo/expr.og.go:19827: MemoizeSelect()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:1451: ConstructSelect()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:23753: CopyAndReplaceDefault()
github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:353: func2()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:25068: invokeReplace()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:23805: CopyAndReplaceDefault()
github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:353: func2()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:25068: invokeReplace()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:23717: CopyAndReplaceDefault()
github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:353: func2()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:25068: invokeReplace()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:23760: CopyAndReplaceDefault()
github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:353: func2()
github.com/cockroachdb/cockroach/bazel-out/k8-opt/bin/pkg/sql/opt/norm/factory.og.go:25068: invokeReplace()
github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:312: CopyAndReplace()
github.com/cockroachdb/cockroach/pkg/sql/opt/norm/factory.go:355: AssignPlaceholders()
github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:489: reuseMemo()
github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:522: buildExecMemo()
github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:232: makeOptimizerPlan()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1474: makeExecPlan()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1082: dispatchToExecutionEngine()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:697: execStmtInOpenState()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:131: func1()

HINT: You have encountered an unexpected error.

Please check the public issue tracker to check whether this problem is
already tracked. If you cannot find it there, please report the error
with details by creating a new issue.

If you would rather not post publicly, please contact us directly
using the support form.

We appreciate your feedback.

Is there another version I should be using?

tokenrove commented 1 year ago

~Note that this seems to happen even with SET CLUSTER SETTING sql.stats.forecasts.enabled = false;.~ (No, setting that does seem to fix the problem -- I just didn't realize our scripts didn't reset cockroach if it was already running.)

michae2 commented 1 year ago

@tokenrove thank you for letting us know. Sounds like my fix wasn't quite good enough.

> Is there another version I should be using?

Alas, no, v22.2.3 is the correct version containing those fixes.

I will try reproducing the error using the steps above later today or tomorrow.

michae2 commented 1 year ago

@tokenrove letting you know that I haven't had a chance to look yet, things got busy. I'll try to look soon.

msirek commented 1 year ago

Closing as fixed by #94866. @michae2 please re-open this if you have more work to do on this one; #94866 indicates that it's fixed.

michae2 commented 1 year ago

> Closing as fixed by #94866. @michae2 please re-open this if you have more work to do on this one; #94866 indicates that it's fixed.

Based on customer and sentry reports from v22.2.3 - v23.1.3, it appears that #94866 was not the fix. #104857 will ship in v22.2.12 and v23.1.5 and should reduce the number of errors and provide more debugging information.

yuzefovich commented 1 year ago

FYI we have seen some sentry events that occurred on binaries that included #105584, but it looks like all debugging info is redacted :/

michae2 commented 10 months ago

We still only see sentry reports from 22.2.3, 22.2.12, and 23.1.3. Assuming that https://github.com/cockroachdb/cockroach/pull/113712 fixed this in 22.2.17, 23.1.12, and 23.2.0, I will go ahead and close this.

yuzefovich commented 10 months ago

@michae2 should we revert (perhaps partially) some changes we added in #105584 to aid in debugging this issue?

michae2 commented 10 months ago

> @michae2 should we revert (perhaps partially) some changes we added in #105584 to aid in debugging this issue?

Yes, I'll open a PR.

michae2 commented 10 months ago

Ah, drat. I closed this too early. There are more recent sentry reports from 22.2.14 and 23.1.13 using the new report created in https://github.com/cockroachdb/cockroach/pull/105584. So it looks like https://github.com/cockroachdb/cockroach/pull/113712 did not fix this. The next step is to figure out why the debugging info is still redacted in sentry reports.

yuzefovich commented 1 week ago

@michae2 heads up that the sentry event from #129209 (https://cockroach-labs.sentry.io/issues/5729023027/?project=164528&referrer=webhooks_plugin) appears to have an unredacted full histogram that might shed some light on this issue.