Open cockroach-teamcity opened 1 year ago
It looks like sb.md
is nil
here: https://github.com/cockroachdb/cockroach/blob/1698b4960e4576de14d8fe0d5d32eacf934f5bb4/pkg/sql/opt/memo/statistics_builder.go#L589
This is possible if *Memo.SetRoot
is called before this point. I haven't figured out a reproduction yet. However, notice these comments, which may be relevant:
I noticed a problem where ReleaseMemo()
was causing Factory.mem
to be nil'ed out, causing a similar problem, and I resolved it like this in my local branch:
https://github.com/msirek/cockroach/blob/b3427f27bfc5f2994eb984a5c58e13619ff5101a/pkg/sql/opt/norm/factory.go#L180-L181
It may not be the same root cause, but maybe this is also a released memo case, where calling Memo.Init
will make sure that the logPropsBuilder
remains valid.
Let's guard against the nil and log this case and/or report to Sentry so we can try to diagnose if it happens again.
I encountered this error when running the diesel print-schema
command from the rust diesel framework on a freshly created single node database. I wonder if information schema related queries could trigger this.
@liningpan thanks for letting us know! We'll try to reproduce this. If there are any other steps you performed to encounter this bug, please let us know.
The issue appears to be transient, and I can only reproduce the issue immediately after deleting a database and creating a new one with two table. The problem will go away after a few minutes.
The query caused the problem is SELECT obj_description($1::REGCLASS, $2)
, but I cannot manually trigger this with PREPARE/EXECUTE
. Only diesel_cli can reliably trigger the error, which I believe uses libpq.
I was unable to reproduce this with the following steps:
Install rust/diesel:
brew install rust
brew install diesel
Start up a single node cockroach cluster:
cockroach-v23.1.2.darwin-10.9-amd64/cockroach start-single-node --insecure
cockroach-v23.1.2.darwin-10.9-amd64/cockroach sql --insecure
Create, delete, and recreate a database with tables in the SQL shell:
create database db;
create table db.t (a int);
create table db.s (a int);
drop database db cascade;
create database db;
create table db.t (a int);
create table db.s (a int);
Immediately diesel print-schema:
diesel print-schema --database-url=[cockroach sql url to database db]
// @generated automatically by Diesel CLI.
diesel::table! {
s (rowid) {
a -> Nullable<Int8>,
rowid -> Int8,
}
}
diesel::table! {
t (rowid) {
a -> Nullable<Int8>,
rowid -> Int8,
}
}
diesel::allow_tables_to_appear_in_same_query!(
s,
t,
);
@liningpan Do these steps mimic what you were doing to recreate the issue? Do you have any other tips that could help us reproduce this? Thanks in advance.
I tried the same steps as @rharding6373 on a gceworker (except for different rust and diesel installation steps), and also could not reproduce the problem.
@liningpan if you could provide the DDL of the tables you are creating or any other other details, that might help us reproduce the issue.
Let's implement the safeguard suggested in https://github.com/cockroachdb/cockroach/issues/104009#issuecomment-1579288607 and then close this issue, since we don't have a reproduction.
I'm not sure how I came to the conclusion that sb.md
is nil: https://github.com/cockroachdb/cockroach/issues/104009#issuecomment-1572578619
Maybe it's md.tables
or md.tables[...].anns
?
Hi! Sorry for the late response. I've been trying to create a minimal example, and here it is https://github.com/liningpan/cockroach-null-ptr.
This example is adapted from one of the unit tests in diesel, and a notable difference is that it is executed in a test transaction
.
With the rust binary program I linked above, I can reliably trigger the nil pointer dereference error both on my M2 Mac and an x86 Linux machine on a fresh single node cockroach-v23.1.5
.
rm -rf cockroach-data
cockroach start-single-node --insecure
DATABASE_URL='postgresql://root@localhost:26257/defaultdb?sslmode=disable' cargo run
Let me know if this helps you guys reproduce the issues! To me, it seems like some sort of race condition or things not being initialized properly.
Thanks for going through all the effort to repro this for us, @liningpan! I've managed to repro this reliably using these steps.
I used Wireshark to capture the client-server traffic. Here is a repro; we can add this to a file in pkg/sql/pgwire/testdata/pgtest/
and then run with ./dev test pkg/sql/pgwire -f=TestPGTest/file_name
.
send
Query {"String": "DROP TABLE IF EXISTS test_schema.table_1 CASCADE"}
Query {"String": "DROP SCHEMA IF EXISTS test_schema CASCADE"}
Query {"String": "CREATE SCHEMA test_schema"}
Query {"String": "CREATE TABLE test_schema.table_1 (id SERIAL PRIMARY KEY, text_col VARCHAR, not_null TEXT NOT NULL)"}
Query {"String": "COMMENT ON TABLE test_schema.table_1 IS 'table comment'"}
----
until ignore=NoticeResponse
ReadyForQuery
ReadyForQuery
ReadyForQuery
ReadyForQuery
ReadyForQuery
----
{"Type":"CommandComplete","CommandTag":"DROP TABLE"}
{"Type":"ReadyForQuery","TxStatus":"I"}
{"Type":"CommandComplete","CommandTag":"DROP SCHEMA"}
{"Type":"ReadyForQuery","TxStatus":"I"}
{"Type":"CommandComplete","CommandTag":"CREATE SCHEMA"}
{"Type":"ReadyForQuery","TxStatus":"I"}
{"Type":"CommandComplete","CommandTag":"CREATE TABLE"}
{"Type":"ReadyForQuery","TxStatus":"I"}
{"Type":"CommandComplete","CommandTag":"COMMENT ON TABLE"}
{"Type":"ReadyForQuery","TxStatus":"I"}
send
Parse {"Query": "SELECT obj_description($1::regclass, $2)", "ParameterOIDs": [25, 25]}
Bind {"ParameterFormatCodes": [1,1], "Parameters": [{"binary":"746573745f736368656d612e7461626c655f31"}, {"binary":"70675f636c617373"}]}
Execute
Sync
----
until
ReadyForQuery
----
{"Type":"ParseComplete"}
{"Type":"BindComplete"}
{"Type":"DataRow","Values":[{"text":"table comment"}]}
{"Type":"CommandComplete","CommandTag":"SELECT 1"}
{"Type":"ReadyForQuery","TxStatus":"I"}
Looks like it only happens when using a prepared statement with binary-encoded parameters, and it might be specific to obj_description
.
Thanks @rafiss! I should be able to diagnose the problem from here.
Here's a miminized reproduction that works without any specific pgwire encoding:
CREATE TABLE d (d1 STRING);
CREATE TABLE c (c1 STRING, INDEX (c1));
CREATE FUNCTION f(x STRING) RETURNS STRING STABLE LANGUAGE SQL AS $$
SELECT 1 FROM d JOIN c ON d.d1 = c.c1 WHERE d.d1 = x
$$;
PREPARE p AS SELECT f($1::INT::STRING);
EXECUTE p(1);
The events/state that lead to the nil-pointer exception are:
PREPARE
the memo is "detached" and some pointers are set to nil
to allow the GC to clean up some objects. In this specific reproduction, resetting the memo's logicalPropsBuilder
is critical because this contains a pointer back to the memo. This pointer is set to nil
during detachment.scanGroup
) have a pointer to the memo that has been detached.PREPARE
the UDF is not inlined because it can only be inlined when it has constant, variable, or placeholder expressions. The double cast $1::INT::STRING
in the reproduction above prevents inlining in this case.EXECUTE
, the UDF is inlined when the casts are folded to a constant value. The inlined subquery contains expressions copied from the cached, prepared memo. Some expressions are not rebuilt because they do not contain placeholder expressions. They retain references to the detached memo.EXECUTE
, the original cached memo is accessed via the Memo()
method of some expressions that were not rebuilt. Fields of the cached memo, which were set to nil
during detachment, are accessed and cause nil-pointer exceptions. In this example, the NPE occurs when accessing the scan's memo here: https://github.com/cockroachdb/cockroach/blob/c7ecbe29fe222b1e1b8b5106238e160e06b3ce9b/pkg/sql/opt/xform/join_funcs.go#L402
Specifically, the logicalPropsBuilder
's statisticBuilder
is accessed which has a nil *opt.Metadata
.This does not reproduce for UDFs that are never inlined because each statement's expression is copied and re-optimized during evaluation of the UDF. This rebuilds the expressions with references to a non-detached memo.
Statements without UDFs are not affected by this, because when they are rebuilt during EXECUTE
in AssignPlaceholders
, we traverse the entire expression tree via CopyAndReplaceDefault
which rebuilds each expression with references to a non-detached memo. UDFs are the only expressions vulnerable to this problem because they contain relational expressions within their "private" field, instead of as a child of the expression, so those expressions are not traversed during CopyAndReplaceDefault
.
I see a few potential action items:
RelExpr.Memo()
where it is easy to do so. This is easy in some places (like the line linked above) because a reference to the correct memo is readily available. This will reduce the chance of triggering these NPEs, but it will not solve the problem entirely as long as some usage of RelExpr.Memo()
remains.RelExpr.Memo()
method and all references to a memo within expressions and expression groups. I spent some time reworking code to see if this was possible, and I believe it is, with some elbow grease. A few things become a bit less natural (e.g., we'll have to remove the String()
method of opt.Expr
), but the majority of changes are plumbing a *memo.Memo
or *opt.Metadata
around.UDFCallExpr
so that they can be rebuilt during AssignPlaceholders
with the correct references to the non-detached memo. @DrewKimball Are there any reasons you can think of why this would be a bad idea—I remember trying this and running into problems, though I don't remember specifics—maybe you tried it as well?Include a UDF body's relational expression as children in the UDFCallExpr so that they can be rebuilt during AssignPlaceholders with the correct references to the non-detached memo. @DrewKimball Are there any reasons you can think of why this would be a bad idea—I remember trying this and running into problems, though I don't remember specifics—maybe you tried it as well?
Moving the body out of the UDF private would require some optgen changes, since it doesn't have any conception of a slice of RelExpr
. I did play around with it before, and it wasn't trivial, though I don't remember details. We also need the ability to have a UDFCallExpr
without its body being built yet for recursive routines.
I think we should probably be copying the UDF body entirely when it gets inlined. Maybe we could keep some reference counter to avoid the copy in the happy case, but copying should probably be the default. I wonder if there are other potential bugs here beyond the issue with Memo()
.
We also need the ability to have a UDFCallExpr without its body being built yet for recursive routines.
Sounds like it'll be tricky to make sure we only copy the body expressions once even when there are multiple UDFCallExpr
s in the tree.
I wonder if there are other potential bugs here beyond the issue with Memo().
Good point. I wouldn't be surprised to find other dragons lurking here.
I'll see if I can come up with a reasonable way to copy the body expressions.
A simple reproduction from the report in #124004:
CREATE TABLE foo();
PREPARE p AS SELECT obj_description($1::REGCLASS::OID, 'pg_class');
EXECUTE p('foo');
This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.
Sentry link: https://cockroach-labs.sentry.io/issues/4212208041/?referrer=webhooks_plugin
Panic message:
Stacktrace (expand for inline code snippets):
https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/util/errorutil/catch.go#L28-L30 in pkg/util/errorutil.ShouldCatch https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/optimizer.go#L247-L249 in pkg/sql/opt/xform.(*Optimizer).Optimize.func1 GOROOT/src/runtime/panic.go#L883-L885 in runtime.gopanic GOROOT/src/runtime/panic.go#L259-L261 in runtime.panicmem GOROOT/src/runtime/signal_unix.go#L834-L836 in runtime.sigpanic https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/table_meta.go#L551-L553 in pkg/sql/opt.(*Metadata).TableAnnotation https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/memo/statistics_builder.go#L588-L590 in pkg/sql/opt/memo.(*statisticsBuilder).makeTableStatistics https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/memo/statistics_builder.go#L282-L284 in pkg/sql/opt/memo.(*statisticsBuilder).availabilityFromInput https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/memo/statistics_builder.go#L792-L794 in pkg/sql/opt/memo.(*statisticsBuilder).buildScan https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/memo/logical_props_builder.go#L185-L187 in pkg/sql/opt/memo.(*logicalPropsBuilder).buildScanProps https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/bazel-out/k8-opt/bin/pkg/sql/opt/memo/expr.og.go#L19991-L19993 in pkg/sql/opt/memo.(*Memo).MemoizeScan https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/join_funcs.go#L399-L401 in pkg/sql/opt/xform.(*CustomFuncs).generateLookupJoinsImpl https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/join_funcs.go#L260-L262 in pkg/sql/opt/xform.(*CustomFuncs).GenerateLookupJoins https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/bazel-out/k8-opt/bin/pkg/sql/opt/xform/explorer.og.go#L829-L831 in pkg/sql/opt/xform.(*explorer).exploreInnerJoin https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/bazel-out/k8-opt/bin/pkg/sql/opt/xform/explorer.og.go#L27-L29 in pkg/sql/opt/xform.(*explorer).exploreGroupMember https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/explorer.go#L184-L186 in pkg/sql/opt/xform.(*explorer).exploreGroup https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/optimizer.go#L535-L537 in pkg/sql/opt/xform.(*Optimizer).optimizeGroup https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/optimizer.go#L298-L300 in pkg/sql/opt/xform.(*Optimizer).optimizeExpr https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/optimizer.go#L577-L579 in pkg/sql/opt/xform.(*Optimizer).optimizeGroupMember https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/optimizer.go#L522-L524 in pkg/sql/opt/xform.(*Optimizer).optimizeGroup https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/optimizer.go#L298-L300 in pkg/sql/opt/xform.(*Optimizer).optimizeExpr https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/optimizer.go#L577-L579 in pkg/sql/opt/xform.(*Optimizer).optimizeGroupMember https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/optimizer.go#L522-L524 in pkg/sql/opt/xform.(*Optimizer).optimizeGroup https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/optimizer.go#L298-L300 in pkg/sql/opt/xform.(*Optimizer).optimizeExpr https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/optimizer.go#L577-L579 in pkg/sql/opt/xform.(*Optimizer).optimizeGroupMember https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/optimizer.go#L522-L524 in pkg/sql/opt/xform.(*Optimizer).optimizeGroup https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/optimizer.go#L298-L300 in pkg/sql/opt/xform.(*Optimizer).optimizeExpr https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/optimizer.go#L620-L622 in pkg/sql/opt/xform.(*Optimizer).optimizeScalarExpr https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/optimizer.go#L310-L312 in pkg/sql/opt/xform.(*Optimizer).optimizeExpr https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/optimizer.go#L620-L622 in pkg/sql/opt/xform.(*Optimizer).optimizeScalarExpr https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/optimizer.go#L310-L312 in pkg/sql/opt/xform.(*Optimizer).optimizeExpr https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/sql/opt/xform/optimizer.go#L620-L622 in pkg/sql/opt/xform.(*Optimizer).optimizeScalarExprv23.1.2
Jira issue: CRDB-28306