cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.92k stars 3.78k forks source link

sql: DistSQL processors can use a Leaf txn without collecting its metadata #41222

Open andreimatei opened 4 years ago

andreimatei commented 4 years ago

When using a Leaf txn, someone needs to collect its metadata and pass it through to the DistSQL receiver which merges it with the Root's metadata. A handful of processors (e.g. the TableReader) do this by collecting the leaf metadata when draining. However, this was only done for processors directly use the transaction. But I believe any processor can use the transaction through its "render expressions" and such by invoking built-in functions. These functions can use the transaction through the EvalCtx. And so I think it might be possible for a txn to be used without anyone collecting its metadata.

Perhaps at the moment that's not actually happening because processors share their transaction objects (and also share it with the EvalCtx), but I'm moving towards less sharing. So the thing, if not broken already, is very fragile.

Separately, our collecting of txn metadata is very haphazard. It's possible for multiple processors to collect the same piece of metadata from a shared txn. We should figure out a more principled story.

Jira issue: CRDB-5476

knz commented 4 years ago

@andreimatei can you explain more what the effects of "not collecting the metadata" can be. Does this mean that we can "lose a read" and fail to populate the ts cache, or something similar? Is it possible to observe transaction anomalies, for example a query to system.descriptors via a built-in function fail to mark the txn to serialize properly with concurrent DDL? Are there other risks?

(I am trying to assess severity for the release)

cc @jordanlewis

andreimatei commented 4 years ago

Technically the consequences of not collecting that metadata are that the transaction can succeed in refreshing without actually refreshing the reads performed through the leaf transaction whose metadata was not collected. This can result in write skew - so a failure of serializability (albeit not the worst one conceivable).

Now, I'm not completely sure that problems can manifest themselves today. In order for badness to occur, you need to a) run one of these builtin functions that uses the txn in a leaf txn and b) not collect the metadata for that leaf txn. It might be the case that today we plan everything that uses such builtins entirely on the gateway (I'm not sure). Or even if we don't plan them on the gateway, as long as there is a TableReader using the same leaf as the builtin, then we'll end up collecting the metadata. So I'm not sure that you can actually construct a scenario where bad things happen. That's why I haven't assigned a severity. We shouldn't consider this a release blocker. But I do think the situation is precarious at best.

I've filed the issue so I can reference it from new code in #41102. That PR might have been making things worse by introducing more leaves on the gateway. But now the future of that PR seems uncertain.

knz commented 4 years ago

The main thing I'm worried about is things like pg_table_is_visible and others that look at descriptors. These need proper isolation and consistency in relation to DDL occuring in unrelated txns, otherwise ORMs will fail in mysterious and hard-to-debug ways.

github-actions[bot] commented 3 years ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

knz commented 3 years ago

still current

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

knz commented 1 year ago

@yuzefovich @rharding6373 @DrewKimball I think we fixed this right?

yuzefovich commented 1 year ago

No, I think it's still current. In particular, this part

But I believe any processor can use the transaction through its "render expressions" and such by invoking built-in functions. These functions can use the transaction through the EvalCtx. And so I think it might be possible for a txn to be used without anyone collecting its metadata.

requires further investigation to see whether there are cases like this.

I believe that

Perhaps at the moment that's not actually happening because processors share their transaction objects (and also share it with the EvalCtx), but I'm moving towards less sharing. So the thing, if not broken already, is very fragile.

saves us in vast majority of cases. Namely, we generally use a single Txn object for the whole flow (modulo the streamer component), and most likely there will be a processor in that flow that will collect the txn's metadata even if the txn is used by another processor. However, it doesn't seem impossible to have a plan where that doesn't happen.