
*: internal executor paradigm is bad for global latency (parallel txn) #60968

Closed ajwerner closed 7 months ago

ajwerner commented 3 years ago

Is your feature request related to a problem? Please describe.

This issue makes the case that the lack of support for parallelism in our *kv.Txn makes it very hard for us to write reasonable-latency global features. At one point the *kv.Txn was believed to be safe for concurrent use, but that turned out to be buggy (https://github.com/cockroachdb/cockroach/issues/17197).


Effectively all new stateful features added to cockroach use sql tables and, increasingly, use SQL to interact with that state. The problem is we tend to build these abstractions without thinking about latency implications, and latency implications are really freaking hard to reason about in this programming model.

One thing we have to help with this, though it has not thus far been super reliable, is the ddl_analysis benchmark suite (https://github.com/cockroachdb/cockroach/pull/50953). It isn't general and isn't really useful for internal subsystems. It also doesn't account for round-trips due to replication, which system tables almost always experience.

This use of SQL is pretty great for many reasons: we get to use SQL abstractions, and we get to dogfood (to an extent; the internal executor isn't exactly the same thing the client experiences, but it's close). Then, if you look at all of the above packages, we end up building the wrappers around these SQL tables as reasonably clean and well-defined abstractions. The problem with these abstractions is that they almost always entail executing at least one SQL statement synchronously. That means, in the normal course of operation, when we need to do something in a loop involving one of these subsystems, we incur at least N global round-trips (see the sketch after the list below).

There are a few different and important points here:

1) The *kv.Txn API does not allow parallel writes (and only allows parallel reads if you jump through some hoops).
2) We tend to scatter the placement of leases and replicas over all regions for system tables.
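To make the cost concrete, here's a minimal sketch of that wrapper pattern; the table, function, and wiring are hypothetical, and the Exec shape approximates the internal executor interface:

```go
package example

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/sql/sqlutil"
)

// updateAll is a hypothetical wrapper in the style described above: one
// synchronous SQL statement per item via the internal executor. Called in
// a loop inside a transaction, N items cost N sequential round-trips to
// the system table's leaseholder, which may live in a far-away region.
func updateAll(
	ctx context.Context, txn *kv.Txn, ie sqlutil.InternalExecutor, ids []int64,
) error {
	for _, id := range ids {
		// Each Exec blocks until the statement completes, so total latency
		// is roughly N * cross-region RTT rather than max(RTT).
		if _, err := ie.Exec(ctx, "update-item", txn,
			`UPDATE system.hypothetical_table SET state = 'done' WHERE id = $1`, id,
		); err != nil {
			return err
		}
	}
	return nil
}
```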

Describe the solution you'd like

I think there are a few different approaches we could take. They range in how radical they are.

1) Exploit parallelism within a transaction.

So, parallelism is probably our best way out; it's certainly the best fit for the programming model we're used to.
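For illustration, a rough sketch of the shape this could take, assuming a hypothetical transaction handle that tolerates concurrent writes (which *kv.Txn does not today):

```go
package example

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// parallelTxn is a hypothetical stand-in for a future transaction handle
// that is safe for concurrent writes; today's *kv.Txn is not (#17197).
type parallelTxn interface {
	Put(ctx context.Context, key, value interface{}) error
}

// writeAllParallel shows the programming model we'd want: issue all N
// writes concurrently within one atomic transaction, so the latency of
// the batch is max(RTT) over the writes instead of their sum.
func writeAllParallel(ctx context.Context, txn parallelTxn, keys, vals []interface{}) error {
	g, gCtx := errgroup.WithContext(ctx)
	for i := range keys {
		i := i // capture the loop variable (pre-Go 1.22 semantics)
		g.Go(func() error {
			return txn.Put(gCtx, keys[i], vals[i])
		})
	}
	return g.Wait() // only commit after every write has succeeded
}
```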

Additional context

Another important point is that these long-running transactions have a tendency to interfere with user queries that introspect this state. Use of global transactions may be the answer here!

There are some approaches to mitigating that, like https://github.com/cockroachdb/cockroach/issues/35712, which might enable some amount of user introspection. That's what we're ultimately doing, in a more hacky way, in https://github.com/cockroachdb/cockroach/pull/60953.

Subsystems I have in mind:

Jira issue: CRDB-3094

cucaroach commented 1 year ago

Adding fuel to the fire here. I'm trying to optimize COPY and I don't think I can do a decent job without parallel writes. For a well-dressed table like tpch.lineitem (8 indexes), I think we'll only achieve meaningful speedups with parallel writes. Parsing strings into datums is ~5% of the time, massaging datums into kv requests is ~10%, and ~85% is kv (5% kvclient / 20% kvserver / 50% storage / 10% "other"). So to get meaningful speedups I need to push parallel kv requests.

IMPORT gets the job done by slicing and dicing and building SSTs, which works at big scales, but we want small and medium-sized COPYs to be fast and able to overlap with existing keys. We also don't want COPY to go crazy with concurrency and resource consumption for chunks of work that don't justify all the coordination.

So the current thinking is that a nice middle ground would be to use separate goroutines to write to the primary table and each index. This should minimize work spent splitting up kv batches into replica traffic and maximize speedups from overlapping writes to separate replicas. No reason not to exploit the same approach for all batched inserts, I think.
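A minimal sketch of that middle ground, with hypothetical types standing in for the real COPY machinery:

```go
package example

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// indexBatch is a hypothetical slice of one COPY chunk's KV writes for a
// single index: one batch for the primary index, one per secondary index.
type indexBatch struct {
	indexID int
	write   func(ctx context.Context) error // issues this index's KV batch
}

// writeIndexesConcurrently sketches the proposed middle ground: one
// goroutine per index, so writes to the primary and, say, the 8 secondary
// indexes of tpch.lineitem overlap instead of serializing, while the
// fan-out stays bounded by the index count rather than the row count.
func writeIndexesConcurrently(ctx context.Context, batches []indexBatch) error {
	g, gCtx := errgroup.WithContext(ctx)
	for _, b := range batches {
		b := b // capture the loop variable (pre-Go 1.22 semantics)
		g.Go(func() error { return b.write(gCtx) })
	}
	return g.Wait()
}
```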

The only alternative is to go back to the ugly days of non-atomic COPY and just parallelize writing chunks of the COPY rows, but that's a step backwards and will require an opt-in with a scary THIS ISN'T TRANSACTIONAL warning that will probably scare off most users.

Note that parallel writes achieved through a nested transaction or root/leaf model would be just fine as long as the COPY is atomic.

rafiss commented 1 year ago

> The only alternative is to go back to the ugly days of non-atomic COPY and just parallelize writing chunks of the COPY rows, but that's a step backwards and will require an opt-in with a scary THIS ISN'T TRANSACTIONAL warning that will probably scare off most users.

Can we make the non-transactional behavior opt-in? For example, with https://github.com/cockroachdb/cockroach/issues/85573 or some other CRDB-specific option for COPY.

ajwerner commented 1 year ago

I think there's a bright future in deferring the writes. I don't quite know how to do it, but it seems possible, in the case of COPY, to on some level coax the execution into just buffering the writes and running them later. I think it gets complex for fancier executions like FK checks.
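A toy sketch of what that buffering could look like, with all names hypothetical:

```go
package example

import "context"

// writeBuffer is a toy sketch of the deferred-write idea: COPY execution
// appends its KV writes to an in-memory buffer instead of issuing them
// synchronously, and a later flush sends them all at once (and could do
// so in parallel). FK checks are the hard part: they'd need to read a
// view that includes the buffered-but-unsent writes.
type writeBuffer struct {
	puts []func(ctx context.Context) error
}

// buffer records a write without performing any round-trip.
func (b *writeBuffer) buffer(put func(ctx context.Context) error) {
	b.puts = append(b.puts, put)
}

// flush issues everything that was buffered; this is where parallelism
// (or batching) could be applied later.
func (b *writeBuffer) flush(ctx context.Context) error {
	for _, put := range b.puts {
		if err := put(ctx); err != nil {
			return err
		}
	}
	b.puts = nil
	return nil
}
```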

cucaroach commented 1 year ago

> Can we make the non-transactional behavior opt-in? For example, with #85573 or some other CRDB-specific option for COPY.

We entertained that notion for 22.2 but dismissed it. I really think it's a step in the wrong direction, but if parallel local-only writes are too big a lift for the near future we can reconsider.

Need to bone up on this deferred write concept...

ajwerner commented 1 year ago

I think we need to get the KV team involved in the discussions.

ajwerner commented 7 months ago

This discussion is interesting enough as a historical artifact, but this issue isn't helping anybody at this point.