cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

schema: potential 2X index backfill perf regression during `tpce init` #95163

Open msbutler opened 1 year ago

msbutler commented 1 year ago

Describe the problem

When init'ing the tpce/customers=2m workload on 22.2, two large index backfills took a total of 40 hrs to add 10 TB of replicated physical data to the cluster, which is double the amount of time reported in the tpce eval for the same workload/hardware setup on 22.1. Put another way, this 15-node/48-vCPU cluster performed these backfills with a throughput of 5 MB/s/node while no foreground workload was running.

NOTE: during the discussion below, I realized that I left COCKROACH_ROCKSDB_CONCURRENCY at its default of 4, while during Nathan's run with 22.1 he increased this env var to 16. Had I also done this, the regression probably would have been alleviated. That said, this backfill with default settings is still slow as heck. A throughput of 5-10 MB/s/node is about 10 to 20x slower than a comparable restore or import of this size.
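For reference, the 5 MB/s/node figure above is just a back-of-the-envelope calculation (treating 10 TB as 10,000,000 MB; the exact constant doesn't change the conclusion):

10,000,000 MB / (40 hrs * 3,600 s/hr) ≈ 69 MB/s across the cluster
69 MB/s / 15 nodes ≈ 4.6 MB/s/node, i.e. roughly 5 MB/s/node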

To Reproduce

Follow the repro steps at the bottom of the tpce eval doc. During the tpce init cmd, the following indices will be created:

-- took 35 hrs (the particularly problematic one)
CREATE INDEX ON tpce.public.trade (t_s_symb, t_dts ASC) STORING (t_ca_id, t_exec_name, t_is_cash, t_trade_price, t_qty, t_tt_id);

-- took 7 hrs
CREATE INDEX ON tpce.public.trade (t_ca_id, t_dts DESC) STORING (t_st_id, t_tt_id, t_is_cash, t_s_symb, t_qty, t_bid_price, t_exec_name, t_trade_price, t_chrg);
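While the backfills run, a rough way to watch their progress (not part of the eval doc's repro steps; SHOW JOBS column names may vary slightly by version) is something like:

-- sketch: track running schema-change (index backfill) jobs and their progress
SELECT job_id, status, running_status, fraction_completed, description
FROM [SHOW JOBS]
WHERE job_type = 'SCHEMA CHANGE' AND status = 'running';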

Initial Investigation of the first backfill

This investigation leaves me with 3 questions, in order of importance (restated in full in a later comment below):

To quote Sumeer, here are two places to look next:

Jira issue: CRDB-23380

msbutler commented 1 year ago

Here's a link to a debug zip captured while this sawtooth pattern is occurring.

ajwerner commented 1 year ago

This is a real problem, but I don't think Schema is suited to do anything about it. As far as I can see, there's excessive backpressure, which may have something to do with bad balancing of load across nodes. Maybe the allocator should be taking the ingestion rate into consideration and moving replicas away from n12?

@kvoli maybe you have thoughts on that? I don't really know exactly where to direct this next. @dt as the former expert, do you have ideas?

msbutler commented 1 year ago

One thread schema could pursue: during job or SQL processor planning, is SQL work properly distributed across the SQL processors? I.e., in the context of this huge backfill, could the SQL processor on n12 have received an inordinate amount of work? In the accompanying slack thread, there was some evidence that something at the SQL level could be responsible for the misdistribution of work.

Further, as part of this joint investigation, it would be great if someone on schema could lead a walkthrough of the backfill code.
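One low-effort way to probe the distribution question above would be to check whether the table's ranges and leaseholders are skewed toward n12, e.g. (a sketch only; lease_holder and range_size_mb are the columns 22.x's SHOW RANGES reports and may differ in other versions):

-- sketch: per-node leaseholder and data distribution for the table being backfilled
SELECT lease_holder, count(*) AS num_ranges, sum(range_size_mb) AS approx_mb
FROM [SHOW RANGES FROM TABLE tpce.public.trade]
GROUP BY lease_holder
ORDER BY approx_mb DESC;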

msbutler commented 1 year ago

Also, to your question:

Maybe the allocator should be taking the ingestion rate into consideration and it should be moving replicas away from n12

I should note that disk usage across the nodes was evenly distributed throughout the job, which implies replicas were being rebalanced away from n12.
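(For completeness, that can also be checked from SQL; here's a sketch against crdb_internal, which is not a stable API, so the column names below are assumptions that may differ by version:)

-- sketch: rough per-store disk usage, to spot a store filling faster than its peers
SELECT node_id, store_id, used, available, capacity
FROM crdb_internal.kv_store_status
ORDER BY used DESC;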

ajwerner commented 1 year ago

The new schema changer and the old schema changer use the same execution and planning logic for index backfills; the new schema changer invokes a direct port of that logic, which uses the same distsql planner and the same execution plan:

https://github.com/cockroachdb/cockroach/blob/bb1b8a1f26e2cd1f5c91b24ecd7810aa92167dd4/pkg/sql/index_backfiller.go#L153-L206

jbowens commented 1 year ago

"if the call to PreIngestDelay is coming from addSSTablePreApply then the delaying is happening below raft, which could affect a replica even if the overloaded store is only a follower. That may widen the effect of a single overloaded store. There are other things involved like the raft proposalQuota pool etc., and there may be KV metrics to look at whether this is happening (KV folks would know)."

@sumeerbhola do we still need PreIngestDelay now that we have admission control? what would we need to remove it? #86857?

One could potentially speed things up by "setting COCKROACH_ROCKSDB_CONCURRENCY to a much higher value (the default is 4), to increase the number of concurrent compactions."

In your run was the compaction concurrency left at the default? For larger nodes the default compaction concurrency is way too low. Nathan bumped the compaction concurrency in the tpc-e eval: https://docs.google.com/document/d/1wzkBXaA3Ap_daMV1oY1AhQqlnAjO3pIVLZTXY53m0Xk/edit#heading=h.2fazvsji9uxb

msbutler commented 1 year ago

In your run was the compaction concurrency left at the default?

Ah yes, I did leave it at the default. This could explain the perf regression relative to Nathan's index creation.

sumeerbhola commented 1 year ago

@sumeerbhola do we still need PreIngestDelay now that we have admission control? what would we need to remove it? https://github.com/cockroachdb/cockroach/issues/86857?

We don't have admission control protection for followers. @irfansharif is implementing replication admission control (which should land for v23.1) and I would expect that in a future release we would change the default for the cluster settings for this resource-unaware throttle such that it becomes a noop.

jbowens commented 1 year ago

@msbutler should we keep this open, or is it good to close given the compaction concurrency apples vs oranges comparison?

msbutler commented 1 year ago

I'm not sure. There may not be a regression, but there is a larger issue: under default settings, a 22.2 cluster with no foreground workload has extremely slow index backfill performance here. Write throughput of 5 MB/s/node is between 10 and 20x slower than a restore and/or import of the same size. Given the three open questions I posed (see below), I think the issue is probably in schema's or KV's court, and they should decide whether to further investigate or close the issue. I wish I could spend more time investigating, but I don't have the capacity right now.

The bursty workload across all nodes: If one or two stores are throttling AddSSTable requests, why do all nodes seem to get throttled around the same time for the same amount of time? I'd hope that if only one node (e.g. n12) gets throttled, other KV servers could continue to process requests. That doesn't seem to be the case -- they all get blocked, even if their L0 sublevel count is below 20. I wonder if this is due to KV or something in the dist-sql processors.

N12's extra work: Why is n12 receiving extra work? I wonder if this has something to do with the index backfill dist-sql processors.

L0 write rate: Squinting at the ratio of addSSTable.Recv to l0-numfiles, it seems 25% of AddSSTables are sadly written to L0 instead of straight to L6, across all nodes. If we could lower this, the problem would be less acute. Is that ratio a surprise? I don't know.
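For whoever picks this up, one way to sanity-check those ratios without pulling another debug zip is to read the metrics directly from SQL (a sketch; the LIKE patterns below are guesses at the relevant metric names and will need adjusting to whatever the running version actually exports):

-- sketch: pull AddSSTable- and L0-related metrics from each store
SELECT store_id, name, value
FROM crdb_internal.node_metrics
WHERE name LIKE '%addsstable%' OR name LIKE '%l0%'
ORDER BY store_id, name;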

jbowens commented 1 year ago

I'll link this to cockroachdb/pebble#1329, our issue for resource utilization aware compaction concurrency.