eivanov89 opened this issue 9 months ago
Hello, I am Blathers. I am here to help you get the issue triaged.
It looks like you have not filled out the issue in the format of any of our templates. To best assist you, we advise you to use one of these templates.
I have CC'd a few people who may be able to assist you:
cc @cockroachdb/cdc
cc @cockroachdb/replication
:owl: Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
Just got another type of error during a 12K run:
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
3596.8s 0 262.6 251.7 60.8 83.9 88.1 100.7 delivery
3596.8s 0 2555.1 2518.0 30.4 44.0 52.4 88.1 newOrder
3596.8s 0 251.6 251.7 8.9 13.1 14.7 21.0 orderStatus
3596.8s 0 2551.1 2517.3 16.8 26.2 31.5 100.7 payment
3596.8s 0 254.6 251.8 16.8 24.1 27.3 39.8 stockLevel
3597.8s 0 247.6 251.7 62.9 96.5 113.2 121.6 delivery
3597.8s 0 2571.5 2518.0 30.4 48.2 60.8 83.9 newOrder
3597.8s 0 228.6 251.7 8.4 13.6 17.8 21.0 orderStatus
3597.8s 0 2567.5 2517.3 16.8 32.5 52.4 71.3 payment
3597.8s 0 234.6 251.8 16.3 24.1 30.4 41.9 stockLevel
3603.6s 0 9.3 251.4 60.8 75.5 15032.4 15032.4 delivery
3603.6s 0 89.4 2514.4 29.4 39.8 10200.5 15032.4 newOrder
3603.6s 0 8.5 251.3 8.4 12.1 14.7 15032.4 orderStatus
3603.6s 0 91.7 2513.7 16.3 23.1 10200.5 15032.4 payment
3603.6s 0 9.4 251.4 15.7 11274.3 15032.4 15032.4 stockLevel
3613.6s 0 37.1 250.7 15032.4 15569.3 15569.3 15569.3 delivery
3613.6s 0 136.1 2507.4 15032.4 15569.3 15569.3 15569.3 newOrder
3613.6s 0 123.7 250.6 11274.3 15032.4 15032.4 15032.4 orderStatus
3613.6s 0 439.1 2506.7 11274.3 15032.4 15032.4 15569.3 payment
3613.6s 0 86.6 250.7 10200.5 11274.3 12348.0 12348.0 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
3613.8s 0 0.0 250.6 0.0 0.0 0.0 0.0 delivery
3613.8s 0 8.3 2507.2 15569.3 15569.3 15569.3 15569.3 newOrder
3613.8s 0 4.2 250.6 5637.1 5637.1 5637.1 5637.1 orderStatus
3613.8s 0 83.2 2506.6 11811.2 12348.0 12348.0 12348.0 payment
3613.8s 0 20.8 250.7 5637.1 11274.3 11274.3 11274.3 stockLevel
3614.8s 0 47.0 250.6 11274.3 16643.0 16643.0 16643.0 delivery
3614.8s 0 193.9 2506.6 11274.3 16106.1 16643.0 16643.0 newOrder
3614.8s 0 328.7 250.6 10737.4 11274.3 11274.3 11811.2 orderStatus
3614.8s 0 509.4 2506.0 11274.3 12884.9 13421.8 15569.3 payment
3614.8s 0 240.9 250.7 10737.4 11274.3 11274.3 11811.2 stockLevel
3615.8s 0 104.8 250.6 11811.2 12348.0 13421.8 13421.8 delivery
3615.8s 0 916.5 2506.1 12348.0 13421.8 14495.5 18253.6 newOrder
3615.8s 0 167.3 250.6 11811.2 12348.0 12348.0 12348.0 orderStatus
3615.8s 0 827.1 2505.5 11811.2 12348.0 12348.0 13958.6 payment
3615.8s 0 182.7 250.7 11274.3 12348.0 12348.0 12348.0 stockLevel
3616.8s 0 102.8 250.5 8589.9 13421.8 13958.6 14495.5 delivery
3616.8s 0 1686.9 2505.9 12884.9 13958.6 15032.4 15569.3 newOrder
3616.8s 0 121.0 250.6 8589.9 12884.9 12884.9 12884.9 orderStatus
3616.8s 0 1551.8 2505.3 12348.0 13421.8 13421.8 14495.5 payment
3616.8s 0 162.3 250.6 8589.9 12884.9 13421.8 13958.6 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
3617.8s 0 75.5 250.5 14495.5 14495.5 14495.5 14495.5 delivery
3617.8s 0 479.8 2505.4 14495.5 14495.5 16643.0 16643.0 newOrder
3617.8s 0 16.7 250.5 9663.7 9663.7 10200.5 10200.5 orderStatus
3617.8s 0 408.1 2504.7 9663.7 14495.5 14495.5 16643.0 payment
3617.8s 0 9.8 250.6 9663.7 14495.5 14495.5 14495.5 stockLevel
3618.8s 0 132.2 250.4 10737.4 15032.4 15569.3 20401.1 delivery
3618.8s 0 1160.2 2505.0 10737.4 15032.4 16106.1 17179.9 newOrder
3618.8s 0 83.4 250.5 10200.5 15032.4 15032.4 15032.4 orderStatus
3618.8s 0 1167.3 2504.3 10200.5 15032.4 15032.4 16643.0 payment
3618.8s 0 76.3 250.5 10200.5 15032.4 15032.4 15032.4 stockLevel
3619.8s 0 2.0 250.4 11274.3 13421.8 13421.8 13421.8 delivery
3619.8s 0 7.1 2504.3 11274.3 15569.3 15569.3 15569.3 newOrder
3619.8s 0 0.0 250.4 0.0 0.0 0.0 0.0 orderStatus
3619.8s 0 8.1 2503.6 10737.4 11274.3 11274.3 11274.3 payment
3619.8s 0 0.0 250.5 0.0 0.0 0.0 0.0 stockLevel
3620.8s 0 0.0 250.3 0.0 0.0 0.0 0.0 delivery
3620.8s 0 0.0 2503.6 0.0 0.0 0.0 0.0 newOrder
3620.8s 0 0.0 250.3 0.0 0.0 0.0 0.0 orderStatus
3620.8s 0 0.0 2502.9 0.0 0.0 0.0 0.0 payment
3620.8s 0 0.0 250.4 0.0 0.0 0.0 0.0 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
3621.8s 0 0.0 250.2 0.0 0.0 0.0 0.0 delivery
3621.8s 0 0.0 2502.9 0.0 0.0 0.0 0.0 newOrder
3621.8s 0 0.0 250.2 0.0 0.0 0.0 0.0 orderStatus
3621.8s 0 0.0 2502.2 0.0 0.0 0.0 0.0 payment
3621.8s 0 0.0 250.3 0.0 0.0 0.0 0.0 stockLevel
3622.8s 0 1.0 250.2 20401.1 20401.1 20401.1 20401.1 delivery
3622.8s 0 3.0 2502.2 21474.8 22548.6 22548.6 22548.6 newOrder
3622.8s 0 0.0 250.2 0.0 0.0 0.0 0.0 orderStatus
3622.8s 0 0.0 2501.6 0.0 0.0 0.0 0.0 payment
3622.8s 0 0.0 250.3 0.0 0.0 0.0 0.0 stockLevel
3623.8s 0 18.0 250.1 16106.1 20401.1 20401.1 20401.1 delivery
3623.8s 0 30.1 2501.5 20401.1 22548.6 23622.3 23622.3 newOrder
3623.8s 0 31.1 250.1 15569.3 20401.1 20401.1 20401.1 orderStatus
3623.8s 0 68.2 2500.9 16106.1 20401.1 20401.1 20401.1 payment
3623.8s 0 18.0 250.2 16106.1 20401.1 20401.1 20401.1 stockLevel
3624.8s 0 45.4 250.0 21474.8 22548.6 26843.5 26843.5 delivery
3624.8s 0 255.9 2500.9 21474.8 22548.6 23622.3 24696.1 newOrder
3624.8s 0 25.7 250.1 16643.0 21474.8 21474.8 21474.8 orderStatus
3624.8s 0 77.1 2500.2 21474.8 21474.8 21474.8 22548.6 payment
3624.8s 0 23.7 250.1 16643.0 21474.8 21474.8 21474.8 stockLevel
Error: error in payment: ERROR: result is ambiguous: error=ba: EndTxn(parallel commit) [/Table/106/1/1728/0], [txn: 3bb86273] RPC error: grpc: error reading from server: read tcp [2a02:6b8:c34:14:0:5a59:eb1f:272e]:37384->[2a02:6b8:c34:14:0:5a59:eb1f:295a]:26259: use of closed network connection [code 14/Unavailable] [propagate] (last error: TransactionRetryError: retry txn (RETRY_SERIALIZABLE): "sql txn" meta={id=3bb86273 key=/Table/106/1/1728/0 iso=Serializable pri=0.00987058 epo=0 ts=1709652515.303041731,0 min=1709652504.989393852,0 seq=5} lock=true stat=PENDING rts=1709652504.989393852,0 wto=false gul=1709652505.489393852,0) (SQLSTATE 40003)
In 23.1.10 we were able to run 12K for 12 hours (and did many iterations). This might be a regression in v23.2.2.
And one more for 12K:
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
3624.8s 0 5.9 250.5 92.3 11274.3 11274.3 11274.3 delivery
3624.8s 0 132.9 2509.1 13421.8 21474.8 21474.8 21474.8 newOrder
3624.8s 0 27.8 250.7 4563.4 9126.8 10200.5 10200.5 orderStatus
3624.8s 0 87.2 2506.2 10200.5 13421.8 21474.8 21474.8 payment
3624.8s 0 22.8 250.7 5100.3 11274.3 11274.3 11274.3 stockLevel
3625.8s 0 234.6 250.5 7247.8 13421.8 13421.8 13421.8 delivery
3625.8s 0 1738.3 2508.9 11811.2 13958.6 21474.8 22548.6 newOrder
3625.8s 0 158.0 250.7 2818.6 9126.8 9663.7 11274.3 orderStatus
3625.8s 0 1269.1 2505.8 7247.8 13421.8 13958.6 22548.6 payment
3625.8s 0 127.0 250.7 3623.9 11274.3 13421.8 13958.6 stockLevel
3626.8s 0 273.2 250.5 11274.3 15032.4 15569.3 15569.3 delivery
3626.8s 0 1524.5 2508.7 12884.9 15569.3 23622.3 23622.3 newOrder
3626.8s 0 232.9 250.7 2684.4 12348.0 12884.9 12884.9 orderStatus
3626.8s 0 1392.4 2505.5 6442.5 15032.4 15569.3 23622.3 payment
3626.8s 0 244.0 250.7 4563.4 12884.9 14495.5 14495.5 stockLevel
3627.8s 0 325.8 250.5 8589.9 15569.3 16106.1 23622.3 delivery
3627.8s 0 2614.5 2508.7 12348.0 16106.1 23622.3 24696.1 newOrder
3627.8s 0 339.6 250.7 3489.7 10200.5 12348.0 13958.6 orderStatus
3627.8s 0 1522.2 2505.3 10200.5 15569.3 16106.1 24696.1 payment
3627.8s 0 327.8 250.7 4563.4 13958.6 15569.3 16643.0 stockLevel
Error: error in payment: ERROR: result is ambiguous: error=ba: EndTxn(parallel commit) [/Table/106/1/9595/0], [txn: 13b9d900] RPC error: grpc: error reading from server: read tcp [2a02:6b8:c34:14:0:5a59:eb1f:2afe]:33976->[2a02:6b8:c34:14:0:5a59:eb1f:272e]:26258: use of closed network connection [code 14/Unavailable] [propagate] (last error: TransactionAbortedError(ABORT_REASON_ABORT_SPAN): "sql txn" meta={id=13b9d900 key=/Table/106/1/9595/0 iso=Serializable pri=0.01199064 epo=0 ts=1709668789.912520706,6 min=1709668789.912520706,6 seq=4} lock=true stat=ABORTED rts=1709668789.912520706,6 wto=false gul=1709668790.412520706,6) (SQLSTATE 40003)
All failures happen at almost the same time: after the 30m ramp plus about 1 hour of execution.
A 10K run doesn't work either:
3629.9s 0 17.2 208.4 17179.9 18253.6 18253.6 18253.6 orderStatus
3629.9s 0 394.3 2083.3 16106.1 20401.1 27917.3 27917.3 payment
3629.9s 0 20.2 208.4 17179.9 19327.4 19327.4 19327.4 stockLevel
Error: error in payment: ERROR: result is ambiguous: error=failed to connect to n5 at ydb-vla-dev04-006.search.yandex.net:26259: operation "conn heartbeat" timed out after 6.134s (given timeout 6s): grpc: context deadline exceeded [code 4/DeadlineExceeded] [propagate] (last error: TransactionRetryError: retry txn (RETRY_SERIALIZABLE): "sql txn" meta={id=0c1857fb key=/Table/106/1/4220/0 iso=Serializable pri=0.02181940 epo=0 ts=1709677455.990413313,22 min=1709677439.787321858,0 seq=5} lock=true stat=PENDING rts=1709677449.113683691,1 wto=false gul=1709677440.287321858,0) (SQLSTATE 40003)
Hi @eivanov89, thank you for filing the detailed report! The symptoms you describe do look vaguely like overload, but the cause of the overload is not obvious, so we should dig in. The pattern you noticed with each failure happening at almost exactly 1 hour of execution is suspicious and is likely related.
I was able to reproduce what you are seeing using the following steps and our roachprod tool:
roachprod create nathan-tpcc -n 13 --gce-machine-type=n2-standard-32
roachprod stage nathan-tpcc release v23.2.2
roachprod start nathan-tpcc:1-12
roachprod adminurl --open nathan-tpcc:1
roachprod run nathan-tpcc:13 -- './cockroach workload fixtures import tpcc --warehouses=13000 {pgurl:1}'
roachprod run nathan-tpcc:13 -- './cockroach workload run tpcc --ramp=30m --duration=720m --warehouses=13000 {pgurl:1-12}'
At the hour mark, I also see a spike in latency and a drop in throughput. In my case, the workload kept running, but the impact is just as clear:
Looking at the SQL dashboard, we see a spike in SQL connections right at this time:
Given the roundness of the timing of this event (exactly 1 hour into the workload) and the spike in SQL connections, I have a strong suspicion that this has something to do with the workload re-establishing all of its SQL connections at the hour mark. It may be due to the pgx driver's MaxConnLifetime setting (pgx is a recent addition to the tpc-c workload), which defaults to 1 hour. That would be similar to the issue fixed by https://github.com/cockroachdb/cockroach/commit/bdf3f6262e380b867aa8884016eae53d28d074a4. It may also be something else.
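For illustration, here is a minimal sketch of the setting in question using the pgx v5 pool API. This is not the workload's actual code, and the connection string is a placeholder:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	// Placeholder URL; sslmode=disable to match the runs above.
	cfg, err := pgxpool.ParseConfig("postgres://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	// pgxpool's default MaxConnLifetime is one hour, so a pool whose
	// connections were all opened together at startup recycles them all
	// together one hour later.
	cfg.MaxConnLifetime = 5 * time.Minute
	// Jitter staggers the expirations so connections are recycled
	// gradually instead of in one synchronized storm.
	cfg.MaxConnLifetimeJitter = 30 * time.Second

	pool, err := pgxpool.NewWithConfig(context.Background(), cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()
	// Run queries with pool.Query / pool.Exec as usual.
}
```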
Either way, this looks like a deficiency in the tpc-c workload and not in cockroachdb itself. I'll try to understand this better tomorrow. Thank you again for flagging.
Can you change the connection max age to 5min and add a 30s connection lifetime jitter to the connection URL? We've chased this client-side artifact before and I wonder if this is the same phenomenon.
Sure. May I ask you to provide the precise setting names and the way to modify them in your tpc-c implementation?
Add the following CLI options to the workload binary: --max-conn-lifetime 5m --max-conn-lifetime-jitter 30s, e.g.,
./cockroach workload run tpcc --ramp=30m --duration=720m --warehouses=13000 --max-conn-lifetime 5m --max-conn-lifetime-jitter 30s $pgurls
These are common connection options shared by all the workloads: https://github.com/cockroachdb/cockroach/blob/7ff7978ba43b4c4b7dec6b55206b84cc9b17ffcc/pkg/workload/connection.go#L47-L54
Thank you for clarifying, Stan. I tried, but the result is the same.
Loaded 15K warehouses:
./cockroach workload fixtures import tpcc --warehouses=15000 'postgres://root@ydb-vla-dev04-000:26257?sslmode=disable'
During the load I reproduced the unrelated issue #110738.
Then started:
./cockroach workload run tpcc --ramp=30m --duration=720m --warehouses=15000 --max-conn-lifetime 5m --max-conn-lifetime-jitter 30s $crdb_addr
...
3598.9s 0 264.5 312.7 167.8 369.1 436.2 570.4 delivery
3598.9s 0 2517.1 3128.8 130.0 402.7 536.9 906.0 newOrder
3598.9s 0 282.6 312.8 9.4 25.2 41.9 52.4 orderStatus
3598.9s 0 2664.3 3128.6 65.0 318.8 419.4 637.5 payment
3598.9s 0 313.6 312.7 19.9 44.0 151.0 570.4 stockLevel
3601.0s 0 5.6 312.6 5100.3 5637.1 5637.1 17179.9 delivery
3601.0s 0 57.6 3127.2 5100.3 17179.9 17179.9 17179.9 newOrder
3601.0s 0 3.3 312.7 7.9 12348.0 13421.8 13421.8 orderStatus
3601.0s 0 54.2 3127.0 5100.3 7516.2 17179.9 18253.6 payment
3601.0s 0 3.3 312.6 18.9 12348.0 13421.8 13421.8 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
3616.3s 0 50.9 311.3 7516.2 18253.6 18253.6 18253.6 delivery
3616.3s 0 876.4 3114.1 17179.9 18253.6 18253.6 18253.6 newOrder
3616.3s 0 260.1 311.4 10200.5 13421.8 13421.8 13421.8 orderStatus
3616.3s 0 373.2 3113.9 7516.2 18253.6 18253.6 18253.6 payment
3616.3s 0 282.7 311.3 12348.0 13421.8 17179.9 17179.9 stockLevel
3616.9s 0 3.2 311.2 18253.6 18253.6 18253.6 18253.6 delivery
3616.9s 0 55.5 3113.5 18253.6 18253.6 18253.6 18253.6 newOrder
3616.9s 0 250.7 311.3 10737.4 13421.8 13958.6 13958.6 orderStatus
3616.9s 0 380.8 3113.4 13421.8 18253.6 18253.6 18253.6 payment
3616.9s 0 269.8 311.2 10737.4 13958.6 13958.6 13958.6 stockLevel
3617.9s 0 170.4 311.2 11811.2 15032.4 19327.4 19327.4 delivery
3617.9s 0 856.1 3112.9 14495.5 18253.6 19327.4 19327.4 newOrder
3617.9s 0 835.0 311.5 11811.2 13958.6 13958.6 15032.4 orderStatus
3617.9s 0 1567.3 3113.0 13958.6 15032.4 15569.3 19327.4 payment
3617.9s 0 683.6 311.3 11811.2 13958.6 14495.5 15032.4 stockLevel
3618.9s 0 225.1 311.1 12884.9 15032.4 19327.4 20401.1 delivery
3618.9s 0 2077.0 3112.6 15032.4 16106.1 16643.0 20401.1 newOrder
3618.9s 0 614.6 311.6 12884.9 15032.4 15032.4 16106.1 orderStatus
3618.9s 0 2004.3 3112.7 14495.5 15032.4 16106.1 16643.0 payment
3618.9s 0 613.6 311.4 12884.9 15032.4 15569.3 16106.1 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
3619.9s 0 335.0 311.1 13421.8 16106.1 16643.0 20401.1 delivery
3619.9s 0 3185.4 3112.6 15569.3 16106.1 17179.9 18253.6 newOrder
3619.9s 0 214.0 311.5 12884.9 15569.3 16106.1 16106.1 orderStatus
3619.9s 0 2850.3 3112.6 15032.4 16106.1 16106.1 17179.9 payment
3619.9s 0 254.0 311.4 13421.8 15569.3 16106.1 16106.1 stockLevel
3620.9s 0 365.7 311.2 14495.5 16643.0 17179.9 21474.8 delivery
3620.9s 0 2614.0 3112.5 13958.6 16643.0 17179.9 22548.6 newOrder
3620.9s 0 122.2 311.5 9663.7 12348.0 15032.4 15032.4 orderStatus
3620.9s 0 2686.1 3112.5 10737.4 16643.0 17179.9 18253.6 payment
3620.9s 0 111.2 311.4 10200.5 15032.4 16643.0 18253.6 stockLevel
3621.9s 0 225.6 311.1 15032.4 17179.9 18253.6 18253.6 delivery
3621.9s 0 668.7 3111.8 10737.4 17179.9 18253.6 19327.4 newOrder
3621.9s 0 40.3 311.4 10737.4 11274.3 12884.9 12884.9 orderStatus
3621.9s 0 960.8 3111.9 10737.4 17179.9 17179.9 19327.4 payment
3621.9s 0 36.3 311.3 8589.9 11274.3 15032.4 15032.4 stockLevel
3622.9s 0 0.0 311.0 0.0 0.0 0.0 0.0 delivery
3622.9s 0 3.0 3111.0 20401.1 21474.8 21474.8 21474.8 newOrder
3622.9s 0 0.0 311.3 0.0 0.0 0.0 0.0 orderStatus
3622.9s 0 1.0 3111.0 19327.4 19327.4 19327.4 19327.4 payment
3622.9s 0 0.0 311.2 0.0 0.0 0.0 0.0 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
3623.9s 0 2.0 311.0 11274.3 25769.8 25769.8 25769.8 delivery
3623.9s 0 46.0 3110.1 21474.8 22548.6 24696.1 24696.1 newOrder
3623.9s 0 5.0 311.2 18253.6 18253.6 18253.6 18253.6 orderStatus
3623.9s 0 29.0 3110.2 11274.3 21474.8 21474.8 21474.8 payment
3623.9s 0 14.0 311.1 15569.3 18253.6 20401.1 20401.1 stockLevel
3624.9s 0 2.0 310.9 11274.3 13958.6 13958.6 13958.6 delivery
3624.9s 0 25.1 3109.3 11274.3 20401.1 21474.8 21474.8 newOrder
3624.9s 0 6.0 311.2 13958.6 18253.6 18253.6 18253.6 orderStatus
3624.9s 0 34.1 3109.3 11274.3 20401.1 20401.1 20401.1 payment
3624.9s 0 1.0 311.0 18253.6 18253.6 18253.6 18253.6 stockLevel
Error: error in newOrder: ERROR: result is ambiguous: error=failed to connect to n5 at ydb-vla-dev04-002.search.yandex.net:26258: operation "conn heartbeat" timed out after 6s (given timeout 6s): grpc: context deadline exceeded [code 4/DeadlineExceeded] [exhausted] (last error: failed to connect to n5 at ydb-vla-dev04-002.search.yandex.net:26258: operation "conn heartbeat" timed out after 6s (given timeout 6s): grpc: context deadline exceeded [code 4/DeadlineExceeded]) (SQLSTATE 40003)
I took a look at this and I ruled out the following things:
- GC TTL on SQL stats: this was changed from 4 hours to 1 hour in 23.2, but reverting it didn't change anything.
- The AUTO SQL STATS job: SQL STATS collection for tpcc.public.order_line was taking a long time (probably expected, as it is the largest table), but disabling it didn't change anything. (See the sketch of these knobs right after this list.)
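For reference, a minimal sketch of toggling those two knobs from the command line. The cluster setting is the standard one for automatic table statistics; pointing the GC TTL change at these particular system tables is my assumption about where that TTL lives:

```
# Disable the automatic statistics collection job.
./cockroach sql --insecure -e "SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false;"

# Restore a 4-hour GC TTL (14400s); the target tables here are an assumption.
./cockroach sql --insecure -e "ALTER TABLE system.statement_statistics CONFIGURE ZONE USING gc.ttlseconds = 14400;"
./cockroach sql --insecure -e "ALTER TABLE system.transaction_statistics CONFIGURE ZONE USING gc.ttlseconds = 14400;"
```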
One thing I was able to narrow in on was that this issue doesn't pop up at the 60-minute mark, but rather at the 90-minute mark. I ran the workload with an arbitrary ramp-up of 17m, and we see the spike 73m after ramp-up, or 90m after workload run was called. Does something running every 90 minutes ring a bell to anyone?
One more thing I tried was to run everything the same, but with only 1000 warehouses. We don't see any spike at all, suggesting that this is maybe an overload issue, as originally mentioned? This checks out, as we also don't see this spike in our tpcc/weekly roachtest, which runs with 5 nodes and 1000 warehouses for 4 days.
No concrete answer from my investigations, but figured I'd share to save others the time of going through the same rabbit holes I did. Happy to help try anything else others can think of too.
We're also seeing a high number of pushed transactions, which appears to correlate precisely with the spike onset, e.g.,
@nvanbenschoten Any idea(s) what the source(s) of contention might be? Is there a vmodule which could help with identifying the sources of contention?
The contention is likely a consequence of the slowdown, not a cause. TPC-C with wait times has low contention under steady state. However, when transactions slow down and the number of concurrent transactions increases, contention can rapidly increase.
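To put rough numbers on that (my own back-of-the-envelope arithmetic, not from the thread): by Little's law, in-flight transactions N = λ × W. At the steady-state newOrder rate of roughly 2,500 ops/s with a p50 around 30ms, N ≈ 2500 × 0.03 ≈ 75 concurrent transactions. Once latency balloons to around 12s at the same arrival rate, N ≈ 2500 × 12 ≈ 30,000, so the chance of transactions colliding on the same rows grows dramatically even though the workload's intrinsic contention is low.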
Hi folks, is there anything else I can do to help you triage the issue?
Hi,
I work on a performance survey. I am trying to execute TPC-C, and it fails when the number of warehouses is above 12K (CPU and disks are not saturated, though). However, I would expect CRDB to handle around 15-20K.
The TPC-C run exits with either this error:
or
Some time ago I reported this issue in Slack, and the conclusion was that CRDB was overloaded. Now I am trying the newer version of CRDB.
Setup.
On each machine I start 4 CRDB instances (v23.2.2), each in its own NUMA-aligned taskset. Here is an example:
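(The example command did not survive the copy; below is my illustrative reconstruction, not the reporter's actual invocation. The CPU range, store path, ports, and join address are placeholder assumptions.)

```
# Hypothetical instance 0 of 4, pinned to the CPUs of NUMA node 0:
taskset -c 0-15 ./cockroach start --insecure \
  --store=/mnt/data0/crdb \
  --listen-addr=:26257 --http-addr=:8080 \
  --join=ydb-vla-dev04-000:26257 --background
```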
TPC-C run.
TPC-C import:
TPC-C run:
In COCKROACH_ADDRS I specify all CRDB instances.
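(The exact import/run commands were also lost in the copy; judging from the commands quoted elsewhere in this thread, they had roughly this shape. The warehouse count and hostnames here are placeholders:)

```
export COCKROACH_ADDRS='postgres://root@ydb-vla-dev04-000:26257?sslmode=disable postgres://root@ydb-vla-dev04-000:26258?sslmode=disable ...'
./cockroach workload fixtures import tpcc --warehouses=12000 'postgres://root@ydb-vla-dev04-000:26257?sslmode=disable'
./cockroach workload run tpcc --ramp=30m --duration=720m --warehouses=12000 $COCKROACH_ADDRS
```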
Note that I don't use the enterprise colocation feature.
At some point performance degrades:
I see the same in the CRDB metrics (note that before the degradation, CPU utilization was just 0.75 and the disks were also underutilized):
Jira issue: CRDB-36392
Epic CRDB-40201