cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.9k stars 3.78k forks source link

Performance degradation during TPC-C run #119924

Open eivanov89 opened 6 months ago

eivanov89 commented 6 months ago

Hi,

I work on a performance survey. I try to execute TPC-C and it fails, when number of warehouses is above 12K (CPU and disks are not saturated though). However, I would expect CRDB to handle around 15-20K.

TPC-C run exits with either error:

Error: error in delivery: ERROR: inbox communication error: rpc error: code = Canceled desc = context canceled (SQLSTATE 58C01)

or

Error: error in delivery: ERROR: rpc error: code = Unavailable desc = error reading from server: read tcp [2a02:6b8:c34:14:0:1354:eb1f:2ca6]:37234->[2a02:6b8:c34:14:0:1354:eb1f:2962]:26258: use of closed network connection (SQLSTATE XXUUU)

Some time ago I had reported this issue in Slack and the conclusion was that CRDB was overloaded. Now, I try the newer version of CRDB.

Setup.

On each machine I start 4 CRDB instances (v23.2.2). Each in its own NUMA-aligned taskset. Here is an example:

/opt/cockroach/cockroach start --insecure --listen-addr=:26260 --http-addr=:8083 --join=host:126257,host2:26257,host3:26257 --locality=region=us-west-1 --store /opt/cockroach/data/nvme3n1p2 --cache=37GB --max-sql-memory=37GB

TPC-C run.

TPC-C import:

cockroach workload fixtures import tpcc --warehouses=13000 'postgres://root@$COCKROACH_HOST:26257?sslmode=disable

TPC-C run:

cockroach workload run tpcc --warehouses=$wh --ramp=30m --duration=720m $COCKROACH_ADDRS

in COCKROACH_ADDRS I specify all CRDB instances

Note, that I don't use enterprise feature of collocation.

At some point performance degrades:

 3595.8s        0         2659.5         2727.2     31.5     67.1     92.3    130.0 newOrder
 3595.8s        0          283.6          272.7      8.4     15.2     17.8     26.2 orderStatus
 3595.8s        0         2573.5         2726.6     17.8     48.2     79.7    121.6 payment
 3595.8s        0          256.0          272.6     16.8     24.1     28.3     31.5 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
 3596.8s        0          270.9          272.7     58.7     83.9    100.7    117.4 delivery
 3596.8s        0         2742.8         2727.2     29.4     46.1     58.7     83.9 newOrder
 3596.8s        0          252.8          272.7      8.1     12.6     15.7     22.0 orderStatus
 3596.8s        0         2687.7         2726.6     16.3     29.4     37.7     71.3 payment
 3596.8s        0          261.8          272.6     16.3     24.1     31.5     35.7 stockLevel
 3597.8s        0          289.9          272.7     62.9    104.9    130.0    369.1 delivery
 3597.8s        0         2694.4         2727.2     31.5     54.5     83.9    117.4 newOrder
 3597.8s        0          245.5          272.7      8.9     17.8     30.4     32.5 orderStatus
 3597.8s        0         2718.6         2726.6     16.8     31.5     52.4    335.5 payment
 3597.8s        0          278.8          272.6     16.8     26.2     30.4     31.5 stockLevel
 3603.9s        0            8.4          272.2     62.9    121.6  15032.4  15032.4 delivery
 3603.9s        0           93.1         2723.0     33.6     79.7  15032.4  15032.4 newOrder
 3603.9s        0            8.2          272.3      8.1     15.2     21.0  15032.4 orderStatus
 3603.9s        0           89.8         2722.4     18.9     60.8   5637.1  15032.4 payment
 3603.9s        0            9.7          272.2     17.8     32.5  15032.4  15032.4 stockLevel
 3613.4s        0            8.0          271.5  15032.4  15569.3  15569.3  15569.3 delivery
 3613.4s        0          120.5         2715.8  15032.4  15569.3  15569.3  15569.3 newOrder
 3613.4s        0           44.2          271.6  11811.2  15032.4  15032.4  15032.4 orderStatus
 3613.4s        0          144.6         2715.3  15032.4  15569.3  15569.3  15569.3 payment
 3613.4s        0           56.2          271.5  11811.2  12348.0  15032.4  15032.4 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
 3613.8s        0            0.0          271.5      0.0      0.0      0.0      0.0 delivery
 3613.8s        0           32.0         2715.5  15569.3  15569.3  15569.3  15569.3 newOrder
 3613.8s        0           12.8          271.5  12348.0  12348.0  12348.0  12348.0 orderStatus
 3613.8s        0           51.2         2714.9  12348.0  15569.3  15569.3  15569.3 payment
 3613.8s        0           19.2          271.5  12348.0  13421.8  13421.8  13421.8 stockLevel
 3614.8s        0           25.6          271.4  11274.3  13958.6  16643.0  16643.0 delivery
 3614.8s        0          284.5         2714.8  11274.3  16106.1  16643.0  16643.0 newOrder
 3614.8s        0          237.2          271.5  10737.4  11274.3  11274.3  11811.2 orderStatus
 3614.8s        0          771.7         2714.4  11274.3  12884.9  13958.6  16106.1 payment
 3614.8s        0          174.2          271.4  10737.4  11274.3  11274.3  11274.3 stockLevel
 3615.8s        0           84.1          271.4  11811.2  12348.0  17179.9  18253.6 delivery
 3615.8s        0         1726.1         2714.7  11811.2  12348.0  13421.8  18253.6 newOrder
 3615.8s        0          152.8          271.5  11274.3  12348.0  12348.0  12348.0 orderStatus
 3615.8s        0          959.6         2714.0  11811.2  12348.0  12348.0  15032.4 payment
 3615.8s        0          143.3          271.4  11811.2  12348.0  12348.0  12348.0 stockLevel
 3616.8s        0           91.7          271.3  12348.0  12884.9  15032.4  18253.6 delivery
 3616.8s        0         3171.8         2714.6  12348.0  13421.8  13421.8  18253.6 newOrder
 3616.8s        0          118.5          271.5   7516.2  12884.9  12884.9  14495.5 orderStatus
 3616.8s        0         1673.8         2713.6  12348.0  13421.8  13421.8  14495.5 payment
 3616.8s        0          155.4          271.4  12348.0  12884.9  13421.8  14495.5 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
 3617.8s        0          140.2          271.3  13421.8  15569.3  19327.4  19327.4 delivery
 3617.8s        0         2473.2         2714.6  13421.8  14495.5  15569.3  20401.1 newOrder
 3617.8s        0           90.4          271.4   4563.4  13958.6  13958.6  13958.6 orderStatus
 3617.8s        0         2613.4         2713.6  13421.8  13958.6  14495.5  17179.9 payment
 3617.8s        0          150.4          271.3   9126.8  13958.6  13958.6  15569.3 stockLevel
 3618.8s        0           81.4          271.2   8053.1  15032.4  20401.1  20401.1 delivery
 3618.8s        0         2750.7         2714.6  14495.5  15032.4  15569.3  20401.1 newOrder
 3618.8s        0          139.9          271.4   5368.7  14495.5  14495.5  14495.5 orderStatus
 3618.8s        0         1015.1         2713.2  14495.5  15032.4  15032.4  15569.3 payment
 3618.8s        0          135.0          271.3   5368.7  14495.5  15032.4  15032.4 stockLevel
 3619.8s        0          113.5          271.2   6710.9  16106.1  18253.6  21474.8 delivery
 3619.8s        0         4704.6         2715.2  15569.3  16106.1  16106.1  22548.6 newOrder
 3619.8s        0          174.7          271.3   6442.5   6979.3   6979.3   6979.3 orderStatus
 3619.8s        0         1086.8         2712.7   6979.3  16106.1  16106.1  17179.9 payment
 3619.8s        0          197.4          271.3   6442.5   6979.3  16106.1  16106.1 stockLevel
 3620.8s        0          248.4          271.2   7516.2  17179.9  17179.9  19327.4 delivery
 3620.8s        0         4190.7         2715.6   7784.6  17179.9  17179.9  22548.6 newOrder
 3620.8s        0          258.4          271.3   7516.2   7784.6   7784.6   8053.1 orderStatus
 3620.8s        0         1339.7         2712.3   7516.2  16643.0  17179.9  20401.1 payment
 3620.8s        0          301.9          271.3   7516.2   7784.6  10200.5  16106.1 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
 3621.8s        0          463.2          271.2   8321.5   9126.8  18253.6  19327.4 delivery
 3621.8s        0         3801.5         2715.9   8321.5  15569.3  18253.6  19327.4 newOrder
 3621.8s        0          234.6          271.3   6710.9   9126.8   9126.8   9126.8 orderStatus
 3621.8s        0         2170.3         2712.2   8589.9   9126.8  18253.6  18253.6 payment
 3621.8s        0          239.5          271.3   7784.6   9126.8   9126.8  17179.9 stockLevel
 3622.8s        0           25.9          271.2   9126.8   9126.8   9126.8   9126.8 delivery
 3622.8s        0          562.5         2715.3   9126.8   9126.8  18253.6  18253.6 newOrder
 3622.8s        0            3.2          271.3   9126.8   9126.8   9126.8   9126.8 orderStatus
 3622.8s        0          493.5         2711.6   9126.8   9126.8   9126.8  18253.6 payment
 3622.8s        0           18.3          271.2   9126.8   9126.8   9126.8   9126.8 stockLevel
 3623.8s        0            0.0          271.1      0.0      0.0      0.0      0.0 delivery
 3623.8s        0           10.0         2714.5  10737.4  10737.4  10737.4  10737.4 newOrder
 3623.8s        0            0.0          271.2      0.0      0.0      0.0      0.0 orderStatus
 3623.8s        0            0.0         2710.8      0.0      0.0      0.0      0.0 payment
 3623.8s        0            0.0          271.1      0.0      0.0      0.0      0.0 stockLevel
 3624.8s        0            1.0          271.0   8321.5   8321.5   8321.5   8321.5 delivery
 3624.8s        0            5.0         2713.8  11811.2  21474.8  21474.8  21474.8 newOrder
 3624.8s        0            0.0          271.1      0.0      0.0      0.0      0.0 orderStatus
 3624.8s        0            2.0         2710.1   9126.8  11274.3  11274.3  11274.3 payment
 3624.8s        0            0.0          271.1      0.0      0.0      0.0      0.0 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
 3625.8s        0            0.0          270.9      0.0      0.0      0.0      0.0 delivery
 3625.8s        0           16.0         2713.0  12884.9  12884.9  12884.9  12884.9 newOrder
 3625.8s        0            0.0          271.0      0.0      0.0      0.0      0.0 orderStatus
 3625.8s        0           26.9         2709.3  12348.0  12884.9  12884.9  12884.9 payment
 3625.8s        0            0.0          271.0      0.0      0.0      0.0      0.0 stockLevel
 3626.8s        0            1.0          270.9  12884.9  12884.9  12884.9  12884.9 delivery
 3626.8s        0           29.7         2712.3  13421.8  13958.6  23622.3  23622.3 newOrder
 3626.8s        0            6.9          271.0  13958.6  23622.3  23622.3  23622.3 orderStatus
 3626.8s        0           20.8         2708.6  12884.9  13958.6  16643.0  16643.0 payment
 3626.8s        0            2.0          270.9  11811.2  13958.6  13958.6  13958.6 stockLevel
 3627.8s        0           73.9          270.8  14495.5  24696.1  24696.1  24696.1 delivery
 3627.8s        0         2140.9         2712.2  14495.5  24696.1  24696.1  25769.8 newOrder
 3627.8s        0          227.5          271.0  14495.5  24696.1  24696.1  24696.1 orderStatus
 3627.8s        0         1043.6         2708.1  13958.6  17179.9  24696.1  24696.1 payment
 3627.8s        0          119.0          270.9  12884.9  15032.4  23622.3  24696.1 stockLevel
 3628.8s        0          132.9          270.8  15569.3  24696.1  24696.1  25769.8 delivery
 3628.8s        0         2796.4         2712.2  15569.3  25769.8  25769.8  30064.8 newOrder
 3628.8s        0          305.1          271.0  13958.6  24696.1  24696.1  25769.8 orderStatus
 3628.8s        0          787.5         2707.6  15032.4  18253.6  24696.1  25769.8 payment
 3628.8s        0          199.4          270.9  13958.6  16106.1  16106.1  20401.1 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
 3629.8s        0          230.7          270.8  15032.4  19327.4  25769.8  25769.8 delivery
 3629.8s        0         2753.3         2712.2  16643.0  25769.8  26843.5  27917.3 newOrder
 3629.8s        0          261.4          271.0  14495.5  16106.1  16643.0  16643.0 orderStatus
 3629.8s        0         1976.1         2707.4  16643.0  25769.8  26843.5  26843.5 payment
 3629.8s        0          249.5          270.8  14495.5  16106.1  16643.0  19327.4 stockLevel
 3630.8s        0          166.6          270.7  15569.3  18253.6  18253.6  26843.5 delivery
 3630.8s        0         3558.4         2712.4  17179.9  26843.5  26843.5  27917.3 newOrder
 3630.8s        0          156.5          270.9  15032.4  18253.6  18253.6  22548.6 orderStatus
 3630.8s        0         1684.3         2707.1  14495.5  26843.5  26843.5  27917.3 payment
 3630.8s        0          195.9          270.8  15032.4  18253.6  18253.6  18253.6 stockLevel
 3631.8s        0          300.6          270.7  16106.1  18253.6  19327.4  19327.4 delivery
 3631.8s        0         3974.6         2712.8  18253.6  27917.3  27917.3  28991.0 newOrder
 3631.8s        0          320.9          270.9  11811.2  16643.0  19327.4  19327.4 orderStatus
 3631.8s        0         2596.6         2707.1  15032.4  18253.6  19327.4  28991.0 payment
 3631.8s        0          289.4          270.8  12884.9  16643.0  18253.6  22548.6 stockLevel
 3632.8s        0           23.8          270.7  20401.1  28991.0  28991.0  28991.0 delivery
 3632.8s        0           25.9         2712.0  16643.0  19327.4  19327.4  19327.4 newOrder
 3632.8s        0            4.1          270.9   6979.3  11274.3  11274.3  11274.3 orderStatus
 3632.8s        0           36.3         2706.3   6979.3  16643.0  21474.8  21474.8 payment
 3632.8s        0           45.6          270.8  20401.1  28991.0  28991.0  28991.0 stockLevel
Error: error in stockLevel: ERROR: inbox communication error: rpc error: code = Canceled desc = context canceled (SQLSTATE 58C01)

The same I see in the CRDB metrics (note, that before the degradation CPU utilization was just 0.75 and disk is also underutilized):

2024-03-05_01_crdb_sql_statements 2024-03-05_00_raft_log 2024-03-05_00_crdb_queues 2024-03-05_14_distributed 2024-03-05_13_replication 2024-03-05_12_replication 2024-03-05_11_log_commit 2024-03-05_10_distributed_errors 2024-03-05_09_kv_latency 2024-03-05_08_clock_offset 2024-03-05_07_goroutines 2024-03-05_06_disk_write 2024-03-05_05_disk_read 2024-03-05_04_crdb_memory 2024-03-05_03_crdb_cpu_percent 2024-03-05_02_crdb_sql_statements

Jira issue: CRDB-36392

Epic CRDB-40201

blathers-crl[bot] commented 6 months ago

Hello, I am Blathers. I am here to help you get the issue triaged.

It looks like you have not filled out the issue in the format of any of our templates. To best assist you, we advise you to use one of these templates.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] commented 6 months ago

cc @cockroachdb/cdc

blathers-crl[bot] commented 6 months ago

cc @cockroachdb/replication

eivanov89 commented 6 months ago

Just got another type of error during 12K run:

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
 3596.8s        0          262.6          251.7     60.8     83.9     88.1    100.7 delivery
 3596.8s        0         2555.1         2518.0     30.4     44.0     52.4     88.1 newOrder
 3596.8s        0          251.6          251.7      8.9     13.1     14.7     21.0 orderStatus
 3596.8s        0         2551.1         2517.3     16.8     26.2     31.5    100.7 payment
 3596.8s        0          254.6          251.8     16.8     24.1     27.3     39.8 stockLevel
 3597.8s        0          247.6          251.7     62.9     96.5    113.2    121.6 delivery
 3597.8s        0         2571.5         2518.0     30.4     48.2     60.8     83.9 newOrder
 3597.8s        0          228.6          251.7      8.4     13.6     17.8     21.0 orderStatus
 3597.8s        0         2567.5         2517.3     16.8     32.5     52.4     71.3 payment
 3597.8s        0          234.6          251.8     16.3     24.1     30.4     41.9 stockLevel
 3603.6s        0            9.3          251.4     60.8     75.5  15032.4  15032.4 delivery
 3603.6s        0           89.4         2514.4     29.4     39.8  10200.5  15032.4 newOrder
 3603.6s        0            8.5          251.3      8.4     12.1     14.7  15032.4 orderStatus
 3603.6s        0           91.7         2513.7     16.3     23.1  10200.5  15032.4 payment
 3603.6s        0            9.4          251.4     15.7  11274.3  15032.4  15032.4 stockLevel
 3613.6s        0           37.1          250.7  15032.4  15569.3  15569.3  15569.3 delivery
 3613.6s        0          136.1         2507.4  15032.4  15569.3  15569.3  15569.3 newOrder
 3613.6s        0          123.7          250.6  11274.3  15032.4  15032.4  15032.4 orderStatus
 3613.6s        0          439.1         2506.7  11274.3  15032.4  15032.4  15569.3 payment
 3613.6s        0           86.6          250.7  10200.5  11274.3  12348.0  12348.0 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
 3613.8s        0            0.0          250.6      0.0      0.0      0.0      0.0 delivery
 3613.8s        0            8.3         2507.2  15569.3  15569.3  15569.3  15569.3 newOrder
 3613.8s        0            4.2          250.6   5637.1   5637.1   5637.1   5637.1 orderStatus
 3613.8s        0           83.2         2506.6  11811.2  12348.0  12348.0  12348.0 payment
 3613.8s        0           20.8          250.7   5637.1  11274.3  11274.3  11274.3 stockLevel
 3614.8s        0           47.0          250.6  11274.3  16643.0  16643.0  16643.0 delivery
 3614.8s        0          193.9         2506.6  11274.3  16106.1  16643.0  16643.0 newOrder
 3614.8s        0          328.7          250.6  10737.4  11274.3  11274.3  11811.2 orderStatus
 3614.8s        0          509.4         2506.0  11274.3  12884.9  13421.8  15569.3 payment
 3614.8s        0          240.9          250.7  10737.4  11274.3  11274.3  11811.2 stockLevel
 3615.8s        0          104.8          250.6  11811.2  12348.0  13421.8  13421.8 delivery
 3615.8s        0          916.5         2506.1  12348.0  13421.8  14495.5  18253.6 newOrder
 3615.8s        0          167.3          250.6  11811.2  12348.0  12348.0  12348.0 orderStatus
 3615.8s        0          827.1         2505.5  11811.2  12348.0  12348.0  13958.6 payment
 3615.8s        0          182.7          250.7  11274.3  12348.0  12348.0  12348.0 stockLevel
 3616.8s        0          102.8          250.5   8589.9  13421.8  13958.6  14495.5 delivery
 3616.8s        0         1686.9         2505.9  12884.9  13958.6  15032.4  15569.3 newOrder
 3616.8s        0          121.0          250.6   8589.9  12884.9  12884.9  12884.9 orderStatus
 3616.8s        0         1551.8         2505.3  12348.0  13421.8  13421.8  14495.5 payment
 3616.8s        0          162.3          250.6   8589.9  12884.9  13421.8  13958.6 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
 3617.8s        0           75.5          250.5  14495.5  14495.5  14495.5  14495.5 delivery
 3617.8s        0          479.8         2505.4  14495.5  14495.5  16643.0  16643.0 newOrder
 3617.8s        0           16.7          250.5   9663.7   9663.7  10200.5  10200.5 orderStatus
 3617.8s        0          408.1         2504.7   9663.7  14495.5  14495.5  16643.0 payment
 3617.8s        0            9.8          250.6   9663.7  14495.5  14495.5  14495.5 stockLevel
 3618.8s        0          132.2          250.4  10737.4  15032.4  15569.3  20401.1 delivery
 3618.8s        0         1160.2         2505.0  10737.4  15032.4  16106.1  17179.9 newOrder
 3618.8s        0           83.4          250.5  10200.5  15032.4  15032.4  15032.4 orderStatus
 3618.8s        0         1167.3         2504.3  10200.5  15032.4  15032.4  16643.0 payment
 3618.8s        0           76.3          250.5  10200.5  15032.4  15032.4  15032.4 stockLevel
 3619.8s        0            2.0          250.4  11274.3  13421.8  13421.8  13421.8 delivery
 3619.8s        0            7.1         2504.3  11274.3  15569.3  15569.3  15569.3 newOrder
 3619.8s        0            0.0          250.4      0.0      0.0      0.0      0.0 orderStatus
 3619.8s        0            8.1         2503.6  10737.4  11274.3  11274.3  11274.3 payment
 3619.8s        0            0.0          250.5      0.0      0.0      0.0      0.0 stockLevel
 3620.8s        0            0.0          250.3      0.0      0.0      0.0      0.0 delivery
 3620.8s        0            0.0         2503.6      0.0      0.0      0.0      0.0 newOrder
 3620.8s        0            0.0          250.3      0.0      0.0      0.0      0.0 orderStatus
 3620.8s        0            0.0         2502.9      0.0      0.0      0.0      0.0 payment
 3620.8s        0            0.0          250.4      0.0      0.0      0.0      0.0 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
 3621.8s        0            0.0          250.2      0.0      0.0      0.0      0.0 delivery
 3621.8s        0            0.0         2502.9      0.0      0.0      0.0      0.0 newOrder
 3621.8s        0            0.0          250.2      0.0      0.0      0.0      0.0 orderStatus
 3621.8s        0            0.0         2502.2      0.0      0.0      0.0      0.0 payment
 3621.8s        0            0.0          250.3      0.0      0.0      0.0      0.0 stockLevel
 3622.8s        0            1.0          250.2  20401.1  20401.1  20401.1  20401.1 delivery
 3622.8s        0            3.0         2502.2  21474.8  22548.6  22548.6  22548.6 newOrder
 3622.8s        0            0.0          250.2      0.0      0.0      0.0      0.0 orderStatus
 3622.8s        0            0.0         2501.6      0.0      0.0      0.0      0.0 payment
 3622.8s        0            0.0          250.3      0.0      0.0      0.0      0.0 stockLevel
 3623.8s        0           18.0          250.1  16106.1  20401.1  20401.1  20401.1 delivery
 3623.8s        0           30.1         2501.5  20401.1  22548.6  23622.3  23622.3 newOrder
 3623.8s        0           31.1          250.1  15569.3  20401.1  20401.1  20401.1 orderStatus
 3623.8s        0           68.2         2500.9  16106.1  20401.1  20401.1  20401.1 payment
 3623.8s        0           18.0          250.2  16106.1  20401.1  20401.1  20401.1 stockLevel
 3624.8s        0           45.4          250.0  21474.8  22548.6  26843.5  26843.5 delivery
 3624.8s        0          255.9         2500.9  21474.8  22548.6  23622.3  24696.1 newOrder
 3624.8s        0           25.7          250.1  16643.0  21474.8  21474.8  21474.8 orderStatus
 3624.8s        0           77.1         2500.2  21474.8  21474.8  21474.8  22548.6 payment
 3624.8s        0           23.7          250.1  16643.0  21474.8  21474.8  21474.8 stockLevel
Error: error in payment: ERROR: result is ambiguous: error=ba: EndTxn(parallel commit) [/Table/106/1/1728/0], [txn: 3bb86273] RPC error: grpc: error reading from server: read tcp [2a02:6b8:c34:14:0:5a59:eb1f:272e]:37384->[2a02:6b8:c34:14:0:5a59:eb1f:295a]:26259: use of closed network connection [code 14/Unavailable] [propagate] (last error: TransactionRetryError: retry txn (RETRY_SERIALIZABLE): "sql txn" meta={id=3bb86273 key=/Table/106/1/1728/0 iso=Serializable pri=0.00987058 epo=0 ts=1709652515.303041731,0 min=1709652504.989393852,0 seq=5} lock=true stat=PENDING rts=1709652504.989393852,0 wto=false gul=1709652505.489393852,0) (SQLSTATE 40003)

In 23.1.10 we were able to run 12K during 12 hours (and did many iterations). Might be a regression in v23.2.2.

eivanov89 commented 6 months ago

And one more for 12K:

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
 3624.8s        0            5.9          250.5     92.3  11274.3  11274.3  11274.3 delivery
 3624.8s        0          132.9         2509.1  13421.8  21474.8  21474.8  21474.8 newOrder
 3624.8s        0           27.8          250.7   4563.4   9126.8  10200.5  10200.5 orderStatus
 3624.8s        0           87.2         2506.2  10200.5  13421.8  21474.8  21474.8 payment
 3624.8s        0           22.8          250.7   5100.3  11274.3  11274.3  11274.3 stockLevel
 3625.8s        0          234.6          250.5   7247.8  13421.8  13421.8  13421.8 delivery
 3625.8s        0         1738.3         2508.9  11811.2  13958.6  21474.8  22548.6 newOrder
 3625.8s        0          158.0          250.7   2818.6   9126.8   9663.7  11274.3 orderStatus
 3625.8s        0         1269.1         2505.8   7247.8  13421.8  13958.6  22548.6 payment
 3625.8s        0          127.0          250.7   3623.9  11274.3  13421.8  13958.6 stockLevel
 3626.8s        0          273.2          250.5  11274.3  15032.4  15569.3  15569.3 delivery
 3626.8s        0         1524.5         2508.7  12884.9  15569.3  23622.3  23622.3 newOrder
 3626.8s        0          232.9          250.7   2684.4  12348.0  12884.9  12884.9 orderStatus
 3626.8s        0         1392.4         2505.5   6442.5  15032.4  15569.3  23622.3 payment
 3626.8s        0          244.0          250.7   4563.4  12884.9  14495.5  14495.5 stockLevel
 3627.8s        0          325.8          250.5   8589.9  15569.3  16106.1  23622.3 delivery
 3627.8s        0         2614.5         2508.7  12348.0  16106.1  23622.3  24696.1 newOrder
 3627.8s        0          339.6          250.7   3489.7  10200.5  12348.0  13958.6 orderStatus
 3627.8s        0         1522.2         2505.3  10200.5  15569.3  16106.1  24696.1 payment
 3627.8s        0          327.8          250.7   4563.4  13958.6  15569.3  16643.0 stockLevel
Error: error in payment: ERROR: result is ambiguous: error=ba: EndTxn(parallel commit) [/Table/106/1/9595/0], [txn: 13b9d900] RPC error: grpc: error reading from server: read tcp [2a02:6b8:c34:14:0:5a59:eb1f:2afe]:33976->[2a02:6b8:c34:14:0:5a59:eb1f:272e]:26258: use of closed network connection [code 14/Unavailable] [propagate] (last error: TransactionAbortedError(ABORT_REASON_ABORT_SPAN): "sql txn" meta={id=13b9d900 key=/Table/106/1/9595/0 iso=Serializable pri=0.01199064 epo=0 ts=1709668789.912520706,6 min=1709668789.912520706,6 seq=4} lock=true stat=ABORTED rts=1709668789.912520706,6 wto=false gul=1709668790.412520706,6) (SQLSTATE 40003)

All happen almost same time after 30M of rump and 1 hour of execution.

eivanov89 commented 6 months ago

Neither 10K works:

 3629.9s        0           17.2          208.4  17179.9  18253.6  18253.6  18253.6 orderStatus
 3629.9s        0          394.3         2083.3  16106.1  20401.1  27917.3  27917.3 payment
 3629.9s        0           20.2          208.4  17179.9  19327.4  19327.4  19327.4 stockLevel
Error: error in payment: ERROR: result is ambiguous: error=failed to connect to n5 at ydb-vla-dev04-006.search.yandex.net:26259: operation "conn heartbeat" timed out after 6.134s (given timeout 6s): grpc: context deadline exceeded [code 4/DeadlineExceeded] [propagate] (last error: TransactionRetryError: retry txn (RETRY_SERIALIZABLE): "sql txn" meta={id=0c1857fb key=/Table/106/1/4220/0 iso=Serializable pri=0.02181940 epo=0 ts=1709677455.990413313,22 min=1709677439.787321858,0 seq=5} lock=true stat=PENDING rts=1709677449.113683691,1 wto=false gul=1709677440.287321858,0) (SQLSTATE 40003)
nvanbenschoten commented 6 months ago

Hi @eivanov89, thank you for filing the detailed report! The symptoms you describe do look vaguely like overload, but the cause of the overload is not obvious, so we should dig in. The pattern you noticed with each failure happening at almost exactly 1 hour of execution is suspicious and is likely related.


I was able to reproduce what you are seeing using the following steps and our roachprod tool:

roachprod create nathan-tpcc -n 13 --gce-machine-type=n2-standard-32
roachprod stage nathan-tpcc release v23.2.2
roachprod start nathan-tpcc:1-12
roachprod adminurl --open nathan-tpcc:1
roachprod run nathan-tpcc:13 -- './cockroach workload fixtures import tpcc --warehouses=13000 {pgurl:1}'
roachprod run nathan-tpcc:13 -- './cockroach workload run tpcc --ramp=30m --duration=720m --warehouses=13000 {pgurl:1-12}'

At the hour mark, I also see a spike in latency and a drop in throughput. In my case, the workload kept running, but the impact is just as clear:

Screenshot 2024-03-22 at 2 54 56 AM

Looking at the SQL dashboard, we see a spike in SQL connections right at this time:

Screenshot 2024-03-22 at 2 55 48 AM

Given the roundness of the timing of this event (exactly 1 hour into the workload) and the spike in SQL connections, I have a strong suspicion that this has something to do with the workload re-establishing all SQL connections at the hour. It may be due to the pgx driver's (a recent addition to the tpc-c workload) MaxConnLifetime setting, which defaults to 1 hour. Something similar to the issue fixed by https://github.com/cockroachdb/cockroach/commit/bdf3f6262e380b867aa8884016eae53d28d074a4. It may also be something else.

Either way, this looks like a deficiency in the tpc-c workload and not in cockroachdb itself. I'll try to understand better tomorrow. Thank you again for flagging.

sean- commented 6 months ago

Can you change the connection max age to 5min and add a 30s connection lifetime jitter to the connection URL? We've chased this client-side artifact before and I wonder if this is the same phenomena.

eivanov89 commented 5 months ago

Can you change the connection max age to 5min and add a 30s connection lifetime jitter to the connection URL? We've chased this client-side artifact before and I wonder if this is the same phenomena.

Sure. May I ask you to provide the precise setting names and the way to modify them in your tpc-c impl?

srosenberg commented 5 months ago

Sure. May I ask you to provide the precise setting names and the way to modify them in your tpc-c impl?

Add the following CLI options to the workload binary, --max-conn-lifetime 5m --max-conn-lifetime-jitter 30s, e.g.,

./cockroach workload run tpcc --ramp=30m --duration=720m --warehouses=13000 --max-conn-lifetime 5m --max-conn-lifetime-jitter 30s $pgurls

These are common connection options shared by all the workloads: https://github.com/cockroachdb/cockroach/blob/7ff7978ba43b4c4b7dec6b55206b84cc9b17ffcc/pkg/workload/connection.go#L47-L54

eivanov89 commented 5 months ago

Thank you for clarifying, Stan. I tried, but the result is the same.

Loaded 15K warehouses

./cockroach workload fixtures import tpcc --warehouses=15000 'postgres://root@ydb-vla-dev04-000:26257?sslmode=disable

During the load I reproduced unrelated issue #110738.

Then started:

./cockroach workload run tpcc --ramp=30m --duration=720m --warehouses=15000 --max-conn-lifetime 5m --max-conn-lifetime-jitter 30s $crdb_addr
...
 3598.9s        0          264.5          312.7    167.8    369.1    436.2    570.4 delivery
 3598.9s        0         2517.1         3128.8    130.0    402.7    536.9    906.0 newOrder
 3598.9s        0          282.6          312.8      9.4     25.2     41.9     52.4 orderStatus
 3598.9s        0         2664.3         3128.6     65.0    318.8    419.4    637.5 payment
 3598.9s        0          313.6          312.7     19.9     44.0    151.0    570.4 stockLevel
 3601.0s        0            5.6          312.6   5100.3   5637.1   5637.1  17179.9 delivery
 3601.0s        0           57.6         3127.2   5100.3  17179.9  17179.9  17179.9 newOrder
 3601.0s        0            3.3          312.7      7.9  12348.0  13421.8  13421.8 orderStatus
 3601.0s        0           54.2         3127.0   5100.3   7516.2  17179.9  18253.6 payment
 3601.0s        0            3.3          312.6     18.9  12348.0  13421.8  13421.8 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
 3616.3s        0           50.9          311.3   7516.2  18253.6  18253.6  18253.6 delivery
 3616.3s        0          876.4         3114.1  17179.9  18253.6  18253.6  18253.6 newOrder
 3616.3s        0          260.1          311.4  10200.5  13421.8  13421.8  13421.8 orderStatus
 3616.3s        0          373.2         3113.9   7516.2  18253.6  18253.6  18253.6 payment
 3616.3s        0          282.7          311.3  12348.0  13421.8  17179.9  17179.9 stockLevel
 3616.9s        0            3.2          311.2  18253.6  18253.6  18253.6  18253.6 delivery
 3616.9s        0           55.5         3113.5  18253.6  18253.6  18253.6  18253.6 newOrder
 3616.9s        0          250.7          311.3  10737.4  13421.8  13958.6  13958.6 orderStatus
 3616.9s        0          380.8         3113.4  13421.8  18253.6  18253.6  18253.6 payment
 3616.9s        0          269.8          311.2  10737.4  13958.6  13958.6  13958.6 stockLevel
 3617.9s        0          170.4          311.2  11811.2  15032.4  19327.4  19327.4 delivery
 3617.9s        0          856.1         3112.9  14495.5  18253.6  19327.4  19327.4 newOrder
 3617.9s        0          835.0          311.5  11811.2  13958.6  13958.6  15032.4 orderStatus
 3617.9s        0         1567.3         3113.0  13958.6  15032.4  15569.3  19327.4 payment
 3617.9s        0          683.6          311.3  11811.2  13958.6  14495.5  15032.4 stockLevel
 3618.9s        0          225.1          311.1  12884.9  15032.4  19327.4  20401.1 delivery
 3618.9s        0         2077.0         3112.6  15032.4  16106.1  16643.0  20401.1 newOrder
 3618.9s        0          614.6          311.6  12884.9  15032.4  15032.4  16106.1 orderStatus
 3618.9s        0         2004.3         3112.7  14495.5  15032.4  16106.1  16643.0 payment
 3618.9s        0          613.6          311.4  12884.9  15032.4  15569.3  16106.1 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
 3619.9s        0          335.0          311.1  13421.8  16106.1  16643.0  20401.1 delivery
 3619.9s        0         3185.4         3112.6  15569.3  16106.1  17179.9  18253.6 newOrder
 3619.9s        0          214.0          311.5  12884.9  15569.3  16106.1  16106.1 orderStatus
 3619.9s        0         2850.3         3112.6  15032.4  16106.1  16106.1  17179.9 payment
 3619.9s        0          254.0          311.4  13421.8  15569.3  16106.1  16106.1 stockLevel
 3620.9s        0          365.7          311.2  14495.5  16643.0  17179.9  21474.8 delivery
 3620.9s        0         2614.0         3112.5  13958.6  16643.0  17179.9  22548.6 newOrder
 3620.9s        0          122.2          311.5   9663.7  12348.0  15032.4  15032.4 orderStatus
 3620.9s        0         2686.1         3112.5  10737.4  16643.0  17179.9  18253.6 payment
 3620.9s        0          111.2          311.4  10200.5  15032.4  16643.0  18253.6 stockLevel
 3621.9s        0          225.6          311.1  15032.4  17179.9  18253.6  18253.6 delivery
 3621.9s        0          668.7         3111.8  10737.4  17179.9  18253.6  19327.4 newOrder
 3621.9s        0           40.3          311.4  10737.4  11274.3  12884.9  12884.9 orderStatus
 3621.9s        0          960.8         3111.9  10737.4  17179.9  17179.9  19327.4 payment
 3621.9s        0           36.3          311.3   8589.9  11274.3  15032.4  15032.4 stockLevel
 3622.9s        0            0.0          311.0      0.0      0.0      0.0      0.0 delivery
 3622.9s        0            3.0         3111.0  20401.1  21474.8  21474.8  21474.8 newOrder
 3622.9s        0            0.0          311.3      0.0      0.0      0.0      0.0 orderStatus
 3622.9s        0            1.0         3111.0  19327.4  19327.4  19327.4  19327.4 payment
 3622.9s        0            0.0          311.2      0.0      0.0      0.0      0.0 stockLevel
_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
 3623.9s        0            2.0          311.0  11274.3  25769.8  25769.8  25769.8 delivery
 3623.9s        0           46.0         3110.1  21474.8  22548.6  24696.1  24696.1 newOrder
 3623.9s        0            5.0          311.2  18253.6  18253.6  18253.6  18253.6 orderStatus
 3623.9s        0           29.0         3110.2  11274.3  21474.8  21474.8  21474.8 payment
 3623.9s        0           14.0          311.1  15569.3  18253.6  20401.1  20401.1 stockLevel
 3624.9s        0            2.0          310.9  11274.3  13958.6  13958.6  13958.6 delivery
 3624.9s        0           25.1         3109.3  11274.3  20401.1  21474.8  21474.8 newOrder
 3624.9s        0            6.0          311.2  13958.6  18253.6  18253.6  18253.6 orderStatus
 3624.9s        0           34.1         3109.3  11274.3  20401.1  20401.1  20401.1 payment
 3624.9s        0            1.0          311.0  18253.6  18253.6  18253.6  18253.6 stockLevel
Error: error in newOrder: ERROR: result is ambiguous: error=failed to connect to n5 at ydb-vla-dev04-002.search.yandex.net:26258: operation "conn heartbeat" timed out after 6s (given timeout 6s): grpc: context deadline exceeded [code 4/DeadlineExceeded] [exhausted] (last error: failed to connect to n5 at ydb-vla-dev04-002.search.yandex.net:26258: operation "conn heartbeat" timed out after 6s (given timeout 6s): grpc: context deadline exceeded [code 4/DeadlineExceeded]) (SQLSTATE 40003)
DarrylWong commented 5 months ago

I took a look at this and I ruled out the following things:

  1. Connection Lifetime/Jitter. Changing it to 5 minutes didn't fix it, and the default should be 5 minutes anyway.
  2. GC TTL on SQL stats. Was changed from 4 hours to 1 hour in 23.2, but reverting didn't change anything.
  3. AUTO SQL STATS Job. SQL STATS for tpcc.public.order_line was taking a long time (probably expected as it is the largest table). Disabling didn't change anything.

One thing I was able to narrow in on was that this issue doesn't pop up at the 60 minute mark, but rather the 90 minute mark. I ran the workload with an arbitrary ramp up of 17m and we see the spike after 73m after ramp up or 90m after workload run was called. Does something running every 90 minutes ring a bell to anyone?

One more thing I tried was to run everything the same, but with only 1000 warehouses. We don't see any spike at all, suggesting that this is maybe an overload issue like originally mentioned? This checks out as we also don't see this spike in our tpcc/weekly roachtest which runs with 5 nodes and 1000 warehouses for 4 days.

image

No concrete answer from my investigations, but figured I'd share to save others the time of going through the same rabbit holes I did. Happy to help try anything else others can think of too.

srosenberg commented 5 months ago

We're also seeing a high number of pushed transactions which appears to co-relate precisely with the spike onset, e.g.,

transaction_restarts

@nvanbenschoten Any idea(s) what the source(s) of contention might be? Is there a vmodule which could help with identifying the sources of contention?

nvanbenschoten commented 5 months ago

Any idea(s) what the source(s) of contention might be? Is there a vmodule which could help with identifying the sources of contention?

The contention is likely a consequence of the slowdown, not a cause. TPC-C with wait times has low contention under steady state. However, when transactions slow down and the number of concurrent transactions increases, contention can rapidly increase.

eivanov89 commented 5 months ago

Hi folks, is there anything else I can do to help you triaging the issue?