cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

CockroachDB may experience sudden high latency across the entire cluster during write operations. #129210

Closed wuhuua closed 1 week ago

wuhuua commented 2 months ago

Describe the problem

I deployed a 50-node cluster running a self-compiled CockroachDB v24.2.0-dev. When I start a batch write against the cluster, latency sometimes spikes suddenly across the entire cluster.

[Image: WeCom screenshot 61ff9d75-b93b-4e40-8d23-20efae39de88]

There were no network problems while the writes were running, so I wonder why these sudden latency spikes occur.

My steps to run the cluster and write process:

  1. Set up CockroachDB cluster:

    /symmetricdbserver start --certs-dir=/local/certs --store=/symmetricdb/symmetricdb-data --listen-addr {{env "attr.unique.network.ip-address"}}:26259 --http-addr 0.0.0.0:8081 --join 0.symmetricdb-cluster-secure.service.consul:26259

  2. Send SQL ... / CLI command ... The batch write SQL runs inside a transaction, as follows:

    # In a transaction
    INSERT INTO common_controls(created_at, updated_at, deleted_at, uin, bind_new_card_errcode, final_support_bank_list,
    bank_limit_priority, id)
    VALUES (_, __more__) RETURNING id, id
    INSERT INTO pay_methods(created_at, updated_at, deleted_at, uin, pay_method_type, account_type, bind_serial,
    bank_type, pay_method_name, logo_url, default_card_setting_state, default_favor_compose_id, account_id,
    is_default_pay_method, pay_method_color_type, id)
    VALUES (_, __more__) RETURNING id, id

Environment:

Jira issue: CRDB-41447

blathers-crl[bot] commented 2 months ago

Hi @wuhuua, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] commented 2 months ago

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:


DrewKimball commented 2 months ago

Hi @wuhuua, thanks for the report. To help me investigate, could you gather some information for me:

You can send the results to me privately via this link.

wuhuua commented 2 months ago

tsdump sent, with a detailed description

DrewKimball commented 2 months ago

It looks like CPU is intermittently reaching ~100% on individual nodes. KV requests to a high-utilization node are delayed, and since we're bulk-inserting, a given SQL statement is pretty likely to hit the slow node.
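As a rough illustration of why a bulk insert is likely to touch the slow node, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each range a statement writes to has its leaseholder placed uniformly and independently across the nodes; real placement is not that simple, and the function name and the number of ranges touched are hypothetical. Only the 50-node cluster size comes from the report.

```python
def prob_hits_slow_node(num_nodes: int, ranges_touched: int, slow_nodes: int = 1) -> float:
    """Chance that at least one touched range is served by a slow node.

    Assumption (not from the issue): leaseholders are spread uniformly and
    independently over the nodes, so each range misses the slow nodes with
    probability 1 - slow_nodes / num_nodes.
    """
    if num_nodes <= 0 or not 0 <= slow_nodes <= num_nodes:
        raise ValueError("need num_nodes > 0 and 0 <= slow_nodes <= num_nodes")
    if ranges_touched < 0:
        raise ValueError("ranges_touched must be non-negative")
    p_miss_one = 1 - slow_nodes / num_nodes
    return 1 - p_miss_one ** ranges_touched


if __name__ == "__main__":
    # 50 nodes as in the report; the ranges-touched values are made up.
    for k in (1, 10, 50, 100):
        print(f"ranges touched={k:3d}  P(hit slow node)={prob_hits_slow_node(50, k):.3f}")
```

Under these assumptions a single-range write rarely lands on the one busy node, but the chance grows quickly as a batch spreads across more ranges, which matches the cluster-wide latency the reporter sees.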

[Four screenshots, taken 2024-08-29 between 12:09:59 PM and 12:10:47 PM]

DrewKimball commented 2 months ago

A good next step would be to try and figure out why CPU utilization is so high on certain nodes. We should be collecting CPU profiles automatically, which you can access via the logging directory cockroach-data/logs/pprof_dump. Let's grab the profiles for several different nodes to make sure we capture an interesting period.
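One way to gather the newest profiles on each node is sketched below. It only lists whatever files are present in the dump directory, newest first, and makes no assumption about how the profile files are named. The helper name `recent_profiles` and the `limit` parameter are hypothetical, and the directory would need adjusting to the node's actual log location (the reporter uses a custom `--store` path).

```python
from pathlib import Path


def recent_profiles(pprof_dir, limit=5):
    """Return up to `limit` regular files from `pprof_dir`, newest first.

    Returns an empty list if the directory does not exist.
    """
    directory = Path(pprof_dir)
    if not directory.is_dir():
        return []
    files = [p for p in directory.iterdir() if p.is_file()]
    # Sort by modification time so the most recent dumps come first.
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]


if __name__ == "__main__":
    # Path as given in the comment above; adjust per node.
    for path in recent_profiles("cockroach-data/logs/pprof_dump"):
        print(path.name, path.stat().st_size, "bytes")
```

Running this on several nodes and copying out the listed files around the time of a latency spike should make it easier to capture an interesting period.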

I also have a few questions about your hardware/workload:

mgartner commented 1 week ago

@wuhuua I'm going to close this issue for now. Please feel free to open a new issue (or reply to this issue) if you have further questions.