cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

storage: store capacity was 100% though the used cluster capacity was at 60% #134485

Open nameisbhaskar opened 5 days ago

nameisbhaskar commented 5 days ago

Node 75 has gone down in the drt-scale cluster. The node went down because its disk filled up. The behaviour is strange because only one store's capacity usage went to 100%. Here is how the store sizes changed:

The graph shows that around 12:00 PM the capacity usage of store 298 went down and then kept increasing until it reached 100%, even though the overall capacity usage of the cluster is only around 60%.
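For anyone triaging a similar divergence, the per-store numbers behind this graph can also be pulled straight from SQL. Below is a minimal sketch in Go, assuming a local connection and the `crdb_internal.kv_store_status` virtual table (the `capacity`/`used` column names are as documented in recent versions, but treat them as an assumption):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/jackc/pgx/v5"
)

// Illustrative sketch: rank stores by disk usage so a store that is far
// above the cluster-wide average (like s298 here) stands out immediately.
// Column names in crdb_internal.kv_store_status are an assumption.
func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	rows, err := conn.Query(ctx, `
		SELECT node_id, store_id,
		       used::FLOAT / capacity::FLOAT AS frac_used
		FROM crdb_internal.kv_store_status
		ORDER BY frac_used DESC`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var nodeID, storeID int64
		var fracUsed float64
		if err := rows.Scan(&nodeID, &storeID, &fracUsed); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("n%d/s%d: %.0f%% used\n", nodeID, storeID, fracUsed*100)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```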

Debug zip location - https://console.cloud.google.com/storage/browser/_details/150-node-cluster-debug/debug-zips/2024_11_07_02_35_20.zip;tab=live_object?project=cockroach-drt
Slack thread - https://cockroachlabs.slack.com/archives/C07HPMBLVJ7/p1730932142869589
Datadog link - https://us5.datadoghq.com/dashboard/pbe-ic2-3qt/drt?fromUser=true&refresh_mode=paused[…]d-scale&from_ts=1730872800000&to_ts=1730933220000&live=false

Jira issue: CRDB-44105

Epic CRDB-44205

blathers-crl[bot] commented 5 days ago

Hi @nameisbhaskar, please add branch-* labels to identify which branch(es) this C-bug affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

itsbilal commented 5 days ago

From a storage standpoint I don't see any snapshot-pinned bytes or any compaction anomalies that'd explain why this store ran out of disk space. However, I do see that while the other stores on this node rebalanced replicas away, this one didn't:

[Screenshot 2024-11-07 at 2:32 PM: per-store replica counts]
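A rough way to reproduce what that screenshot shows, without the console, is to poll the replica count per store on the affected node over time. Again just a sketch, with the node ID and the `range_count` column assumed:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/jackc/pgx/v5"
)

// Illustrative sketch: watch whether each store on a node is shedding
// replicas. In the incident above, the other stores on n75 dropped replicas
// while s298 did not.
func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	const nodeID = 75 // node under investigation (assumption for this sketch)
	for {
		rows, err := conn.Query(ctx,
			`SELECT store_id, range_count
			 FROM crdb_internal.kv_store_status
			 WHERE node_id = $1
			 ORDER BY store_id`, nodeID)
		if err != nil {
			log.Fatal(err)
		}
		for rows.Next() {
			var storeID, rangeCount int64
			if err := rows.Scan(&storeID, &rangeCount); err != nil {
				log.Fatal(err)
			}
			fmt.Printf("%s s%d: %d replicas\n", time.Now().Format(time.RFC3339), storeID, rangeCount)
		}
		rows.Close()
		if err := rows.Err(); err != nil {
			log.Fatal(err)
		}
		time.Sleep(time.Minute)
	}
}
```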

I also see replicate queue errors in the logs around that time. Not sure if these are related:

error sending couldn't accept ‹range_id:1257514 coordinator_replica:<node_id:75 store_id:298 replica_id:1 type:VOTER_FULL > recipient_replica:<node_id:105 store_id:418 replica_id:5 type:LEARNER > delegated_sender:<node_id:75 store_id:298 replica_id:1 type:VOTER_FULL > term:8 first_index:91 sender_queue_name:REPLICATE_QUEUE descriptor_generation:10335 queue_on_delegate_len:-1 snap_id:c1b047f5-0a15-4e4e-8754-c309160ad62d ›: recv msg error: grpc: ‹giving up during snapshot reservation due to cluster setting "kv.snapshot_receiver.reservation_queue_timeout_fraction": context deadline exceeded› [code 4/DeadlineExceeded]
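For context on that error: the cluster setting it names caps how long a receiver will wait in the snapshot reservation queue, as a fraction of the snapshot's overall deadline, and it gives up with DeadlineExceeded once that budget is spent. The Go sketch below only illustrates that general pattern; it is not the actual snapshot receiver code, and the helper names, semaphore, and fraction value are all made up:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"golang.org/x/sync/semaphore"
)

// errReservationTimeout mimics the "giving up during snapshot reservation"
// failure mode described in the log line above.
var errReservationTimeout = errors.New("giving up during snapshot reservation: queue wait budget exhausted")

// reserveSnapshotSlot waits for a reservation slot, but only for `fraction`
// of the time remaining before the context's deadline, leaving the rest of
// the deadline for actually streaming the snapshot.
func reserveSnapshotSlot(ctx context.Context, sem *semaphore.Weighted, fraction float64) error {
	deadline, ok := ctx.Deadline()
	if !ok {
		// No deadline on the incoming context: just wait.
		return sem.Acquire(ctx, 1)
	}
	budget := time.Duration(fraction * float64(time.Until(deadline)))
	queueCtx, cancel := context.WithTimeout(ctx, budget)
	defer cancel()
	if err := sem.Acquire(queueCtx, 1); err != nil {
		// Budget (or parent deadline) expired while queued.
		return errReservationTimeout
	}
	return nil
}

func main() {
	sem := semaphore.NewWeighted(1)
	ctx, cancel := context.WithTimeout(context.Background(), time.Hour)
	defer cancel()

	// The first reservation takes the only slot; the second queues behind it
	// and gives up once its fractional wait budget runs out.
	_ = reserveSnapshotSlot(ctx, sem, 0.4)
	shortCtx, cancel2 := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel2()
	fmt.Println(reserveSnapshotSlot(shortCtx, sem, 0.4))
}
```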

Either way, this is a KV / allocator issue; we should have rebalanced away from this store (like we did from the other stores), but for some reason we didn't.
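To spell out the expectation in that last sentence: once a store's disk usage runs well ahead of its peers, the allocator should stop placing new replicas there and shed existing ones. Below is a toy sketch of that decision with invented thresholds; the real allocator logic in pkg/kv/kvserver is far more involved:

```go
package main

import "fmt"

// Store is a minimal stand-in for the allocator's view of a store's capacity.
type Store struct {
	ID       int
	Capacity int64 // bytes
	Used     int64 // bytes
}

func (s Store) fractionUsed() float64 { return float64(s.Used) / float64(s.Capacity) }

// shouldRebalanceAway is a toy version of the decision described above:
// shed replicas from a store that is nearly full in absolute terms, or whose
// usage is well above the mean of its peers. The thresholds are invented for
// illustration and do not correspond to real cluster settings.
func shouldRebalanceAway(s Store, peers []Store) bool {
	const fullThreshold = 0.925 // nearly full: shed unconditionally
	const overMeanSlack = 0.05  // or: noticeably above the cluster mean

	var sum float64
	for _, p := range peers {
		sum += p.fractionUsed()
	}
	mean := sum / float64(len(peers))
	return s.fractionUsed() >= fullThreshold || s.fractionUsed() > mean+overMeanSlack
}

func main() {
	peers := []Store{
		{ID: 297, Capacity: 1 << 40, Used: 600 << 30}, // ~59% used
		{ID: 299, Capacity: 1 << 40, Used: 620 << 30}, // ~61% used
	}
	hot := Store{ID: 298, Capacity: 1 << 40, Used: 1000 << 30} // ~98% used
	fmt.Println(shouldRebalanceAway(hot, append(peers, hot)))  // true: should have shed replicas
}
```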