cockroachdb / cockroach


OOM caused by too many logs to be processed in the raft queue #48085

Closed heidawei closed 3 years ago

heidawei commented 4 years ago

Describe the problem

Please describe the issue you observed, and any steps we can take to reproduce it: a CockroachDB cluster of 16 nodes, one store per node, 3 TB capacity per store. On one node, RocksDB writes were blocked due to too many L0 files; there are more than 50,000 replicas on every node. That node then OOMs again and again.

Top 10 functions by heap usage: [screenshot]

Used memory: [screenshots]

SSTable stats: [screenshot]

From the heap pprof I can see that there are too many Raft heartbeat requests in the Raft queue, but they cannot be processed because the preceding Raft requests are blocked on RocksDB.

To reproduce: start CockroachDB (I limit the RocksDB cache memory):

./cockroach start --insecure --advertise-addr={host ip} --cache=0.05 --store=path={data path},rocksdb="block_based_table_factory={cache_index_and_filter_blocks=true;pin_l0_filter_and_index_blocks_in_cache=true}",attrs=stats --background

400 concurrent writes, 1KB per record, total 10,000,000,000 records.
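Something roughly equivalent can be driven with the built-in kv workload; this is only a sketch, since my real pressure test is a custom client and these flags are just an approximation of 400 concurrent ~1 KB writes:

# Sketch only: approximate the pressure test with the bundled kv workload.
./cockroach workload init kv 'postgresql://root@{host ip}:26257?sslmode=disable'
./cockroach workload run kv --concurrency=400 --read-percent=0 --min-block-bytes=1024 --max-block-bytes=1024 'postgresql://root@{host ip}:26257?sslmode=disable'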



petermattis commented 4 years ago

On one node, RocksDB writes were blocked due to too many L0 files

@heidawei RocksDB has pathological compaction behavior in certain circumstances that leads to a pile-up of files in L0 and Lbase (the first non-empty level below L0). This causes poor performance, but is almost certainly not the cause of the OOMs.

Note that 3TB of data per node is significantly larger than what we typically recommend. The soon to be released 20.1 will improve the situation here, but not completely solve it.

Tagging @tbg for KV triage on the OOM.

heidawei commented 4 years ago

Yes, too many files in L0 is not the direct cause. A large backlog of Raft heartbeat requests cannot be processed, which should be why Go memory keeps rising. We can see from the heap pprof that a lot of memory is held in the queue-processing function for Raft heartbeat requests.

heidawei commented 4 years ago

"Note that 3TB of data per node is significantly larger than what we typically recommend. The soon to be released 20.1 will improve the situation here, but not completely solve it."

What is the recommended capacity per node?

tbg commented 4 years ago

@heidawei do you have logs for this cluster? Concretely I'm interested in whether node liveness heartbeats are timing out. The inverted LSM shape can act essentially like a disk stall, blocking writes and causing a pile-up of work which can ultimately lead to the OOM. In particular, it can prevent Raft groups (=Replicas) from going into quiesced mode (where they require no heartbeats) so you end up in a state where all of your 50k+ replicas are continuously heartbeating. However, the fact that you're reporting only one node having that issue suggests that the problem might be more localized, i.e. only the node itself gets overloaded by requests enough to exceed available memory.

Am I reading your message correctly in that you have 400 clients sending 1kb-sized writes as fast as possible? Are these writes distributed or are there ranges that act as hot spots?

My first suggestion, assuming you want to get out of this situation ASAP, would be taking the node down and compacting the LSM offline, via the ./cockroach debug compact command (@petermattis any concerns with this approach since this cluster is running pebble?).
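Roughly, with the cockroach process on that node stopped, it would look something like the following (a sketch; {data path} is the store path placeholder from your start command, and how you stop and restart the process depends on your deployment):

# Sketch: compact the stopped node's store offline, then restart the node
# with its usual start command.
./cockroach debug compact {data path}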

I'll defer to my colleagues for recommending a suggested capacity per node.

knz commented 4 years ago

What is the recommended capacity per node?

Today we have good reports from other users using less than 1TB per node or about 15000 replicas. It's possible to store more in a cluster by increasing the number of nodes. However, there may be some issues when the total number of ranges reaches 100k-200k or above.

All in all, your deployment exceeds the capacity at which we typically test CockroachDB; you are stretching the limits. It is very interesting for us to see your experience, and we can learn a lot from it. However, it is not surprising that you encounter some issues.

heidawei commented 4 years ago

@tbg Thank you for your reply. I have a pressure-test program which writes to the cluster with 400 concurrent writers as fast as possible, at 80K+ TPS, and there should be no hot spots, because I write randomly. My intention is to test the performance of CRDB on the mechanical disk to verify the boundary of a safe operation space. When the OOM occurs, only this node's L0 files have a backlog. Before the OOM occurs, this node is shown as a "suspect node" in the UI.

Later, another node also became a "suspect node". Its logs reported various heartbeat timeouts (4.5s) and lease application timeouts. However, there was no file backlog on that node.

If the backlog were caused by a hot range, there would be a file backlog on three nodes' disks, because there are three replicas.

From the analysis of the heap pprof (I read that part of the code), it is certain that a large number of Raft heartbeat requests have not been processed and are still sitting in the queue. Considering that Raft applies its log serially, this speculation matches the result very well.

As for disk monitoring, disk I/O reached 100 MB/s during this period, which is the limit (it stayed there throughout the pressure test). The logs show that a large number of snapshot transfers took place.

Later, I stopped the pressure test and waited a few hours for the cluster to return to normal. The file backlog on the node where the OOM occurred also disappeared. The disk usage of each node in the cluster is about 200 GB.

I need to spend a little time looking for the cluster logs, because the cluster has been destroyed now.

heidawei commented 4 years ago

[screenshot]

In addition, I reset two load-balancing parameters (see the screenshot above).

heidawei commented 4 years ago

[screenshots]

Sorry, the cluster has been destroyed and I am unable to find the logs. I have added more screenshots (above) of our communication at the time.

ajwerner commented 4 years ago

My intention is to test the performance of CRDB on the mechanical disk to verify the boundary of a safe operation space.

Mechanical disks are not a well-tested medium with CockroachDB, which eagerly uses IOPS. There is a cluster setting which can help reduce the number of fsyncs:

rocksdb.min_wal_sync_interval

It defaults to 0. Given a mechanical disk only offers at most ~200 random IO operations per second, it might be valuable to set this setting over 1ms and maybe even as high as 10ms on a spinning disk. This will have obvious implications for write latency.
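For example, it can be changed from a SQL shell; a sketch, assuming the insecure cluster from the reproduction above, with 1ms just being the low end of the range mentioned here:

# Sketch: raise the minimum interval between WAL syncs (value illustrative).
./cockroach sql --insecure -e "SET CLUSTER SETTING rocksdb.min_wal_sync_interval = '1ms';"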

ajwerner commented 4 years ago

You also should turn the snapshot rate down to somewhere substantially below the maximum throughput of the underlying storage device.
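If I remember correctly, the relevant knobs are the snapshot rate settings; a sketch (the setting names and the 2 MiB value are from memory and illustrative, not a verified recommendation):

# Sketch: cap snapshot throughput well below the disk's sequential limit.
./cockroach sql --insecure -e "SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '2MiB';"
./cockroach sql --insecure -e "SET CLUSTER SETTING kv.snapshot_recovery.max_rate = '2MiB';"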

heidawei commented 4 years ago

@ajwerner I wonder if you could provide parameter suggestions for different disk media, such as the snapshot transfer rate and the cache limit (the official example uses cache=0.25, but in our actual communication we were told that proportion is too high). The default 8 MB snapshot rate seems a little small. When a node crashes, I think completing fault recovery as soon as possible matters more than continuing to serve traffic: although we want to reduce the impact on the service, the longer recovery takes, the worse the cluster's stability. This scenario is different from the normal one, where rebalancing can proceed slowly to reduce the impact on the service.

heidawei commented 4 years ago

Today we have good reports from other users using less than 1TB per node or about 15000 replicas. It's possible to store more in a cluster by increasing the number of nodes. However, there may be some issues when the total number of ranges reaches 100k-200k or above.

@knz This information is very important to me. The reason we are researching CRDB is that we expect it to solve storage and compute problems for massive data, so how much data it can handle is one of the key points of our research. Assuming a range holds about 64 MB (ignoring invalid MVCC history versions), 100k-200k ranges correspond to roughly 6 TB to 12 TB of effective data in the cluster; with the three-replica configuration, the total cluster size is about 18 TB to 36 TB. If a node has 500 GB of total disk capacity and, leaving some margin, actually uses about 300 GB, then the maximum scale of the cluster is about 60 to 120 nodes. At roughly 1 KB per row, that stores about 6,000,000,000 to 12,000,000,000 rows of data.
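Spelling that out as a quick back-of-the-envelope script (just restating the assumptions above: 64 MB per range, 3 replicas, ~300 GB used per node, ~1 KB per row):

# Rough sizing sketch; all inputs are the assumptions stated above.
for ranges in 100000 200000; do
  logical_gb=$((ranges * 64 / 1024))                 # effective data at 64 MB/range
  raw_gb=$((logical_gb * 3))                         # with 3 replicas
  nodes=$((raw_gb / 300))                            # at ~300 GB used per node
  rows_billion=$((ranges * 64 * 1024 / 1000000000))  # at ~1 KB per row
  echo "${ranges} ranges: ~${logical_gb} GB logical, ~${raw_gb} GB raw, ~${nodes} nodes, ~${rows_billion}B rows"
done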

heidawei commented 4 years ago

It defaults to 0. Given a mechanical disk only offers at most ~200 random IO operations per second, it might be valuable to set this setting over 1ms and maybe even as high as 10ms on a spinning disk. This will have obvious implications for write latency.

When rocksdb.min_wal_sync_interval is set to 1ms or 10ms, is it safe for a node to crash and restart again, i.e. can its data be recovered? I think maybe we would need to drop the crashed node and add a new one; is that plan feasible? I mean, it is difficult to verify whether a node crash causes data loss. Because RocksDB's WAL sync is decoupled from the service (Raft and KV), it is unclear whether Raft can be used to recover this node's lost data, if any has been lost. So it seems safer to simply drop this node and add a new one.

ajwerner commented 4 years ago

When rocksdb.min_wal_sync_interval is set to 1ms or 10ms, is it safe for a node to crash and restart again, i.e. can its data be recovered?

That setting does not generally affect snapshots. It generally just applies to foreground traffic.

I think maybe we would need to drop the crashed node and add a new one; is that plan feasible?

This is somewhat reasonable as a mitigation for #37906. It's unfortunate, since most realistic workloads at this volume of data are likely not to have much traffic to all ranges. I don't understand the rest of your comments that follow the quoted text.

heidawei commented 4 years ago

I don't understand the rest of your comments that follow the quoted text.

@ajwerner I have looked at issue #37906 carefully, but it's not my concern. My doubt is: when I set rocksdb.min_wal_sync_interval to 1ms or 10ms and a node crashes, is it safe to restart that node, i.e. will there be no data loss? Since this setting carries a risk of data loss for RocksDB, if data is lost, can it be recovered via the Raft consensus algorithm when the service restarts?

tbg commented 4 years ago

@heidawei the wal_sync_interval setting does not put your data at risk since the sync attempts will only ack after the sync has happened. You are thus trading better batching of writes for a higher minimum latency for I/O.

heidawei commented 4 years ago

@tbg Thank you. I read the relevant code and my doubts have been dispelled.

knz commented 4 years ago

We've had an internal discussion about your setup:

Hope this helps.

heidawei commented 4 years ago

Disk /dev/sdb: 4000.2 GB, 4000225165312 bytes, 7812939776 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 262144 bytes / 262144 bytes

time dd if=/dev/zero of=/media/diskpool/dd.log bs=8k count=100000
100000+0 records in
100000+0 records out
819200000 bytes (819 MB) copied, 0.511894 s, 1.6 GB/s
real 0m0.513s  user 0m0.013s  sys 0m0.498s

time dd if=/dev/zero of=/media/diskpool/dd.log bs=8k count=100000 oflag=sync
100000+0 records in
100000+0 records out
819200000 bytes (819 MB) copied, 130.917 s, 6.3 MB/s
real 2m11.035s  user 0m0.025s  sys 0m8.118s

time dd if=/dev/zero of=/media/diskpool/dd.log bs=8k count=100000 oflag=dsync
100000+0 records in
100000+0 records out
819200000 bytes (819 MB) copied, 132.057 s, 6.2 MB/s
real 2m12.199s  user 0m0.052s  sys 0m7.804s

time dd if=/media/diskpool/dd.log of=/dev/null bs=8k count=100000 iflag=direct
100000+0 records in
100000+0 records out
819200000 bytes (819 MB) copied, 3.52908 s, 232 MB/s
real 0m3.530s  user 0m0.014s  sys 0m0.698s

time dd if=/media/diskpool/dd.log of=/dev/null bs=8k count=100000
100000+0 records in
100000+0 records out
819200000 bytes (819 MB) copied, 0.328229 s, 2.5 GB/s
real 0m0.329s  user 0m0.007s  sys 0m0.258s

time dd if=/media/diskpool/dd.log of=/media/diskpool/of.log bs=8k count=100000
100000+0 records in
100000+0 records out
819200000 bytes (819 MB) copied, 0.675913 s, 1.2 GB/s
real 0m0.677s  user 0m0.011s  sys 0m0.664s

@knz This is the disk info. Maybe I need cgroups to avoid NUMA effects.

knz commented 4 years ago

1) These are sequential I/O numbers - I would be surprised if these drives could sustain these rates with random I/O.

2) You can see ~6 MB/s with the synchronous flag. This is also what CockroachDB uses; we do not support direct I/O. So this is a smoking gun that your setup is IOPS-constrained.
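If you want a number closer to CockroachDB's actual write pattern, a synchronous random-write test is more representative than sequential dd; for example, something like this (assuming fio is available on the host; flags illustrative):

# Sketch: 8k random writes with an fdatasync after every write.
fio --name=sync-randwrite --filename=/media/diskpool/fio.dat --size=1G --rw=randwrite --bs=8k --ioengine=psync --fdatasync=1 --runtime=60 --time_based --group_reporting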

ghost commented 4 years ago

What is the recommended capacity per node?

Today we have good reports from other users using less than 1TB per node or about 15000 replicas. It's possible to store more in a cluster by increasing the number of nodes. However, there may be some issues when the total number of ranges reaches 100k-200k or above.

All in all, your deployment exceeds the capacity at which we typically test CockroachDB; you are stretching the limits. It is very interesting for us to see your experience, and we can learn a lot from it. However, it is not surprising that you encounter some issues.

What's the relation to the hardware? Does the ~1 TB per node limit hold for any hardware configuration?

knz commented 4 years ago

(disclaimer: the rest of my response is general and may not apply to your situation specifically)

Some code paths in crdb are constrained by single core performance, and some others by the bandwidth of individual memory controllers. These typically do not increase when you add more cores or more memory. In fact, with Intel based systems you can even starve your cores of memory bandwidth past certain core counts.

We have observed in the past that the capacity increase of cockroachdb "tapers off" beyond certain core counts for this reason, and adding more nodes only helps increase total "cold" storage capacity, not "hot" (i.e. query performance can decrease if the amount of hot data increases beyond a certain point)

Theoretically it may be possible to break these limits using different hardware architectures, maybe IBM POWER could help, but we never tried it.

To be honest, the better improvement would come from optimizing the software, not the hardware. We are making plans to do just that in future releases.

ghost commented 4 years ago

Which command shows the SSTable stats?

[screenshot]

heidawei commented 4 years ago

@wangchuande You can find it in logs/cockroach.log.
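For example, something along these lines should pull the periodically logged storage stats out of the node log (the exact marker text may differ between versions, so treat this as a sketch):

# Sketch: look for the periodic sstable/level stats in the node's log.
grep -n -A 12 "sstables" logs/cockroach.log | tail -n 40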

ajwerner commented 3 years ago

Okay, so, I think it's clear that we're dealing with compaction falling behind, leading to a buildup of outstanding Raft requests and ultimately the OOM. It's also possible that subcompactions in Pebble in 20.2 will help improve the write amplification.

This issue doesn't feel actionable presently and so I'm going to close it.