facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org
GNU General Public License v2.0
28.59k stars 6.32k forks source link

scaling is not linear per CPU #619

Closed ekg closed 6 years ago

ekg commented 9 years ago

I have a read-bound application (aligning DNA reads to a graph reference genome). I notice that I can align about 1000 reads/second on a single CPU. However, using 32 provides only 8000 reads/second. Obviously there is some contention preventing near-linear scaling. Is it possible this is driven by rocksdb itself? Or am I seeing some kind of system-level contention?

In principle, what might be limiting the performance when I increase the number of threads? Is there any locking in read-only mode? For instance, is there a shared cache? I believe I've set up cache sharding correctly but I am not sure.

ekg commented 9 years ago

If interested my rocksdb code is @ https://github.com/ekg/vg/blob/master/index.cpp

haneefmubarak commented 9 years ago

Try systematically going from 1, 2, 3 ... 32 cores. See how many reads / second you can achieve in each scenario and then post the results here. What hardware are you running / testing on?


AFAIK there aren't any 32 core general processors that are widely available yet. Given that I'm guessing that you are either using SMT (hyperthreading) or a multiple processor system. If you are using SMT, you should know that multiple hardware threads don't have the same performance as multiple cores - so while scaling would be linear across cores, it would be sublinear across hardware threads. If you are using multiple processors, then memory accesses would have to travel through an interconnect, which would increase over latency per memory access, thus reducing overall processing throughput.

Alternatively, it is also possible that your hardware is IO starved. This is likely to occur fairly rapidly on HDDs as the disk head needs to seek to various positions before reads or writes can occur. Ideally, this is solved by using SSDs as there is no seek time, although different (albeit less initially noticeable) issues can occur with those. However, even with SSDs, it is possible that there is insufficient bandwidth to fully support the necessary reads and writes. This can be mitigated by having adequate RAM, which allows the OS to cache the data in memory.

mdcallag commented 9 years ago

Is the block cache large enough to cache the database? Is max_open_files large enough to cache all database files?

How many CPU cores do you have? And is that with or without hyperthreading?

What is your workload for which you want better scaling? Read-only? Read-write where each thread does both reads & writes? Or read-write where some threads only do reads? Threads that do reads & writes can get stalled on writes thus limiting the throughput increase with more threads.

I think you should set max_bytes_for_level_base to make sizeof(L1) = 4 X target_file_size_base because sizeof(L1) is 10MB by default and you have a 256M write buffer so the sizeof(L0) files will be 256M assuming no compression.

siying commented 9 years ago

@ekg give options.max_open_files=-1. This will remove the table cache contention. For your 32 cores, how many sockets are they in, and how many physical cores?

siying commented 9 years ago

Our own benchmark shows we scale pretty well up to 16 cores and still serve more requests until 32 cores in a 32 core server: https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf

We call it scaling to multicores because, there are two sockets, each has 16 cores. It is a reasonable results for a scalable system.

If you have stats on, can you shared us statistics? Or if you made some CPU profiling, we are also happy to take a look.

mdcallag commented 9 years ago

Siying - for which workloads are you making that claim? 1) read-only with cached database 2) read-only with not-cached database and enough IO throughput 3) read-only with not-cached database and not enough IO throughput 4) not read-only with cached database 5) not read-only with not cached database and enough IO throughput 6) not read-only with not cached database and not enough IO throughput

And then for the poster, what is the workload?

For 4, 5 and 6:

For 1, and maybe 2 you should be happy with concurrent performance assuming there are enough shards for the block cache LRU and max_open_files is large enough. By definition 3 doesn't scale because IO is the bottleneck.

Read-write workloads are interesting. If there are read-only threads independent of the read-write or write-only threads then the read-only threads should scale for cases 1 and 2. But if all threads do a mix of reads and writes, then write stalls can limit scaling even though read latencies should be OK.

On Fri, May 29, 2015 at 10:41 AM, Siying Dong notifications@github.com wrote:

Our own benchmark shows we scale pretty well up to 16 cores and still serve more requests until 32 cores in a 32 core server: https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf

We call it scaling to multicores because, there are two sockets, each has 16 cores. It is a reasonable results for a scalable system.

If you have stats on, can you shared us statistics? Or if you made some CPU profiling, we are also happy to take a look.

— Reply to this email directly or view it on GitHub https://github.com/facebook/rocksdb/issues/619#issuecomment-106883011.

Mark Callaghan mdcallag@gmail.com

siying commented 9 years ago

@mdcallag it's "memory-only" (not backed by persistent storage at all) workload shown by here: https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks it's "read while writing" tests with ever thing only going to memory. As normal db_bench setting: threads don't mix read and write. One writer thread and lots of reader threads.

I should have been more clear with the benchmark setting to avoid confusion.

paultuckfield commented 9 years ago

compile this tool: https://github.com/yoshinorim/quickstack, then use it to take 5-10 samples of stack traces. I have a somewhat fancy script that post processes this output, if it works it'll save you some time, but it might be finicky outside my environment: https://gist.github.com/paultuckfield/4fe3eef495918cc81beb

If your problem is sustained, that output will reveal exactly what is going on. If you have trouble, send me either the output of that script, or the raw stack traces.

gfosco commented 6 years ago

Closing this via automation due to lack of activity. If discussion is still needed here, please re-open or create a new/updated issue.