Add "Vertical Scalability" section

Gabriella439 / post-rfc

Blog post previews in need of peer review

Creative Commons Attribution 4.0 International

2.2k stars 170 forks source link

Add "Vertical Scalability" section #133

Open nponeccop opened 6 years ago

nponeccop commented 6 years ago

I don't know what is the right name for it, but currently there's no data in SOTU on Haskell performance on larger and more stressed servers. For example:

GC performance on large heaps and strange allocation patterns (e.g. all data are essentially static, or the younger generation is large etc) compared to some 4 allocators of JVM
performance on many-core (some 40+) machines which tend to be NUMA
ability to saturate high-end network cards and drives, e.g.:
- SSDs still require deep queues to get full IOPS, IOPS and/or bandwidth are so high that bottlenecks appear in unexpected places
- apache spark- or MPI-like network load, where a lot of data are moved across 40 gbit links (40GBE, ifiniband, myrinet, rack-wide PCIe switches - whatever)

Bad IO-stack can hurt performance, and so far I've seen only the performance of socket listener and I think there were some works on NUMA-scalability of GHC at Facebook.

Another issue is that if you cannot buy enough RAM (and many guys assume that you always can have 4x ram than you actually need) many strange things happen:

GC requires some empty space, so this RAM is essentially wasted compared to (say) GC-free Rust
You want to pack your data as dense as possible (and GHC tends to have huge overhead because of pointers/thunks/boxes/whatever) - e.g. try to put a lot of key-value pairs into a dictionary and you'll see that GHC uses way too much RAM than Python with naive code, because Python/Perl/whatever hashmaps are extremely tuned over years

Gabriella439 commented 6 years ago

This is an excellent topic choice since this has a huge impact on Haskell's suitability in several corporate environments. Were you interested in writing this up? If not, I can also speak to this a little bit myself

nponeccop commented 6 years ago

this has a huge impact on Haskell's suitability in several corporate environments

exactly

Were you interested in writing this up? If not, I can also speak to this a little bit myself

I'm rather interested in reading it to evaluate suitability of Haskell for my corporate needs :) Currently we use ancient legacy of Perl and C++ with some 32GB of mostly static heaps (because neither has GC and everybody in the house hate Java irrationally). And things like external sorting tend to require low-level features to avoid disk cache pollution etc.

Gabriella439 commented 6 years ago

Alright, I'll dump some unassorted notes of mines soon (most likely tomorrow morning) for you to look over and then based on your feedback I'll write it up into a new section

nponeccop commented 6 years ago

You can use this issue as a draft.

Gabriella439 commented 6 years ago

First off, here are some notes from @j6carey, who has done a lot of performance work on our internal network protocol parser (a process with a large heap that consumes a high rate of network traffic):

At least for multi-threaded programs having mostly short-term and medium-term allocations:

We seem to get the best performance by having extremely large nursery areas (multiple gigabytes per capability, as set by +RTS -A...).

Performance seems to suffer as the number of GHC RTS capabilities increases past about 4; going to multiple processes fixes the problem (if you can do that efficiently).

Reducing the number of GC generations may help.

The amount of extra space you need for a copying collection is about the same as the size of the generation--and that does NOT count the nursery area, which helps.

GHC does not properly return free "megablocks" to the OS until GHC 8.2 (there was an arithmetic bug).

Haskell heap fragmentation becomes a problem if you allocate more than about 4000 bytes per ByteString in the Haskell heap. Try allocating larger ByteString chunks with malloc, though malloc memory can also fragment, and we see better fragmentation control with jemalloc than with glibc malloc.

On Linux 2.4.10+, when reading large streams of data from files, open them with the flag O_DIRECT and do large reads into aligned buffers. This yields higher performance than mmap, and should not pollute the disk cache. Try to read big blocks in one thread and parse in another. (We have not yet open-sourced our module which does this, but we could.)

Look at the Haskell core compilation output for the inner loop of whatever parser first sees the incoming bytes. Very often you can get significant improvements to GC churn and speed in general by avoiding dynamic evaluations/calls and by unboxing integers and pointers (often by means of strictness annotations, rather than direct use of Int#). It also helps to limit the amount of state floating around as unboxed values, since the best speed comes when those values are in registers. Once the volume of data has been reduced a bit by that first layer of parsing, these issues tend to become less critical.

Gabriella439 commented 6 years ago

My first draft of notes:

Space leaks are the biggest source of instability for Haskell services in general
Space leaks can be an even greater problem when you work with data in large blocks. For example, we process packet blocks that are (I think) at least 128 MB large so if you have a single dangling reference to that block it doesn't get freed, but for performance reasons you don't want to prematurely slice and copy the fragments you need. Minimizing copies and dangling references was an issue for us earlier on.
Prefer data structures that don't create a large number of objects on the heap (i.e. prefer Vector over List, Text over String, HashMap over Map). GHC uses a fairly straightforward generational garbage collector that in the worst case scales proportionally to the number of objects on the heap
For long-lived data (i.e. an index for a search engine, for example), you can use compact regions so that the garbage collector doesn't have to traverse that data on each garbage collection (Conceptually: you're reducing all that data to just one node on the garbage collection graph)
Compact regions also have the added benefit of guaranteeing that the data remains fully strict which helps improve stability
GHC provides heap profiling, but as far as I know it's only by size and not by number of objects. Usually you care about the latter when optimizing garbage collection.
Unlike Java there is no way to use a different garbage collection algorithm but there are lots of knobs to tune the existing one. You usually want to tune those for an application with a high memory residency since the defaults are not optimized for that use case (see @j6carey's comments above for more details)
For idle Haskell services on a machine with a large number of cores (i.e. 36), either disable threading (i.e. don't supply -with-rtsopts=-N) or disable the parallel garbage collector (i.e. add -with-rtsopts=-qb -with-rtsopts=-qg, I think). In the 36 core scenario the parallel garbage collector has a high baseline activity (i.e. ~100-200% CPU usage) and disabling it will get you back down to < 1% idle CPU usage. However, I believe the parallel garbage collector is a good default for a heavily loaded process.
For machines with NUMA, the solution we currently use is one process per NUMA node. Naively running one Haskell process across more than one NUMA node significant deteriorates performance. For example, on a machine with two NUMA nodes with 18 cores each, a Haskell process running on 18 cores on 1 NUMA node is faster (in absolute terms) than running on 36 cores over 2 NUMA nodes.
Don't use every single core. The rule of thumb is to always leave at least one core free as this improves performance and stability of the Haskell runtime. I forgot the exact reason why.
Haskell can saturate any interface that C can just using the FFI. The overhead of a safe FFI call is on the order of 100 nanoseconds to 1 microsecond, which is fast enough if you process data in large enough blocks. For example, our network protocol parser use shared memory to communicate with an upstream process that provides the raw packet blocks

j6carey commented 6 years ago

One more thing:

If your parsers suspend pending further input, then beware of making them monadic, as doing so may lead to space leaks. Consider:

parseRecord = do
  header <- liftA2 (,) parseInt parseInt
  len <- parseInt
  bytes <- parseBytes len
  return (header, bytes)

In the rather likely event that parseBytes suspends, the thunks for the two components of header :: (Int, Int) are hidden inside that suspension, with no way to evaluate them or to measure their sizes. Probably they still refer to large ByteString chunks, making their cost disproportionately high. If Alternative is also supported, then things can get buried even deeper, for quite some time.

Perhaps with enough explicit evaluation you could avoid this problem, but in realistically large parsers we have found it very difficult to clean out every last way in which monadic bindings can trigger leaks.

Now, the standard Arrow class hierarchy exhibits the same problem. But one can define variations on the arrow classes that impose class constraints on the input and output types of the arrows. Those class constraints should have methods that normalize and/or measure the otherwise-hidden runtime data. (But an analog of ArrowApply would be dangerous territory; how would one avoid run-time data escaping visibility, as in Monad?)

Another alternative is to keep the span of input that gave rise to the suspension, discard the suspension, and then re-parse the augmented input from the beginning once new data arrive. Of course, this approach involves redundant parsing, but for small records that can be quite acceptable. In fact, the memory footprint may actually be smaller, because most external data formats are reasonably compact in comparison with a parser suspension.

Gabriella439 commented 6 years ago

@nponeccop: Did the initial notes answer your questions or do you still have remaining questions?

nponeccop commented 6 years ago

They answer a different question: "What knobs are available to tune GHC on large machines"

And I'd like to see "where tuned GHC stands compared to tuned Java on large machines". The knobs info is usable, but not in the sotu document. A link in "Educational resources" section is better.

Below is my version of the "first draft":

Vertical scalability

This chapter covers single-server GHC performance consideration for "large" tasks, such as:

large heaps (tens of gigabytes and more)
many cores (generally NUMA machines with 2 or more cpu sockets)
MPI-style fast interconnection (10 Gbps+, low latency, low losses, few connections)

Heap scalability

Only one garbage collector with the usual tuning knobs.

We need a one paragraph overview of https://simonmar.github.io/bib/papers/parallel-gc.pdf or whatever is the current collector, using the terms familiar to Java collector tuners, such as https://www.slideshare.net/jclarity/low-pause-gc-in-hotspot and https://wiki.openjdk.java.net/display/shenandoah/Main

Manycore scalability

GHC scales to millions of green threads, but doesn't scale past 4 cores (underlying OS threads, RTS capabilies in GHC parlance)
General purpose OS threads (outside of the green threads scheduler) are available and scale just as in other languages
Spanning a process across NUMA nodes degrades performance significantly

A solution to both these problems is to run many processes with a message passing IPC.

Network scalability

Haskell's network stack is mostly tuned to the needs of the Web (HTTP and websocket servers with large number of connections but low per-connection bandwidth). But MPI bindings exist.

[haskell-mpi] - MPI-2 bindings (link against any standard MPI implementation)

Disk scalability

Fairly standard Disk IO - blocking file reads and memory-mapped files
No support for fast IO paths (e.g. O_DIRECT or aio) although it can be added through FFI.
No libraries specifically designed for managing outstanding IO queues, although rich library of general purpose concurrency primitives and data structures makes it easy

Gabriella439 commented 6 years ago

GHC scales to millions of green threads, but doesn't scale past 4 cores

What is this based on? I'm pretty sure GHC's concurrency runtime and garbage collection will scale past 4 cores. I could be wrong, though, because I haven't formally benchmarked this or studied this recently.

Fairly standard Disk IO - blocking file reads and memory-mapped files

I think it's important to clarify here that "blocking" means that it only blocks the current green thread, not the underlying OS thread or the runtime

Also, one thing that needs to be highlighted quite prominently is that, unlike Java or Go, the Haskell runtime can wrap userland FFI calls to C to be "safe" (meaning that they will also only block the current green thread) at the expense of less than 1 microsecond per call. I'm not aware of any other language that provides this guarantee and I think this is one of the killer features of Haskell in this domain since the absence of this feature is a common stability pitfall on the JVM.

The only thing that will ever block an OS thread or the runtime is an "unsafe" C call (which has much lower overhead on the order of nanoseconds, designed for very quick calls).

No support for fast IO paths (e.g. O_DIRECT or aio) although it can be added through FFI.

I would suggest removing the reference to aio because the Haskell runtime is already asynchronous under the hood (that's how IO, green threads, and the safe FFI calls are implemented) so there's no need for userland asynchronous IO. Conceptually, every IO action is equivalent to what other languages would call a "promise" and do notation for IO actions is just chaining promises.

Finally, another thing that needs to be highlighted prominently is that Haskell's concurrency is preemptive, not cooperative, which is another big benefit for stability in this domain.

nponeccop commented 6 years ago

I'm pretty sure GHC's concurrency runtime and garbage collection will scale past 4 cores.

My assumption was based on this quote:

Performance seems to suffer as the number of GHC RTS capabilities increases past about 4

Remember, I don't know what I'm talking about. It's just the type of things I'd like to see correctly described in the document (but without turning the document into a tutorial)

"blocking" means that it only blocks the current green thread, not the underlying OS thread or the runtime

The API visible to end-users is blocking and threads. It's what I meant. There is no event loop/async monad style api because we have really good green threads.

As for blocking all the threads - what happens if the number of green threads blocked by IO operations exceeds the number of RTS capabilities? Do calls that cannot be implemented without blocking use some thread pool independent of the capabilities or?

It seems that the degree of non-blocking in GHC runtime cannot be explained here concisely, so we need to find a link.

I would suggest removing the reference to aio because the Haskell runtime is already asynchronous under the hood

Ah, it's a common misconception about aio. aio is a set of faster syscalls designed for applications that need to scale well, not merely a variant of libuv where old syscalls are used in a threadpool. It's sort of further performance improvement of O_DIRECT:

https://www.kernel.org/doc/ols/2003/ols2003-pages-351-366.pdf

Unfortunately a definitive guide on aio is hard to find: most people don't understand storage or syscalls and just assume that it "just works". Here is a suggestion that it helped in case of mysql: https://lists.mysql.com/benchmarks/154 (but the link from there is broken). So you need to find a hardcore C storage guru to see if aio is beneficial for your application. I didn't test aio on linux, but on Windows a similar facility improves performance for long disk queues (and less than 2 outstanding IOs is generally a bad idea even for sequential access).

Haskell's concurrency is preemptive

I have always thought that sparks cannot be preempted in the middle. Can you find a reference?

Conceptually, every IO action is equivalent to what other languages would call a "promise"

I think green threads describe the spark scheduler better than the async monad.

Also it seems that we need a link that describes the spark concurrency model, as it's pretty unique native code green threads solution.

Gabriella439 commented 6 years ago

As for blocking all the threads - what happens if the number of green threads blocked by IO operations exceeds the number of RTS capabilities? Do calls that cannot be implemented without blocking use some thread pool independent of the capabilities or?

The following paper is essential reading for this discussion:

Mio: A High-Performance Multicore IO Manager for GHC

The reason I believe GHC can scale past 4 cores is in the abstract of that paper:

We also show that with Mio, McNettle (an SDN controller written in Haskell) can scale effectively to 40+ cores, reach a throughput of over 20 million new requests per second on a single machine, and hence become the fastest of all existing SDN controllers

The short answer to your question is that GHC can have many more blocking threads than RTS capabilities (i.e. you can have millions of threads blocking if you want), but the more detailed answer is to read the above paper.

Also, note that there appears to be a Haskell binding to aio, but I haven't tried it:

https://hackage.haskell.org/package/posix-realtime-0.0.0.4/docs/System-Posix-Realtime-Aio.html

When I say that Haskell concurrency is preemptive I mean that (A) you don't need to insert explicit yield points in userland code and (B) it's rare for a thread to not automatically yield (the only scenario I'm aware of is a CPU-intensive loop that never allocates memory). The required reading here is the Control.Concurrent module:

https://hackage.haskell.org/package/base/docs/Control-Concurrent.html

Another useful resource is:

Parallel and Concurrent Programming in Haskell

I also think it is important to distinguish between sparks and green threads. Those are two separate abstractions and RTS features. Conceptually, everything in Control.Concurrent (i.e. forkIO, killThread) is what people mean by concurrency and green threads in Haskell and everything in the parallel library (i.e. par or parList) is what people mean by parallelism and sparks. What I've been discussing so far is concurrency and green threads, not parallelism and sparks.

j6carey commented 6 years ago

"Performance seems to suffer as the number of GHC RTS capabilities increases past about 4" was based on experience with only a single program--hence the "seems to". It is a real program doing lots of real work, but still, other programs would probably scale differently.

Furthermore, when staying on a single NUMA node I have not (yet) seen more than a 13% throughput difference between 3 processes with 4 capabilities and 1 process with 12 capabilities. So if there is much difficulty in going to a multi-process scenario, or if it is difficult to balance the load evenly when doing so explicitly, then you might still be better off with a single process that can easily redirect capabilities to available work. Ideally one would experiment with the particular application in question to see what works best.

Regarding green threads and blocking: according to the documentation and what I have seen in practice, each "capability" has a pool of threads, most of which are dormant at any given time. When the active thread blocks in a system call, another thread takes over for that capability, so that the CPUs stay busy.

Now, if there is a way to use aio libraries to queue up more than one IO read at a time, so that when the first finishes the second starts without waiting for the kernel scheduler to schedule a user-level request for it, then that sounds rather interesting. Short of that, green threads should do everything, at least in principle.

nponeccop commented 6 years ago

Now, if there is a way to use aio libraries to queue up more than one IO read at a time, so that when the first finishes the second starts without waiting for the kernel scheduler to schedule a user-level request for it, then that sounds rather interesting. Short of that, green threads should do everything, at least in principle.

It's not kernel queue, but hardware queue in the disk controller - SAS TCQ or SATA NCQ. You can fill the hardware queue so the controller is always busy either with normal thread pool or with aio - it doesn't matter. The only difference is CPU utilization (and that aio is limited and provides neither caching nor full POSIX semantics). See http://dba-oracle.com/real_application_clusters_rac_grid/asynchronous.htm

And yes, TCQ/NCQ is indispensable both for SSD and HDD, for different reasons, and its job cannot be done by the kernel side IO scheduler.

I also think it is important to distinguish between sparks and green threads. Those are two separate abstractions and RTS features.

I didn't know that. So sparks (and parallel programming in general as opposed to mere IO concurrency) should be covered separately in the document.