Add Serialization + Persistence JMH Benchmark

original-brownbear commented 7 years ago

@colinsurprenant @jakelandis @suyograo @jordansissel

As discussed we should have a serialization + subsequent disk write benchmark suit.

The way I suggest setting this up is:

Create a few M events on heap and in an Event[] array (we shouldn't have any overhead from storing these, the more we can get on heap and actually run through in the benchmark itself, the better => array imo, we can do multiple arrays if we want to run multiple threads for input to the Queue)
Try to use events of realistic size and "shape". Our Apache log dataset seems like a good bet here.

This is basically just an extension of org.logstash.benchmark.QueueWriteBenchmark that runs off of an array of events :)

colinsurprenant commented 7 years ago

Good idea. I would also like to see this test signatures: write/read from same file (to mimic the faster consumers that producers where queue is in always-empty state)

The other possible test would be" write in one file and read from another file (to minic faster producers than consumers where queue is growing and piling pages) but not sure this is actually relevant since one page will test write-only throughput and the other read-only.

original-brownbear commented 7 years ago

@colinsurprenant @jakelandis @jordansissel the problem with throughput measurements is, that they are very implementation dependant.

I mean in theory (ignoring CPU) we're only bound by available IOPS/write bandwith and available memory.

So if I do a throughput benchmark it looks to me like I can assume that:

If I have unlimited memory throughput is as fast as just writing since I never have to deserialize (I can always simply keep the objects that I persist for safety reasons on heap as well and dequeue them once they are persisted to disk)

So the deserialization step will only occur as a result of using up too much memory, even at ack==1.

This means with strong implementations of mmap vs fwrite and limited memory, all we'd be benchmarking is whether or not mmap can translate using more memory in the form of page cache into faster writes (compared to a standard fwrite) while still maintaining a non-depleted buffer of deserialized objects in memory.

Looking at it for the 3 different queue states you get this:

Consumers are faster than producers => Queue is empty and deserialized objects can be kept on heap => Throughput == Write Rate since you never have any deserialization
Consumers and producers are about the same speed on average => deserialization cost becomes purely a function of how often your output buffer of deserialized objects on heap depletes
Producers are faster than consumers => you're effectively doomed in the long run :) Short term you'll be running into the speed of sequential read + deserialize here.

So for any I/O implementation to win out in a throughput benchmark it must either:

Physically write faster to improve upon the situation in the first state
Or consume less memory to allow for a larger buffer of objects to keep prevent the second state requiring deserialization at a lower frequency

-> there is no point in micro-benchmarking anything but the full end-to-end implementation when it comes to measuring throughput right? Just measuring sequential performance won't tell us much :(

In the end, if you see equal throughput for Channel and MMap, the implementation that requires less memory for its I/O layer will win out by being able to use that memory for queueing more objects on heap.

jakelandis commented 7 years ago

This means with strong implementations of mmap vs fwrite and limited memory, all we'd be benchmarking is whether or not mmap can translate using more memory in the form of page cache into faster writes (compared to a standard fwrite) while still maintaining a non-depleted buffer of deserialized objects in memory.

I think we are coming to the same conclusion as proposed in this comment. That we largely end up testing the hardware + kernel and thus the mmap vs. channel IO performance should not be used to drive the design decision.

The motivation for the hacky benchmark tool I wrote (which @jordansissel cleaned up/fixed/extended) was to try to quantify the impact this statement

The fact that we want to fsync many smaller writes changes the scenario a lot.

Perhaps that statement should be refined to "If we want to fsync many larger writes (over 1K), then the scenario a changes a lot" ? (based on some the kafka testing ?)

original-brownbear commented 7 years ago

@jakelandis

mmap vs. channel IO performance should not be used to drive the design decision.

I tend to agree at this point. I think the best move forward would be to do what for example Cassandra, SQLite, MemSQL and others did and make the File I/O pluggable to the extent that MMap and Channel I/O live behind the same API. Then people can simply choose what works for their hardware. At this point probably a more efficient use of our time to start refactoring in that direction instead of trying to benchmark the details further? :)

original-brownbear commented 7 years ago

Perhaps that statement should be refined to "If we want to fsync many larger writes (over 1K), then the scenario a changes a lot" ? (based on some the kafka testing ?)

I don't think this really changes anything besides the fact that it takes away the perceived advantage of having one copy fewer, which seems to be meaningless in this day and age if you look at benchmarks. Even Linus says so http://lkml.iu.edu/hypermail/linux/kernel/0802.0/1496.html :)

colinsurprenant commented 7 years ago

Is quoting Linus the technical version of Godwin's law 😂

So I get what you are saying @original-brownbear. At some point we did use such a caching scheme to keep persisted objects in memory to avoid deserialization and it is a totally valid avenue to explore in this design too. We did not get to that as we wanted to focus on the minimal design first, etc.

In the queue state you mention, having «consumers and producers are about the same speed on average» does not really happen in practice. I believe the practical queue states we see in LS are:

"always empty": consumers faster than producers
"increasing in size": consumers slower than producers
"decreasing in size" : consumers faster than producers after a period of "increasing in size"
"always full": consumers slower than producers + queue size limit is reached. Back pressure is being applied to inputs.

original-brownbear commented 7 years ago

@colinsurprenant

Is quoting Linus the technical version of Godwin's law

It does seem that way the closer you get to Kernel topics :P

consumers and producers are about the same speed on average» does not really happen in practice

well, in the long run, this is exactly what happens, whether we like it or not (unfortunately we don't :(). The question is just if it happens as a manifestation of "always empty" or "always full".

Imo "always full" is a broken state if it's not temporary right? I mean it literally means your system is in a state where it's worse off than had it not buffered, because now the throughput is lower than it would have been without buffering and the problem that buffering should have avoided, was simply delayed a little to come back worse than before later (since you're now blocking at the inputs again). I admit I don't have any solution to offer to this problem in a single node scenario though :(

That said temporarily "increasing in size" is what needs to be transparent from the output throughput perspective. As long as your disk persistence can deal with the input write rate you simply need a buffer of size:

S = LONGEST_EXPECTED_SYNC_DELAY * OUTPUT_RATE

and you're fine. I think bringing about a situation code-wise where this basic relation becomes explicit will make it very easy to judge the correct I/O strategy.

colinsurprenant commented 7 years ago

Imo "always full" is a broken state if it's not temporary right?

Well, it might be broken operationally speaking depending on your use-case but it is a totally valid state. Using our bounded in-memory queue is exactly what happen when consumers are slower than producers, it then just apply back pressure to the producers at the risk of loosing in-flight data. You'd hope this is a temporary situation but but it could be reasonable in other context. One could decide to use PQ to avoid that in-flight data loss and continue to rely on the back pressure behaviour.

original-brownbear commented 7 years ago

One could decide to use PQ to avoid that in-flight data loss and continue to rely on the back pressure behaviour.

But this is wrong if the queue stays full. Assuming the full queue is slower than the legacy (non-buffering queue), you're outright creating more pressure in your pipeline that propagates upstream. You're worse off than before from every angle:

Each new event is enqueued with a higher latency than in the legacy queue
You waste disk/memory and CPU
end-to-end latency goes up
throughput goes down

So assuming that this is a permanent state you will get persistence for the first k events at the expense of above drawbacks for all events that are enqueued after the buffer has filled up. I don't think that should be a valid state ever? (not that we can prevent it yet)

colinsurprenant commented 7 years ago

But this is wrong if the queue stays full

It is not wrong. It is a valid state we must support. The fact that the queue is full and stays full for a duration we don't control for use-cases we don't control is not something we are here to really judge but to make sure it is correctly supported and handled by LS.

All the drawback you are listing are pretty much irrelevant in the context of a system that for some reasons has slower outputs.

Again, having a system that has reached a queue-full situation is a totally valid state, like it or not. For how long and for what reasons this is happening is totally out of our control.

jordansissel commented 7 years ago

But this is wrong if the queue stays full.

This scenario (consumer perpetually slower than producer) is very similar to the way most large transfers over TCP end up behaving. When you download a large file, the throughput your connection gets is largely limited by your own connection (ISP, whatever), and not limited by the actual server sending you the file. This analogy with TCP is where the sliding window is nearly always closed and the connection nearly always is blocked for writing (full queue) and the reader nearly always has data available.

In this TCP analogy, it is not "wrong" to have a consumer that consumes slower than the producer can produce. Back for Logstash, it is not "wrong" to have a consumer (a downstream output, a filter, etc) that cannot process as fast as the produce is capable of producing.

original-brownbear commented 7 years ago

@jordansissel

Back for Logstash, it is not "wrong" to have a consumer (a downstream output, a filter, etc) that cannot process as fast as the produce is capable of producing.

Of cause, that's not wrong, but if this is a perpetual state then in that specific case PQ is actually 100% useless and counterproductive in the long run (after it fills up initially) because it actually increases the backpressure in your system and doesn't buffer anything that the legacy queue wouldn't have passed through quicker anyways.

The analogy would be buffering tcp reads to a buffer that reading bytes from is slower than reading bytes from the socket?

original-brownbear commented 7 years ago

@jordansissel or put differently, for a perpetually full queue, wouldn't the best possible optimization simply be to recognize that state, drain the queue and fall back to the SynchronousQueue? If not, why not?

original-brownbear commented 7 years ago

sorry for derailing this btw, we moved from benchmarking to queueing to this corner case pretty quickly :D

jordansissel commented 7 years ago

for a perpetually full queue, wouldn't the best possible optimization simply be ... fall back to the SynchronousQueue?

There are functional differences between the PQ and in-memory queue that are not about speed.

The old in-memory queue is not persisted and therefore, when Logstash or its host is restarted, any in-memory queued messages are lost basically forever (unless some replay mechanism is available. Replay capability of arbitrary input sources is uncommon).

PQ vs in-memory, as for what we promise to users, means exactly this and no more:

The PQ will not lose data upon temporary Logstash or machine failures.
The in-memory queue will absolutely lose its entire contents plus all in-flight (post-queue) events,

The contents of the queue, and the rate at which things enter and leave the queue, are unrelated to the durability promises we make when a user sets queue.type: persisted

As for a perpetually-full large queue, I don't expect that a perpetually-full queue should cause any kind of performance issue. My expectation here is the same for both in-memory and persisted. A large, perpetually-full, in-memory queue has the same characteristics as a large, perpetually-full persisted queue -- at least, how I am trying to think about it. The performance cost difference is just that one writes to disk and the other does not.

We have the same "perpetually full" synchronous queue (in memory) scenarios for many users today.

If anything, I wonder -- is your concern that a very-large-and-full queue will take a long time to drain? I believe this is up to the user to decide, and Logstash can provide feedback about fullness (as it does today regardless of queue.type with backpressure)

original-brownbear commented 7 years ago

@jordansissel

The old in-memory queue is not persisted and therefore, when Logstash or its host is restarted, any in-memory queued messages are lost basically forever

Urgh you're right we have the acknowledgment at the end, sorry missed that one :( nevermind this point then.

As for a perpetually-full large queue, I don't expect that a perpetually-full queue should cause any kind of performance issue.

I only saw the problem of basically losing the ability to handle input load spikes at this point and actually being worse of than without the persistence layer in this regard. But since you cleared up point 1 this seems like a more valid tradeoff than before now :)

That said, I'm not sure what I can benchmark here at this point. This is heavily dependent on what you would implement for the write-side, this goes way beyond Channel vs mmap and as we found out already, the current implementation doesn't really allow for porting it to Channel reads. I could benchmark the write side against that of https://github.com/original-brownbear/logstash/blob/fast-spilling-queue/logstash-core/src/main/java/org/logstash/persistedqueue/PersistedQueue.java here, but just benchmarking sequential serialize then mmap vs. sequential serialize and write to Channel seems very pointless?

elastic / logstash

Add Serialization + Persistence JMH Benchmark #7498