elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Persisted Queue Performance Design Issue (Writing Data) #7317

Closed: original-brownbear closed this issue 4 years ago

original-brownbear commented 7 years ago

The current implementation in master uses mmap to write to fixed-size pages that are created and written to sequentially. This introduces a lot of overhead for the VM system, requires hacking around the GC's Cleaner behavior for the mmapped buffers, and hence comes with unpredictable and suboptimal performance.

https://github.com/elastic/logstash/pull/7316 contains an extremely trivial implementation illustrating the problem.

It simply writes to the channel directly and avoids mmap for writes. The durability features of the approach are exactly the same as those in the master implementation, since MappedByteBuffer.force() is simply replaced by calling force() on the underlying FileChannel.
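
Roughly, the write path then looks like this (an illustrative sketch only, not the actual code from the linked PR; the class and file names are made up):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration: append serialized events to a page file and fsync,
// with no MappedByteBuffer involved.
public final class ChannelPageWriter implements AutoCloseable {
    private final FileChannel channel;

    public ChannelPageWriter(String path) throws IOException {
        this.channel = FileChannel.open(Paths.get(path),
            StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    public void write(byte[] serializedEvent) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(serializedEvent);
        while (buf.hasRemaining()) {
            channel.write(buf);   // plain sequential append, no mapped buffer
        }
        channel.force(false);     // durability point, analogous to MappedByteBuffer.force()
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        try (ChannelPageWriter writer = new ChannelPageWriter("page.0")) {
            writer.write("hello".getBytes(StandardCharsets.UTF_8));
        }
    }
}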

The branch passes all tests and the performance difference is about a factor of 3 (plus a lot more consistent throughput).

Before (master):

# Run progress: 0.00% complete, ETA 00:00:01
# Fork: 1 of 1
# Warmup Iteration   1: ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
78.110 ops/ms
# Warmup Iteration   2: 72.397 ops/ms
# Warmup Iteration   3: 54.313 ops/ms
Iteration   1: 42.278 ops/ms
Iteration   2: 71.426 ops/ms
Iteration   3: 66.277 ops/ms
Iteration   4: 55.886 ops/ms
Iteration   5: 45.107 ops/ms
Iteration   6: 57.468 ops/ms
Iteration   7: 59.576 ops/ms
Iteration   8: 54.282 ops/ms
Iteration   9: 45.083 ops/ms
Iteration  10: 50.365 ops/ms

Result "org.logstash.benchmark.QueueBenchmark.pushToPersistedQueue":
  54.775 ±(99.9%) 14.293 ops/ms [Average]
  (min, avg, max) = (42.278, 54.775, 71.426), stdev = 9.454
  CI (99.9%): [40.482, 69.068] (assumes normal distribution)

# Run complete. Total time: 00:01:57

Benchmark                             Mode  Cnt   Score    Error   Units
QueueBenchmark.pushToPersistedQueue  thrpt   10  54.775 ± 14.293  ops/ms

BUILD SUCCESSFUL

Total time: 2 mins 1.626 secs

After:

# Run progress: 0.00% complete, ETA 00:00:01
# Fork: 1 of 1
# Warmup Iteration   1: ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
156.287 ops/ms
# Warmup Iteration   2: 179.250 ops/ms
# Warmup Iteration   3: 177.683 ops/ms
Iteration   1: 188.793 ops/ms
Iteration   2: 166.776 ops/ms
Iteration   3: 187.501 ops/ms
Iteration   4: 189.186 ops/ms
Iteration   5: 188.416 ops/ms
Iteration   6: 188.693 ops/ms
Iteration   7: 186.178 ops/ms
Iteration   8: 188.649 ops/ms
Iteration   9: 187.341 ops/ms
Iteration  10: 186.805 ops/ms

Result "org.logstash.benchmark.QueueBenchmark.pushToPersistedQueue":
  185.834 ±(99.9%) 10.230 ops/ms [Average]
  (min, avg, max) = (166.776, 185.834, 189.186), stdev = 6.767
  CI (99.9%): [175.604, 196.064] (assumes normal distribution)

# Run complete. Total time: 00:00:36

Benchmark                             Mode  Cnt    Score    Error   Units
QueueBenchmark.pushToPersistedQueue  thrpt   10  185.834 ± 10.230  ops/ms

BUILD SUCCESSFUL

Total time: 40.806 secs

There's probably another speedup of roughly 50% achievable with almost no effort by buffering writes in line with the durability settings instead of flushing the buffer on every write, but I didn't want to invest more than a few minutes just yet before we had a chance to talk about things :) Also note that the current implementation already maxes out at about twice the above throughput if no physical writes are made to the channel (just comment out the actual write), so we need to optimize the surroundings before chasing more than this imo; we're far from I/O bound here even with the linked PR.
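
A rough sketch of what that buffering could look like (hypothetical illustration only, not what the linked PR does; it assumes a single event always fits into the buffer):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Rough sketch: collect small writes in a direct buffer and only write + fsync when
// the buffer fills up or when the (hypothetical) durability setting says we must.
final class BufferedPageWriter {
    private final FileChannel channel;
    private final ByteBuffer buffer;
    private final int flushEveryNEvents;   // hypothetical durability knob
    private int pendingEvents = 0;

    BufferedPageWriter(FileChannel channel, int bufferSize, int flushEveryNEvents) {
        this.channel = channel;
        this.buffer = ByteBuffer.allocateDirect(bufferSize);
        this.flushEveryNEvents = flushEveryNEvents;
    }

    void write(byte[] serializedEvent) throws IOException {
        if (buffer.remaining() < serializedEvent.length) {
            flush();   // make room; assumes serializedEvent.length <= bufferSize
        }
        buffer.put(serializedEvent);
        if (++pendingEvents >= flushEveryNEvents) {
            flush();
        }
    }

    void flush() throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        buffer.clear();
        channel.force(false);   // one fsync per batch instead of one per event
        pendingEvents = 0;
    }
}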

So for me, the approach here seems straightforward:

What do you guys think?

original-brownbear commented 7 years ago

@suyograo @jordansissel @colinsurprenant FYI

colinsurprenant commented 7 years ago

Thanks @original-brownbear - nice work! This looks promising!

original-brownbear commented 7 years ago

@colinsurprenant thanks :)

I'd like to see the behavior in a read/write situation on the same page.

Yeah, we should have this for sure. I'll try to come up with a JMH benchmark for that; any input is welcome here ... not sure what a realistic scenario would be. Just two threads on one Queue, one writing in a loop and the other trying to drain?
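
Shape-wise, maybe something like this (a rough sketch only; the BlockingQueue here is just a stand-in for the acked queue so the sketch stays self-contained, a real benchmark would open the actual Queue in @Setup instead):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Group;
import org.openjdk.jmh.annotations.GroupThreads;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Two-thread read/write benchmark sketch: one thread writes in a loop while the
// other drains. JMH pairs the two @Group methods onto separate threads.
@State(Scope.Group)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class ReadWriteQueueBenchmark {

    private BlockingQueue<byte[]> queue;
    private byte[] event;

    @Setup
    public void setUp() {
        queue = new ArrayBlockingQueue<>(10_000);
        event = new byte[1024];
    }

    @Benchmark
    @Group("readWrite")
    @GroupThreads(1)
    public boolean write() throws InterruptedException {
        // bounded offer so the writer never blocks forever at iteration boundaries
        return queue.offer(event, 10, TimeUnit.MILLISECONDS);
    }

    @Benchmark
    @Group("readWrite")
    @GroupThreads(1)
    public byte[] read() throws InterruptedException {
        // bounded poll for the same reason; JMH consumes the returned value
        return queue.poll(10, TimeUnit.MILLISECONDS);
    }
}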

I believe we will lose the process crash durability characteristic of using mmapping?

Phew :D "it depends" :) ... on how we set up our FileChannel. If we get it from a RandomAccessFile which has no JVM side write cache, we're in the same spot we're in for mmap I believe. If we use other FileChannel implementations we'll see different behavior if they do JVM side caching (or we do it manually, which would be a huge speedup potentially).

colinsurprenant commented 7 years ago

Just two threads on one Queue one writing in a loop and the other trying to drain?

Yeah, we can start from there.

If we get it from a RandomAccessFile which has no JVM side write cache, we're in the same spot we're in for mmap I believe.

That is a good question, deserves validation. Once we correctly assess this we will be in better shape to decide which strategy is better WRT performance vs durability.

colinsurprenant commented 7 years ago

Also could you provide the information about the hardware + OS environment this was run on so we can compare? If it wasn't Linux, it should also be run on Linux.

original-brownbear commented 7 years ago

@colinsurprenant so the above was OSX, latest MBP, 2.9 GHz i7.

Then I tried Linux just now ... but Linux sucks:

Before (master) and after are kinda the same and pretty slow:

# Run progress: 0.00% complete, ETA 00:00:03
# Fork: 1 of 1
# Warmup Iteration   1: ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
66.763 ops/ms
# Warmup Iteration   2: 92.559 ops/ms
# Warmup Iteration   3: 97.081 ops/ms
Iteration   1: 89.533 ops/ms
Iteration   2: 89.650 ops/ms
Iteration   3: 88.168 ops/ms
Iteration   4: 90.817 ops/ms
Iteration   5: 92.475 ops/ms
Iteration   6: 90.764 ops/ms
Iteration   7: 92.728 ops/ms
Iteration   8: 98.508 ops/ms
Iteration   9: 89.712 ops/ms
Iteration  10: 80.121 ops/ms
Iteration  11: 88.515 ops/ms
Iteration  12: 87.272 ops/ms
Iteration  13: 87.973 ops/ms
Iteration  14: 88.658 ops/ms
Iteration  15: 90.905 ops/ms
Iteration  16: 90.466 ops/ms
Iteration  17: 85.875 ops/ms
Iteration  18: 33.412 ops/ms
Iteration  19: 34.286 ops/ms
Iteration  20: 34.756 ops/ms
Iteration  21: 34.746 ops/ms
Iteration  22:  93% EXECUTING
<============-> 93% EXECUTING
> :logstash-core-benchmarks:jmh

=> fsync on a default Debian ext4 just sucks ... seems others arrived at the same conclusion:

http://www.ilsistemista.net/index.php/linux-a-unix/6-linux-filesystems-benchmarked-ext3-vs-ext4-vs-xfs-vs-btrfs.html?start=4

Will research this more tomorrow :)

original-brownbear commented 7 years ago

OK, toyed a little more with this; I'll just give the cliff notes and will add whatever data people are interested in, so as not to make this too spammy :)

I would draw these conclusions (assuming I didn't make a mistake in the above):

original-brownbear commented 7 years ago

Just to illustrate how much room we have for optimizing the serialization :) This is the heap when running the benchmark:

(screenshot of the heap while running the benchmark, taken 2017-06-06 09:11:42)
colinsurprenant commented 7 years ago

It would be interesting to see the relation between:

- page size and OS memory for mmap IO.
- page size and JVM heap size for channel IO.

In particular, knowing that mmap IO uses off-heap memory, can we see a page-size-to-free-memory ratio where it performs better/worse? Same idea with channel IO: does the heap size have an impact?

Serialization is the elephant in the room for performance impact, this is known. (cc @guyboertje)

Let's not lose sight of the bigger picture in terms of performance. Obviously, reducing or disabling fsync and/or setting barrier=0 would improve throughput significantly for raw write performance, but if durability is not important then users would be better off using in-memory queuing. So we have to keep in mind that durability is an important property of PQ (but a tuneable one, between max durability and lesser durability depending on your requirements/tolerance).

Buffer writes to the Channel in accordance with ack settings

This is a good idea, but it is again a balancing act between throughput and durability. An interesting aspect of mmap IO is that as soon as your bytes are written to the buffer, they will make it to disk regardless of a process crash. Buffering writes at the app level for channel IO creates one more potential data-loss layer upon process crash. I'd rather put more effort toward having proper batch writes and end-to-end batch support, which mitigates the need for extra buffering while improving throughput.

colinsurprenant commented 7 years ago

Also, did you compare throughput between SSD and spinning disks?

jakelandis commented 7 years ago

Optimize for ext4

We should also validate on XFS, since it is the default for RHEL-7

original-brownbear commented 7 years ago

@colinsurprenant

page size and OS memory for mmap IO.

In the current implementation, we're calling .load on the buffer so we need at least page-size bytes of off heap memory. Removing the .load calls makes things less deterministic, but assuming the case of writing a new page and filling it (while the page size fits in memory), the OS and the JVM will do their best to still put the whole page in off-heap memory.
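
i.e. roughly this pattern (simplified, and the page size value is just an example, not the actual default):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Simplified version of the mapping pattern described above: map a whole page and
// load() it, which touches every OS page so the full queue page ends up resident
// in off-heap memory.
public class MapAndLoad {
    public static void main(String[] args) throws Exception {
        long pageSize = 250L * 1024 * 1024;   // example page size only
        try (RandomAccessFile raf = new RandomAccessFile("page.0", "rw");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer page =
                channel.map(FileChannel.MapMode.READ_WRITE, 0, pageSize);
            page.load();   // forces the mapped region into physical memory
        }
    }
}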

page size and JVM heap size for channel IO.

There is no relation. The buffer size is constant in this scenario and whether you append to a 1TB or 100MB file makes no difference for a channel.

Both should not be affected by heap-size considerations at all (in terms of just pure I/O). Neither implementation uses on-heap buffers.

Also, did you compare throughput between SSD and spinning disks?

Phew, I can try and dig out one of those :D But it's probably not even worth the time: obviously the spinning disk will be slower with mmap than with append writes, since seeks are so expensive on a spinning disk and mmap is known to trigger "unnecessary" seeks (not that I could create a better mmap, it just comes with the territory of random access).

colinsurprenant commented 7 years ago

@original-brownbear what actually interests me is the page fault cost relative to the amount of free memory and the page size. If the amount of free memory is significantly larger than the page size, I am guessing that page faults will not be very costly overall.

original-brownbear commented 7 years ago

@colinsurprenant

If the amount of free memory is significantly larger than the page size, I am guessing that page faults will not be very costly overall.

Sure, in practice this statement holds up to a certain throughput x and queue size y, above which you start putting pressure on your OS's paging logic. Obviously, x and y will be to a degree dependent on the JVM's consumption of overall available memory, but how does that relate to designing the queue?

colinsurprenant commented 7 years ago

@original-brownbear not sure I understand your previous comment. What I am trying to say is that with the current mmap implementation, a maximum of two pages can be mmapped/loaded simultaneously. For a page size P, a maximum of 2P of off-heap memory will be required to hold the two pages' byte buffers. If the amount of available off-heap memory is somewhat larger than 2P, then the amount of page faults should be minimal. I am wondering what that "somewhat larger" factor or ratio of off-heap memory relative to page size is, and whether there is a rule of thumb for the ratio below which page faults start to have a significant performance impact. Or put another way, if the amount of off-heap memory is a lot bigger than a page size, say 10x or 100x, can we assume that page faults will not have any negative performance impact?

colinsurprenant commented 7 years ago

So, I believe this is something we should investigate and also take into account when benchmarking.

original-brownbear commented 7 years ago

@colinsurprenant

If the amount of available off-heap memory is somewhat larger than 2P, then the amount of page faults should be minimal

Not really; evicting the page cache is not free at all and has latency to it. The exact semantics of how the amount of free memory influences the throughput are strongly influenced by the environment. I don't think this is all that relevant, fortunately. We should simply try to get the code to be physically I/O bound in a microbenchmark and then minimize its memory footprint; that's all we can do in terms of moving towards a defined target.

Put differently: The amount of free memory shouldn't have a significant impact (unless we're running so low that the system is impacted) on write performance since we're fsyncing anyways.

colinsurprenant commented 7 years ago

I wonder what this "so low" actually is. For example, if a very large page size is configured (say in GB) and the off-heap memory is small, what will the performance impact be versus a smaller page size or more free memory? At which point will there be a significant performance impact?

original-brownbear commented 7 years ago

@colinsurprenant for FileChannel IO there is no difference between reading from or writing to a small and a large file (assuming the "small" file is still at least a few MB in size). The file size is completely irrelevant to raw I/O performance in this case.

colinsurprenant commented 7 years ago

Oh, I was talking about mmap IO.

original-brownbear commented 7 years ago

@colinsurprenant for mmap this is simply a question of what write throughput (i.e. rate of allocation of virtual memory and all the related cleanups) your system can handle. The slower your system (seek + write, and to a lesser degree CPU resources), the more RAM you'll need to keep mmap from slowing down on input spikes. High load alone leading to a slowdown (i.e. running into #7350) cannot be handled by more RAM, since the slowdown is caused by the system not cleaning up vmem/memory quickly enough. More RAM just delays the slowdown.

colinsurprenant commented 7 years ago

@original-brownbear I am sure this can be measured, i.e. at which ratio of free memory to page size mmap starts underperforming. The cause of #7350 is not yet known; it might not be related.

original-brownbear commented 7 years ago

@colinsurprenant

The cause of #7350 is not yet known, it might not be related.

I explicitly reported #7350 behavior in https://github.com/elastic/logstash/issues/7162#issuecomment-302796372 and https://github.com/elastic/logstash/issues/7162#issue-229907910 ?

colinsurprenant commented 7 years ago

@original-brownbear At this point we cannot tell if it is related or not. Premature assumption.

original-brownbear commented 7 years ago

@colinsurprenant but https://github.com/elastic/logstash/issues/7317#issue-233696479 also shows that the drop in performance goes away with Channel writes? How could this possibly not be related? The only logical way this could be is that we have a second performance issue in the code up-stack from the Queue that comes with the exact same performance characteristics. But since writing to a Channel removes the issue, this is also impossible?

colinsurprenant commented 7 years ago

@original-brownbear the execution context in #7317 is completely different from #7350. If the problem ends up being related, it is all for the better and we will have more facts to help move forward with some refactoring.

original-brownbear commented 7 years ago

@colinsurprenant what facts?

So the only way this (mmap) is not the cause of #7350 would be that there is another bug that has the exact same characteristics further up-stack. This seems extremely unlikely, especially considering that the ack logic isn't exposed further up the stack as far as I can see?

Moreover, due to the behavior shown in the Channel PR, the situation is even worse: since that bug is there, even if you fixed whatever up-stack issue there was ... you wouldn't see a qualitative change in behavior (depending on the hypothetical scenario here ... you may see a quantitative one). While if you fixed the mmap issue, existing benchmarks would prove beyond the shadow of a doubt that there is a remaining issue up-stack (but since using the in-memory queue doesn't show the problem, we actually already have the conclusion here anyways ???).

colinsurprenant commented 7 years ago

Thanks @original-brownbear for your input, I'll keep that in mind. I'll follow up when I find out more.

jordansissel commented 7 years ago

The mmap strategy should only be dirtying memory pages that are modified, and our queue page files are append-only, so the only dirty pages that should ever exist in a page are new-data appends. Memory pages that have not been accessed recently will be evicted by the OS to make room for more important things than idle memory pages. Hand-waving, my rough understanding is that the only live memory pages in our queue will be the ones that were recently dirtied.

Focusing on FileChannel, I'm trying to come up with pros/cons that aren't speed-related:

original-brownbear commented 7 years ago

@jordansissel

MMap writes have no feedback and if they fail we probably won't know until later?

Well, on force() you get immediate feedback, but in many cases it's a segfault of the JVM :( A standard put into the buffer doesn't really give feedback though, true.

(Though I confess page files this large may have no benefit)

Depends on the system I guess. Probably not, the benefit is that you save fds at these scales I think (unless there is a fs out there that really sucks at creating files and that I don't know about). If you have 1TB+ of data then yes, for what we expect (at least for now :P) likely no :)

original-brownbear commented 7 years ago

User seemingly affected directly by the allocation issue: https://discuss.elastic.co/t/logstash-not-starting-with-cannot-allocate-memory-error/88558

jakelandis commented 7 years ago

I have been trying to follow this discussion on write performance of mmap vs. channel io for the Persistent Queue page and wanted to validate my understanding of how things work.

mmap

The write path for mmap files uses a MappedByteBuffer obtained via FileChannel.map. The MappedByteBuffer is a direct buffer that is mapped directly to the page cache via the OS's virtual memory (VM), so we have direct access to the page cache but must pay the cost of a soft/minor page fault every 4k (or whatever the page size is set to) for the MMU to correctly map to physical memory.

So mmap is (create) File -> (update) Buffer -> (sync) Disk

The sync to disk can be explicit or implicit.

Channel IO

The write path for Channel IO is similar, except that it uses another buffer first. That buffer can be on- or off-heap, but preferably off-heap. The allocation of the off-heap buffer works similarly to malloc, in that it is treated as a standard memory allocation with a contiguous memory block and thus does not require continual soft/minor page faults to keep writing. Instead of writing directly to the file's page cache, we first write to this buffer and then copy its contents into the page cache buffer.

So Channel IO is (on or off heap) Buffer -> (create) File -> (copy from first buffer) Buffer -> (sync) Disk

The sync can be explicit or implicit, but does have the additional step of needing to copy buffers first.
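
In code, the two paths being compared look roughly like this (simplified sketch; it assumes the channel was opened for both reading and writing):

import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Simplified sketch of the two write paths.
public class WritePaths {

    // mmap: bytes go straight into the mapped region (page cache); sync via force().
    static void mmapWrite(FileChannel channel, byte[] data, long pageSize) throws Exception {
        MappedByteBuffer page = channel.map(FileChannel.MapMode.READ_WRITE, 0, pageSize);
        page.put(data);        // write into the page cache through the mapping
        page.force();          // explicit sync to disk
    }

    // Channel IO: bytes land in an intermediate (here: direct) buffer first, then
    // write() copies them into the file's page cache; sync via force().
    static void channelWrite(FileChannel channel, byte[] data) throws Exception {
        ByteBuffer buffer = ByteBuffer.allocateDirect(data.length);
        buffer.put(data);
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);   // copy from the buffer into the page cache
        }
        channel.force(false);        // explicit sync to disk
    }
}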

Summary

So w/r/t to write performance of mmap vs. Channel IO, it comes down to :

Cost associated with: [Double Buffer + Memory Copy] vs. [Single Buffer + MMU soft page faults]

There are certainly other factors w/r/t write performance in play ... but I believe that this is the heart of the difference. If I am correct, then won't any benchmarks really just be measuring differences in hardware speed and kernel versions? On platform X, is the MMU VM allocation faster than malloc memory allocation + copy for a given page size, or vice versa? I assume that this may be the reason the Mac-based tests were so fast relative to the Linux-based tests (was it done on different hardware?). It would seem that all benchmark attempts will be highly dependent (more so than normal) on the platform (hardware and software) on which they are run.

It seems the answer to which is faster may be "it depends". I think the largest factors for performance here are the other noted performance issues, in particular the READ speed (for which pre-mapped mmap seems like a good fit).

I am not an expert in this space, but unless I am missing something, I don't think that the write speed of mmap vs. channel IO should drive the overall design decision, since the benchmarking will be fundamentally tied to the hardware on which it is run.

Also, of side note....

One minor optimization for the mmap files may be to try to increase the page size, though early testing shows only a very marginal increase in performance (likely within the realm of statistical noise).

original-brownbear commented 7 years ago

@jakelandis :) let me try and address a few points here:

If I am correct, then won't any benchmarks really be measuring the differences in hardware speed and kernel versions.

Given an optimal implementation in C yes, results will be very close between fwrite and mmap implementations and basically just measuring the physical drive throughput (assuming all you care about is writing a single large chunk of data, we don't do that, we write many small chunks sequentially and fsync in between => very different scenario, see below). The fact that we want to fsync many smaller writes changes the scenario a lot.

Now, mmap forces synchronizing the whole file (that doesn't mean all sectors really have to be written, but there is overhead in figuring out what to write) even though only a small section of its data was changed between synchronizations. We know that only the tail of the file changed; the kernel can't know that. If you do append writes from a FileChannel though, the kernel does know that, and the smaller area that needs to be synchronized makes fsync calls a whole lot cheaper. This fact actually led to the benchmark result that,

One minor optimization for the mmap files may be to try to increase the page size

... the exact opposite is true :( Larger pages make the fsync a ton more expensive, because you increase the size of the memory area that needs to be needlessly synchronized with the disk.

I don't think that the write speed of mmap vs. channel IO should drive the over all design decision since the bench-marking will be fundamentally tied to the hardware on which it is run.

Assuming the fsync problem weren't around, I would agree here. The fsync requirement, though, actually creates a situation where the I/O speed is bound by the design decision made, at least on systems that respect fsync. AWS instances (ephemeral SSD and EBS at least) don't, for example. Neither do many VM providers (there are actually blog posts out there that claim VM disk access is faster than direct access, for magical reasons :D) or ext3/4 setups if you set barrier=0, and so on :) What we should optimize for is the resource consumption of the writes; here mmap as implemented by the JVM (in the JVM's defense, it actually does offer "mmap" on systems that don't have that call natively) comes with a lot of overhead compared to the outright system call from C (see the fact that we have to manually force the Cleaner etc. here). Hope this clears up some of this stuff and the way the discussion has gone so far :)

jakelandis commented 7 years ago

@original-brownbear - Thanks for the explanation, that makes much more sense.

I wrote a little application to try to quantify the cost of fsync for mmap vs. channel IO for sequential IO. (I think the code is illustrative of the issue here; if not, please let me know.)

https://github.com/jakelandis/fsync_compare

You can run it yourself (or build from source):

cd /tmp
wget https://github.com/jakelandis/fsync_compare/raw/master/bin/fsync_compare-0.1-SNAPSHOT.jar
java -Xmx1024m -jar fsync_compare-0.1-SNAPSHOT.jar 

I am getting really odd results depending on the system I am running on:

Linux / Fedora - kernel 4.10.14-200.fc25.x86_64

Channel Test
==============
Creating a 1GB tempfile : /tmp/channel273162193519157734.tmp
....
# of Writes : 3486175
All total time             : 4230738593 ns
Fsync total time           : 4172383 ns
Fsync Percent of total time : 0.09862067599504841 %
MMap Test
==============
Creating a 1GB tempfile : /tmp/mmap6091803940181452000.tmp
....
# of Writes : 3486175
All total time             : 422384143 ns
Fsync total time           : 3389767 ns
Fsync Percent of total time : 0.8025317844377506 %

Macbook Pro 2016 / Sierra

Channel Test
==============
Creating a 1GB tempfile : /var/folders/x4/76h2pqn96jl492qycz3fz7q00000gn/T/channel3614441115713821871.tmp
.....
# of Writes : 3486175
All total time             : 6854179446 ns
Fsync total time           : 942155809 ns
Fsync Percent of total time : 13.745712618449588 %
MMap Test
==============
Creating a 1GB tempfile : /var/folders/x4/76h2pqn96jl492qycz3fz7q00000gn/T/mmap2460314887547029005.tmp
....
# of Writes : 3486175
All total time             : 308262885215 ns
Fsync total time           : 307584243992 ns
Fsync Percent of total time : 99.77984984389974 %

The difference could be the mount options for /tmp on Linux vs. /var on mac (though I didn't explicitly see it in fstab).

Not necessarily trying to accomplish anything specific here except to understand the performance differences.

jakelandis commented 7 years ago

Ahh.. the issue was the /tmp directory on Linux ... mapping the queue file directly to a regular directory (slightly old SSD) gets similar results to the Mac

Channel Test
==============
Creating a 1GB tempfile : /apps/tmp1
...
# of Writes : 3486175
All total time             : 16385347605 ns
Fsync total time           : 8102758793 ns
Fsync Percent of total time : 49.45124747019366 %
MMap Test
==============
Creating a 1GB tempfile : /apps/tmp2
....
# of Writes : 3486175
All total time             : 8311587013 ns
Fsync total time           : 7535912499 ns
Fsync Percent of total time : 90.6675522642453 %
original-brownbear commented 7 years ago

@jakelandis thanks for this benchmark, the difference between tmpfs and a persistent partition is pretty instructive here imo :)

jakelandis commented 7 years ago

What's interesting is that on Linux while a much higher percentage (40% more) of the time is spent on Fsync via MMap writes, but the overall time and the total Fsync time is actually lower.

Channel IO = 16.38s, MMap IO = 8.31s; 0.5s less time in fsync for MMap IO.

I would encourage others to check my benchmark code for logic errors or errors that may skew the results, and to run on newer hardware [1].

To avoid the tmpfs skew change the temp directory like this:

java -Xmx1024m -Djava.io.tmpdir=/apps -jar fsync_compare-0.1-SNAPSHOT.jar 

[1] Test run on an 2007 8 core/32GB MacPro 1,1 with Xeon 2.66 running Fedora 25

mount/fs:

/dev/disk/by-uuid/626fca5e-587e-4c9a-8807-20b3d5ce7f59 /apps auto nosuid,nodev,nofail,x-gvfs-show,x-gvfs-name=ssd 0 0
/dev/sdc1               ext4      220G  307M  208G   1% /apps
original-brownbear commented 7 years ago

@jakelandis I think the reason for this is the file system capability of "sparse allocation", which OSX doesn't have. => OSX has to actually allocate the disk space prior to the run; Linux doesn't have to because it has sparse allocation. A little short on time right now, but Google can easily fill in the blanks I just left here, I think :)

jordansissel commented 7 years ago

Now mmap forces synchronizing the whole file (doesn't mean you have to really write all sectors, but there is overhead in figuring out what to write) even though only a small section of its data was changed between synchronizations.

Can you cite where this "overhead in figuring out" happens? I'm not disputing it, I'm trying to understand more about what exactly happens.

I tried to look at msync vs fdatasync on Linux. Both the msync and fdatasync manpages claim to flush "all modified in-core data". I went walking in the Linux kernel code to see what kind of work is done but got lost in a maze of method calls -- I may revisit later. Based on my reading, it seems like both msync and fdatasync have to do the same basic work (iterate over dirty pages + flush to disk), but it sounds like, from your description, mmap has to iterate all pages to even find which ones are dirty?

I'm interested in studying, so I will see about writing a test that compares mmap+force vs filechannel+fsync

jakelandis commented 7 years ago

What's interesting is that on Linux while a much higher percentage (40% more) of the time is spent on Fsync via MMap writes, but the overall time and the total Fsync time is actually lower.

@original-brownbear - Sorry, what I meant was:

What's interesting is that on Linux, when comparing Channel IO vs. MMap, a much higher percentage (40% more) of the time is spent on fsync with MMap writes, yet the overall time and the total fsync time are actually lower.

I am not concerned about OSX/MacOs.

(no one cares about OSX/HFS in production) agreed

jakelandis commented 7 years ago

@colinsurprenant

Also, did you compare throughput between SSD and spinning disks?

Same machine as above but with (software) RAID 1 spinny disks.

Channel Test
==============
Creating a 1GB tempfile : /storage/channel3898164774960818867.tmp
# of Writes : 3486175
All total time             : 228764226834 ns
Fsync total time           : 220368264387 ns
Fsync Percent of total time : 96.32986216280554 %
MMap Test
==============
Creating a 1GB tempfile : /storage/mmap8942863789224568451.tmp
# of Writes : 3486175
All total time             : 255131738496 ns
Fsync total time           : 254270271278 ns
Fsync Percent of total time : 99.66234415871645 %

Here Channel IO wins.

original-brownbear commented 7 years ago

@jordansissel

but it sounds like from your description, mmap has to iterate all pages to even find which ones are dirty?

Jup, I can add a link to source code later once I have more time (ping me if you still care and I forget), but if you think about it, this is actually self-evident and must be true for all systems because:

Assuming I mmap some file: now I modify the buffer but simultaneously modify the physical file as well (just open another fd on it) and then msync the buffer. The modifications done via the second fd need to be overwritten, but the only way this can happen is if the msync looks for dirty pages again :) Since the file is not synced back to the mapped buffer in any deterministic way, it simply doesn't suffice for the buffer to just track which of its memory addresses it altered; it needs to look back into the fs. Hope this makes sense, otherwise better explanation later :)
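
For illustration, the scenario I mean is roughly this (standalone sketch, not code from the Logstash code base):

import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// The same file is modified both through a mapping and through a second channel
// (a second fd), and then the mapping is force()d: the kernel, not the buffer,
// has to work out which pages are actually dirty.
public class TwoWritersOneFile {
    public static void main(String[] args) throws Exception {
        try (FileChannel mappedChannel = FileChannel.open(Paths.get("data.bin"),
                 StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
             FileChannel otherChannel = FileChannel.open(Paths.get("data.bin"),
                 StandardOpenOption.WRITE)) {

            MappedByteBuffer mapped =
                mappedChannel.map(FileChannel.MapMode.READ_WRITE, 0, 8192);
            mapped.put(0, (byte) 1);                                    // modify via the mapping

            otherChannel.write(ByteBuffer.wrap(new byte[]{2}), 4096);   // modify via the second fd

            mapped.force();   // msync: the kernel figures out what needs to hit disk
        }
    }
}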

jakelandis commented 7 years ago

@jordansissel

I'm interested in studying, so I will see about writing a test that compares mmap+force vs filechannel+fsync

That is what fsync_compare does, though it may be poorly named and is pretty limited in scope w/r/t permutations of test conditions.

jordansissel commented 7 years ago

@jakelandis I ran your test on my laptop (linux vm) with two modifications:

1) Set the java tmpdir to somewhere that hits disk (/tmp is hosted in RAM with tmpfs on this machine)
2) Comment out the "." printing to clean up the output for my eyes (should not impact the results)

% java -Djava.io.tmpdir=$HOME/build/tmp/ -jar target/fsync_compare-0.1-SNAPSHOT.jar
Channel Test
==============
Creating a 1GB tempfile : /home/jls/build/tmp/channel7259058140351604702.tmp

# of Writes : 3486175
All total time             : 39483991393 ns
Fsync total time           : 35971409167 ns
Fsync Percent of total time : 91.10378129951995 %
MMap Test
==============
Creating a 1GB tempfile : /home/jls/build/tmp/mmap1796784660482438602.tmp

# of Writes : 3486175
All total time             : 40406788454 ns
Fsync total time           : 39461419475 ns
Fsync Percent of total time : 97.66037090506158 %

The resulting execution time is nearly identical (40.4 seconds MMap, 39.4 seconds FileChannel), so it is likely the "fsync percent total" (97% for MMap, 91% for FileChannel) is obscuring something (for example, a full buffer will cause a FileChannel.write() to block where an MMap.put will never block)

jordansissel commented 7 years ago

Running perf record (and later, perf report) on both @jakelandis' mmap and channel tests --

% sudo perf record -e 'block:*' -o mmap.perf java -Djava.io.tmpdir=$HOME/build/tmp -jar target/fsync_compare-0.1-SNAPSHOT.jar mmap
% sudo perf record -e 'block:*' -o channel.perf java -Djava.io.tmpdir=$HOME/build/tmp -jar target/fsync_compare-0.1-SNAPSHOT.jar channel
perf event                   mmap   channel
block:block_touch_buffer     0      0
block:block_dirty_buffer     263K   3M
block:block_rq_abort         0      0
block:block_rq_requeue       0      0
block:block_rq_complete      44     44
block:block_rq_insert        8K     6K
block:block_rq_issue         8K     6K
block:block_bio_bounce       0      0
block:block_bio_complete     63     12
block:block_bio_backmerge    3K     7
block:block_bio_frontmerge   0      0
block:block_bio_queue        23K    13K
block:block_getrq            8K     6K
block:block_sleeprq          0      0
block:block_plug             6K     6K
block:block_unplug           6K     6K
block:block_split            0      0
block:block_bio_remap        20K    10K
block:block_rq_remap         0      0
jordansissel commented 7 years ago

perf trace -s for channel vs mmap in @jakelandis's test:

fdatasync() vs msync() performance is practically identical. Raw data in the table below. I ran this test in a Fedora 23 VM that uses xfs for the filesystem.

operation             minimum    average     max         stddev
fdatasync (channel)   7.103ms    11.112ms    83.761ms    0.56%
msync (mmap)          7.081ms    11.454ms    78.684ms    0.59%

Left side is FileChannel. Right side is MMap -- perf trace -s output only showing the main java thread (I think).

Notably, the test I am running assumes one write == one event (filechannel). In some future where we support bulk writes into the queue, one FileChannel.write() is likely to cover many (10-1000) events, which would significantly reduce the number of write() syscalls. (A rough sketch of what such batched writes could look like follows the table below.)

 java (8811), 20988182 events, 99.7%                                                       | java (8746), 14013259 events, 99.7%
                                                                                           |
   syscall            calls    total       min       avg       max      stddev             |   syscall            calls    total       min       avg       max      stddev
                               (msec)    (msec)    (msec)    (msec)        (%)             |                               (msec)    (msec)    (msec)    (msec)        (%)
   --------------- -------- --------- --------- --------- ---------     ------             |   --------------- -------- --------- --------- --------- ---------     ------
   clock_gettime    7000707 59935.226     0.000     0.009    13.935      0.16%             |   clock_gettime    6999698 68895.173     0.000     0.010    45.412      0.22%
   fdatasync           3404 37823.723     7.103    11.112    83.761      0.56%             |   msync               3404 38988.243     7.081    11.454    78.684      0.59%
   write            3486210 34226.998     0.000     0.010    11.871      0.24%             |   read                1304    14.945     0.005     0.011     0.212      2.86%
   read                1406    11.675     0.005     0.008     0.623      5.51%             |   lseek               1244    14.412     0.005     0.012     0.467      4.87%
   futex                182    10.018     0.005     0.055     4.108     43.47%             |   futex                145     4.094     0.005     0.028     0.276     17.06%
   lseek               1330     9.382     0.005     0.007     0.247      3.07%             |   mprotect             126     1.876     0.006     0.015     0.151      8.53%
   gettimeofday         185     1.767     0.005     0.010     0.056      4.61%             |   gettimeofday         152     1.689     0.005     0.011     0.146      9.33%
   mprotect             124     1.450     0.005     0.012     0.185     12.58%             |   mmap                 104     1.638     0.006     0.016     0.284     16.74%
   mmap                 107     1.138     0.005     0.011     0.054      6.46%             |   open                  60     1.162     0.008     0.019     0.114     11.38%
   clone                 11     1.128     0.000     0.103     0.179     14.53%             |   clone                 11     1.145     0.000     0.104     0.163     14.58%
   open                  67     1.061     0.005     0.016     0.134     14.39%             |   write                 35     0.973     0.006     0.028     0.121     18.67%
   stat                  58     0.757     0.005     0.013     0.035      6.92%             |   stat                  54     0.815     0.007     0.015     0.057      8.82%
   fstat                 63     0.449     0.005     0.007     0.022      7.71%             |   lstat                 53     0.624     0.006     0.012     0.048      9.71%
   lstat                 53     0.442     0.005     0.008     0.025      6.72%             |   munmap                27     0.449     0.010     0.017     0.062     12.00%
   close                 54     0.435     0.005     0.008     0.082     18.46%             |   fstat                 61     0.438     0.005     0.007     0.033      8.95%
   munmap                31     0.359     0.006     0.012     0.026      8.22%             |   sched_getaffinity        4     0.427     0.005     0.107     0.339     73.95%
   rt_sigaction          25     0.155     0.005     0.006     0.018      9.53%             |   close                 51     0.362     0.005     0.007     0.021      8.02%
   access                 7     0.119     0.008     0.017     0.031     17.26%             |   madvise                2     0.176     0.014     0.088     0.161     83.71%
   socket                 3     0.064     0.010     0.021     0.031     29.02%             |   unlink                 2     0.161     0.037     0.081     0.124     53.93%
   rt_sigprocmask        10     0.058     0.005     0.006     0.008      5.55%             |   rt_sigaction          25     0.146     0.005     0.006     0.010      5.47%
   unlink                 2     0.052     0.011     0.026     0.040     56.16%             |   socketpair             1     0.136     0.136     0.136     0.136      0.00%
   getdents               6     0.047     0.005     0.008     0.012     16.26%             |   getdents               6     0.128     0.009     0.021     0.076     51.96%
   socketpair             1     0.044     0.044     0.044     0.044      0.00%             |   socket                 3     0.121     0.015     0.040     0.071     40.68%
   fcntl                  8     0.043     0.005     0.005     0.006      4.26%             |   access                 7     0.109     0.011     0.016     0.032     18.15%
   fchdir                 4     0.036     0.005     0.009     0.018     34.77%             |   ftruncate              3     0.086     0.009     0.029     0.046     37.49%
   getrlimit              6     0.032     0.005     0.005     0.008     10.32%             |   connect                2     0.077     0.017     0.038     0.060     55.41%
   sched_getaffinity        4     0.031     0.005     0.008     0.012     19.76%           |   rt_sigprocmask        10     0.062     0.005     0.006     0.010      8.98%
   getcwd                 2     0.024     0.006     0.012     0.018     51.66%             |   mkdir                  1     0.049     0.049     0.049     0.049      0.00%
   connect                2     0.022     0.007     0.011     0.015     37.47%             |   fcntl                  8     0.046     0.005     0.006     0.006      3.31%
   geteuid                4     0.020     0.005     0.005     0.005      4.42%             |   getrlimit              6     0.031     0.005     0.005     0.005      1.97%
   set_robust_list        1     0.018     0.018     0.018     0.018      0.00%             |   geteuid                4     0.023     0.006     0.006     0.006      3.19%
   sched_yield            2     0.016     0.006     0.008     0.010     22.45%             |   fchdir                 4     0.021     0.005     0.005     0.006      2.45%
   madvise                1     0.012     0.012     0.012     0.012      0.00%             |   sched_yield            2     0.015     0.006     0.007     0.009     19.05%
   ftruncate              2     0.011     0.005     0.006     0.006     12.29%             |   clock_getres           2     0.012     0.005     0.006     0.006      8.15%
   gettid                 2     0.011     0.005     0.006     0.006     11.64%             |   gettid                 2     0.010     0.005     0.005     0.005      2.02%
   clock_getres           2     0.010     0.005     0.005     0.005      1.04%             |   getcwd                 1     0.009     0.009     0.009     0.009      0.00%
   getuid                 1     0.010     0.010     0.010     0.010      0.00%             |   getsockname            1     0.008     0.008     0.008     0.008      0.00%
   uname                  1     0.006     0.006     0.006     0.006      0.00%             |   getuid                 1     0.006     0.006     0.006     0.006      0.00%
   mkdir                  1     0.006     0.006     0.006     0.006      0.00%             |   uname                  1     0.006     0.006     0.006     0.006      0.00%
   getsockname            1     0.006     0.006     0.006     0.006      0.00%             |   set_robust_list        1     0.005     0.005     0.005     0.005      0.00%
   setrlimit              1     0.005     0.005     0.005     0.005      0.00%             |   setrlimit              1     0.005     0.005     0.005     0.005      0.00%
                                                                                           |
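
For illustration, batching many serialized events into a single write() and a single fsync could look roughly like this (hypothetical sketch, not an existing Logstash API):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.List;

// Hypothetical bulk write: serialize a whole batch into one buffer, then issue a
// single write() and a single force() for the batch.
final class BatchWriter {
    static void writeBatch(FileChannel channel, List<byte[]> events) throws IOException {
        int total = events.stream().mapToInt(e -> e.length).sum();
        ByteBuffer buffer = ByteBuffer.allocate(total);
        for (byte[] event : events) {
            buffer.put(event);
        }
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);   // one syscall covers many events (modulo short writes)
        }
        channel.force(false);        // one fsync per batch instead of per event
    }
}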
colinsurprenant commented 4 years ago

Closing - no follow-up is planned on this in the foreseeable future. We will reboot if need be.