Closed — @original-brownbear closed this 4 years ago
@suyograo @jordansissel @colinsurprenant FYI
Thanks @original-brownbear - nice work! This looks promising!
.force(), hence our optimistic recovery mechanism. @colinsurprenant thanks :)
I'd like to see the behavior in a read/write situation on the same page.
Yea we should have this for sure. I'll try to come up with a JMH for that, any input welcome here ... not sure what a realistic scenario would be? Just two threads on one Queue, one writing in a loop and the other trying to drain?
I believe we will lose the process-crash durability characteristic of using mmapping?
Phew :D "it depends" :) ... on how we set up our FileChannel. If we get it from a RandomAccessFile, which has no JVM-side write cache, we're in the same spot we're in for mmap, I believe. If we use other FileChannel implementations we'll see different behavior if they do JVM-side caching (or we do it manually, which would potentially be a huge speedup).
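To make the "it depends" concrete, here is a minimal sketch (class and method names are illustrative, not from the PR) of the two common ways to obtain an unbuffered FileChannel; both write straight into the OS page cache, so `force()` gives durability comparable to `msync` on an mmaped buffer:

```java
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChannelSetup {
    // Channel backed by a RandomAccessFile: no JVM-side write cache,
    // every write() goes straight to the OS page cache.
    static FileChannel fromRandomAccessFile(Path p) throws Exception {
        return new RandomAccessFile(p.toFile(), "rw").getChannel();
    }

    // Equivalent NIO way to get the same unbuffered channel.
    static FileChannel fromOpen(Path p) throws Exception {
        return FileChannel.open(p, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("queue-page", ".dat");
        try (FileChannel ch = fromRandomAccessFile(p)) {
            ch.write(ByteBuffer.wrap("event".getBytes("UTF-8")));
            ch.force(false); // flush file data only, roughly fdatasync(2)
        }
        System.out.println(Files.size(p)); // prints 5
        Files.deleteIfExists(p);
    }
}
```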
Just two threads on one Queue one writing in a loop and the other trying to drain?
Yeah, we can start from there.
If we get it from a RandomAccessFile which has no JVM side write cache, we're in the same spot we're in for mmap I believe.
That is a good question and deserves validation. Once we correctly assess this we will be in better shape to decide which strategy is better WRT performance vs durability.
Also could you provide the information about the hardware + OS environment this was run on so we can compare? If it wasn't Linux, it should also be run on Linux.
@colinsurprenant so the above was OSX, latest MBP, 2.9GHz i7.
Then I tried Linux just now ... but Linux sucks:
Before (master) and after are kinda the same and pretty slow:
# Run progress: 0.00% complete, ETA 00:00:03
# Fork: 1 of 1
# Warmup Iteration 1: ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console. 66.763 ops/ms
# Warmup Iteration 2: 92.559 ops/ms
# Warmup Iteration 3: 97.081 ops/ms
Iteration 1: 89.533 ops/ms
Iteration 2: 89.650 ops/ms
Iteration 3: 88.168 ops/ms
Iteration 4: 90.817 ops/ms
Iteration 5: 92.475 ops/ms
Iteration 6: 90.764 ops/ms
Iteration 7: 92.728 ops/ms
Iteration 8: 98.508 ops/ms
Iteration 9: 89.712 ops/ms
Iteration 10: 80.121 ops/ms
Iteration 11: 88.515 ops/ms
Iteration 12: 87.272 ops/ms
Iteration 13: 87.973 ops/ms
Iteration 14: 88.658 ops/ms
Iteration 15: 90.905 ops/ms
Iteration 16: 90.466 ops/ms
Iteration 17: 85.875 ops/ms
Iteration 18: 33.412 ops/ms
Iteration 19: 34.286 ops/ms
Iteration 20: 34.756 ops/ms
Iteration 21: 34.746 ops/ms
=> fsync on a default Debian ext4 just sucks ... seems others arrived at the same conclusion.
Will research this more tomorrow :)
Ok, toyed a little more with this. I'll just give the cliff notes and add the data people are interested in, to not make this too spammy :)
Findings:
- mmap writes more than Channel IO
- mmap uses more memory (on Linux at least; on OSX I have no clue how to check for real memory usage correctly ... though if the "Activity Monitor" isn't lying there is no visible effect on OSX)
- barrier=0 makes for a huge speedup
- => fsync hits us hard with ext4

I would draw these conclusions (assuming I didn't make a mistake in the above):
- fsync (can't do much here, but probably have wiggle room with the checkpointing strategy)
- Buffer writes to the Channel in accordance with ack settings
- Optimize for ext4

... just to illustrate how much room we have optimizing the serialization :) This is the heap when running the benchmark:
It would be interesting to see the relation between:
In particular, knowing that mmap IO uses off-heap memory, can we see a page-size to free-memory ratio where it performs better/worse? Same idea with channel IO: does the heap size have an impact?
Serialization is the elephant in the room for performance impact, this is known. (cc @guyboertje)
Let's not lose sight of the bigger picture in terms of performance. Obviously reducing or disabling fsync and/or setting barrier=0 would improve throughput significantly for raw write performance, but if durability is not important then users would be better off using in-memory queuing. So we have to keep in mind that durability is an important property of PQ (but a tunable one, between max durability and lesser durability depending on your requirements/tolerance).
Buffer writes to the Channel in accordance with ack settings
This is a good idea but this is again a balancing act between throughput and durability. An interesting aspect of mmap IO is that as soon as your bytes are written to the buffer, they will make it to disk regardless of a process crash. Buffering writes at app level for channel IO creates one more potential data loss layer upon process crash. I'd rather put more efforts toward having proper batch writes and end-to-end batch support which mitigates the need to have extra buffering while improving throughput.
Also, did you compare throughput between SSD and spinning disks?
Optimize for ext4
We should also validate on XFS, since it is the default for RHEL-7
@colinsurprenant
page size and OS memory for mmap IO.
In the current implementation, we're calling .load on the buffer, so we need at least page-size bytes of off-heap memory. Removing the .load calls makes things less deterministic, but assuming the case of writing a new page and filling it (while the page size fits in memory), the OS and the JVM will do their best to still put the whole page in off-heap memory.
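For reference, a minimal sketch of the map-and-load step being discussed (the page size here is a made-up value, not the PQ default):

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MapAndLoad {
    static final int PAGE_SIZE = 1 << 20; // hypothetical 1 MiB queue page

    static MappedByteBuffer mapPage(Path file) throws Exception {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // map() extends the file to PAGE_SIZE and exposes it as off-heap memory;
            // the mapping stays valid after the channel is closed
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, PAGE_SIZE);
            buf.load(); // touches every OS page -> needs ~PAGE_SIZE of resident RAM
            return buf;
        }
    }

    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("page", ".dat");
        MappedByteBuffer buf = mapPage(p);
        buf.put((byte) 42); // lands directly in the page cache
        buf.force();        // msync(2): flush the mapped region to disk
        Files.deleteIfExists(p);
    }
}
```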
page size and JVM heap size for channel IO.
There is no relation. The buffer size is constant in this scenario and whether you append to a 1TB or 100MB file makes no difference for a channel.
Both should not be affected by heap-size considerations at all (in terms of just pure I/O). Neither implementation uses on-heap buffers.
Also, did you compare throughput between SSD and spinning disks?
Phew, I can try and dig out one of those :D But it's probably not even worth the time; obviously the spinning disk will be slower with mmap than with append writes since seeks are so expensive on a spinning disk and mmap is known to trigger "unnecessary" seeks (not that I could create a better mmap, it just comes with the territory of random access).
@original-brownbear what actually interests me is the page fault cost relative to the amount of free memory and page size. If the amount of free memory is significantly larger compared to the page size, I am guessing that page faults will not be very costly overall.
@colinsurprenant
if the amount of free memory is significantly larger compared to the page size, I am guessing that page faults will not be very costly overall.
Sure, in practice this statement holds up to a certain throughput x and queue size y above which you start putting pressure on your OS's paging logic. Obviously, x and y will be to a degree dependent on the JVM's consumption of overall available memory, but how does that relate to designing the queue?
@original-brownbear not sure I understand your previous comment. What I am trying to say is that with the current mmap implementation, a maximum of two pages can be mmaped/loaded simultaneously. For a page size P, a maximum of 2P of off-heap memory will be required to hold two pages' byte buffers. If the amount of off-heap available memory is somewhat larger than 2P then the number of page faults should be minimal. I am wondering what that somewhat-larger factor or ratio of off-heap memory relative to page size is, and if there is a rule of thumb for where, below a given ratio, page faults will start to have a significant performance impact. Or put another way, if the amount of off-heap memory is a lot bigger than a page size, say 10x or 100x, can we assume that page faults will not have any negative performance impact?
So, I believe this is something we should investigate and also take into account when benchmarking.
@colinsurprenant
If the amount of off-heap available memory is somewhat larger than 2P then the number of page faults should be minimal
Not really; evicting the page cache is not free at all and has latency to it. The exact semantics of how the amount of free memory influences the throughput are strongly influenced by the environment. I don't think this is all that relevant, fortunately. We should simply try to get the code to be physically I/O bound in a microbenchmark and then minimize its memory footprint; that's all we can do in terms of moving towards a defined target.
Put differently: The amount of free memory shouldn't have a significant impact (unless we're running so low that the system is impacted) on write performance since we're fsyncing anyways.
I wonder what this "so low" actually is. For example, if a very large page size is configured (say in GB) and the off-heap memory is small, what will the performance impact be versus a smaller page size or larger free memory? At which point will there be a significant performance impact?
@colinsurprenant for FileChannel IO there is no difference between reading from or writing to a small and a large file here (assuming the "small" file is still at least a few MB in size). The file size is completely irrelevant to raw I/O performance in this case.
Oh, I was talking about mmap IO.
@colinsurprenant for mmap this is simply a question of what write throughput (i.e. rate of allocation of virtual memory and all the related cleanups) your system can handle.
The slower your system (seek + write, and to a lesser degree CPU resources), the more RAM you'll need to make mmap not slow down on input spikes. But high load leading to slowdown (i.e. running into #7350) cannot be handled by more RAM, since the slowdown is caused by the system not cleaning up vmem/mem quickly enough. More RAM just delays the slowdown.
@original-brownbear I am sure this can be measured, at which free-memory ratio relative to page size is mmap underperforming. The cause of #7350 is not yet known, it might not be related.
@colinsurprenant
The cause of #7350 is not yet known, it might not be related.
I explicitly reported #7350 behavior in https://github.com/elastic/logstash/issues/7162#issuecomment-302796372 and https://github.com/elastic/logstash/issues/7162#issue-229907910 ?
@original-brownbear At this point we cannot tell if it is related or not. Premature assumption.
@colinsurprenant but https://github.com/elastic/logstash/issues/7317#issue-233696479 also shows that the drop in performance goes away with Channel writes?
How could this possibly not be related? The only logical way would be that we have a second performance issue in the code up-stack from the Queue that comes with the exact same performance characteristics. But since writing to a Channel removes the issue, this is also impossible?
@original-brownbear the execution context in #7317 is completely different from #7350. If the problem end up being related it is all for the better and we will have more facts to help move forward with some refactoring.
@colinsurprenant what facts? mmap is to blame. So the only way this (mmap) is not the cause of #7350 would be that there is another bug that has the exact same characteristics further up-stack. This seems extremely unlikely, especially considering that the ack logic isn't exposed further up the stack as far as I can see?
Moreover, due to the behavior shown in the Channel PR, the situation is even worse: since that bug is there, even if you fixed whatever up-stack issue there was, you wouldn't see a qualitative change in behavior (depending on the hypothetical scenario here ... you may see a quantitative one).
While if you fixed the mmap issue, existing benchmarks would prove beyond the shadow of a doubt that there is a remaining issue up-stack (but since using the in-memory queue doesn't show the problem, we actually already have the conclusion here anyway).
Thanks @original-brownbear for your input, I'll keep that in mind. I'll follow up when I find out.
The mmap strategy should only be dirtying memory pages that are modified, and our queue page files are append-only, so the only dirty pages that should ever exist in a page are new-data appends. Memory pages that have not been accessed recently will be evicted by the OS to make room for more important things than idle memory pages. Hand-wave: my rough understanding is that the only live memory pages in our queue will be the recently-dirtied ones.
Focusing on FileChannel, I'm trying to come up with pros/cons that aren't speed-related:
- MMap writes have no feedback and if they fail we probably won't know until later?
- A channel addresses the file with a long offset while a MappedByteBuffer is limited to int, so channel IO would allow page files larger than 2GB. (Though I confess page files this large may have no benefit)
@jordansissel
MMap writes have no feedback and if they fail we probably won't know until later?
Well, on force you get immediate feedback, but it's a segfault of the JVM in many cases :( A standard put into the buffer doesn't really give feedback though, true.
(Though I confess page files this large may have no benefit)
Depends on the system I guess. Probably not; the benefit is that you save fds at these scales, I think (unless there is a fs out there that really sucks at creating files and that I don't know about). If you have 1TB+ of data then yes; for what we expect (at least for now :P), likely no :)
User seemingly affected directly by the allocation issue: https://discuss.elastic.co/t/logstash-not-starting-with-cannot-allocate-memory-error/88558
I have been trying to follow this discussion on write performance of mmap vs. channel io for the Persistent Queue page and wanted to validate my understanding of how things work.
The write path for mmap files uses a MappedByteBuffer via FileChannel.map. The MappedByteBuffer is a direct buffer that is mapped directly to the page cache via the OS's virtual memory (VM), such that we have direct access to the page cache but must pay the cost of a soft/minor page fault every 4k (or whatever the page size is set to) for the MMU to correctly map to the physical memory.
So mmap is
(create) File -> (update) Buffer -> (sync) Disk
The sync to disk can be explicit or implicit.
The write path for Channel IO is similar, except that it uses another buffer first. That buffer can be on or off heap, but preferably off heap. The allocation of the off-heap buffer works similarly to malloc, such that it is treated as a standard memory allocation with a contiguous memory block, and thus does not require continual soft/minor page faults to keep writing. Instead of writing directly to the file's page cache, we first write to the first buffer, then simply copy the contents to the second (page cache) buffer.
So Channel IO is
(on or off heap) Buffer -> (create) File -> (copy from first buffer) Buffer -> (sync) Disk
The sync can be explicit or implicit, but does have the additional step of needing to copy buffers first.
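The two write paths above could be sketched side by side like this (a simplification; the staging-buffer size and method names are made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WritePaths {
    // mmap path: (create) File -> (update) Buffer -> (sync) Disk
    static void mmapWrite(Path file, byte[] event) throws Exception {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer page =
                    ch.map(FileChannel.MapMode.READ_WRITE, 0, event.length);
            page.put(event); // written straight into the page cache
            page.force();    // explicit sync (msync); otherwise implicit, later
        }
    }

    // channel path: (off-heap) Buffer -> (copy into) page cache -> (sync) Disk
    static void channelWrite(Path file, byte[] event) throws Exception {
        ByteBuffer staging = ByteBuffer.allocateDirect(event.length);
        staging.put(event);
        staging.flip();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE)) {
            ch.write(staging); // copies the staging buffer into the page cache
            ch.force(false);   // explicit sync (fdatasync); otherwise implicit, later
        }
    }
}
```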
So w/r/t write performance of mmap vs. Channel IO, it comes down to:
Cost associated with:
[Double Buffer + Memory Copy] vs. [Single Buffer + MMU soft page faults]
There are certainly other factors w/r/t write performance in play ... but I believe that this is the heart of the difference. If I am correct, then won't any benchmarks really be measuring the differences in hardware speed and kernel versions? On platform X, is the MMU VM allocation faster than malloc memory allocations + copy for page size X, or vice versa? I assume that this may be the reason why the Mac-based tests were so fast in relation to the Linux-based tests (was it done on different hardware?). It would seem that all benchmark attempts will be highly dependent (more so than normal) on the platform (hardware and software) on which they are run.
It seems the answer to which is faster may be "it depends". I think the largest factors for performance here are the other noted performance issues, in particular the READ speed (for which pre-mapped mmap seems like a good fit).
I am not an expert in this space, but unless I am missing something, I don't think that the write speed of mmap vs. channel IO should drive the overall design decision, since the benchmarking will be fundamentally tied to the hardware on which it is run.
One minor optimization for the mmap files may be to try to increase the page size, though early testing shows only very marginal increase in performance (likely within the realm of statistical noise).
@jakelandis :) let me try and address a few points here:
If I am correct, then won't any benchmarks really be measuring the differences in hardware speed and kernel versions.
Given an optimal implementation in C, yes, results will be very close between fwrite and mmap implementations and basically just measuring the physical drive throughput (assuming all you care about is writing a single large chunk of data; we don't do that, we write many small chunks sequentially and fsync in between => very different scenario, see below).
The fact that we want to fsync many smaller writes changes the scenario a lot.
Now mmap forces synchronizing the whole file (that doesn't mean you have to really write all sectors, but there is overhead in figuring out what to write) even though only a small section of its data was changed between synchronizations. We know that only the tail of the file changed; the kernel can't know that. If you do append writes from a FileChannel though, the kernel does know that, and the smaller area that needs to be synchronized makes fsync calls a whole lot cheaper.
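The append-and-fsync pattern just described can be sketched as follows (the record size and counts are arbitrary, not the PQ's real values):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendFsync {
    // Append `count` fixed-size records, fsyncing after each one. The kernel
    // knows only the freshly appended tail pages are dirty, so each force()
    // only has to flush that small area.
    static long appendRecords(Path file, int count) throws Exception {
        ByteBuffer record = ByteBuffer.allocate(64); // hypothetical record size
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            for (int i = 0; i < count; i++) {
                record.clear();
                record.putInt(i); // fake payload in the first 4 bytes
                record.position(0);
                ch.write(record); // append 64 bytes at the tail
                ch.force(false);  // fdatasync: flush just-dirtied pages
            }
        }
        return Files.size(file);
    }
}
```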
This fact actually led to the benchmark result that,
One minor optimization for the mmap files may be to try to increase the page size
... the exact opposite is true :( Larger pages make the fsync a ton more expensive, because you increase the size of the memory area that needs to be needlessly synchronized with the disk.
I don't think that the write speed of mmap vs. channel IO should drive the over all design decision since the bench-marking will be fundamentally tied to the hardware on which it is run.
Assuming the fsync problem weren't around, I would agree here. The fsync requirement, though, actually creates a situation where the I/O speed is bound by the design decision made, at least on systems that respect fsync. AWS instances (ephemeral SSD and EBS at least) don't, for example. Neither do many VM providers (there are actually blog posts out there that claim VM disk access is faster than direct access, for magical reasons :D) or ext3/4 setups if you set barrier=0, and so on :)
What we should optimize for is resource consumption of the writes. Here mmap as implemented by the JVM (note, in the JVM's defense, it actually does offer "mmap" on systems that don't do that call natively) comes with a lot of overhead compared to the outright system call from C (see the fact that we have to manually force the Cleaner etc. here).
Hope this clears up some of this stuff and the way the discussion went so far :)
@original-brownbear - Thanks for the explanation, that makes much more sense.
I wrote a little application to try to quantify the cost of fsync between mmap vs. channel IO for sequential IO. (I think the code is illustrative of the issue here; if not, please let me know)
https://github.com/jakelandis/fsync_compare
You can run yourself (or pull from source):
cd /tmp
wget https://github.com/jakelandis/fsync_compare/raw/master/bin/fsync_compare-0.1-SNAPSHOT.jar
java -Xmx1024m -jar fsync_compare-0.1-SNAPSHOT.jar
I am getting really odd results based on the system on which I am running:
Channel Test
==============
Creating a 1GB tempfile : /tmp/channel273162193519157734.tmp
....
# of Writes : 3486175
All total time : 4230738593 ns
Fsync total time : 4172383 ns
Fsync Percent of total time : 0.09862067599504841 %
MMap Test
==============
Creating a 1GB tempfile : /tmp/mmap6091803940181452000.tmp
....
# of Writes : 3486175
All total time : 422384143 ns
Fsync total time : 3389767 ns
Fsync Percent of total time : 0.8025317844377506 %
Channel Test
==============
Creating a 1GB tempfile : /var/folders/x4/76h2pqn96jl492qycz3fz7q00000gn/T/channel3614441115713821871.tmp
.....
# of Writes : 3486175
All total time : 6854179446 ns
Fsync total time : 942155809 ns
Fsync Percent of total time : 13.745712618449588 %
MMap Test
==============
Creating a 1GB tempfile : /var/folders/x4/76h2pqn96jl492qycz3fz7q00000gn/T/mmap2460314887547029005.tmp
....
# of Writes : 3486175
All total time : 308262885215 ns
Fsync total time : 307584243992 ns
Fsync Percent of total time : 99.77984984389974 %
The difference could be the mount options for /tmp on Linux vs. /var on mac (though I didn't explicitly see it in fstab).
Not necessarily trying to accomplish anything specific here, except to understand the performance differences.
Ahh.. the issue was the /tmp directory on Linux ... mapping the queue file directly to a regular directory (slightly old SSD) gets similar results to the Mac
Channel Test
==============
Creating a 1GB tempfile : /apps/tmp1
...
# of Writes : 3486175
All total time : 16385347605 ns
Fsync total time : 8102758793 ns
Fsync Percent of total time : 49.45124747019366 %
MMap Test
==============
Creating a 1GB tempfile : /apps/tmp2
....
# of Writes : 3486175
All total time : 8311587013 ns
Fsync total time : 7535912499 ns
Fsync Percent of total time : 90.6675522642453 %
@jakelandis thanks for this benchmark; the difference between tmpfs and a persistent partition is pretty instructive here imo :)
What's interesting is that on Linux, while a much higher percentage (40% more) of the time is spent on fsync with MMap writes, the overall time and the total fsync time are actually lower.
Channel IO = 16.38s, MMap IO = 8.31s; ~.5s less time in fsync for MMap IO
I would encourage others to check my benchmark code for logic errors or errors that may skew the results, and to run on newer hardware [1].
To avoid the tmpfs skew, change the temp directory like this:
java -Xmx1024m -Djava.io.tmpdir=/apps -jar fsync_compare-0.1-SNAPSHOT.jar
[1] Test run on an 2007 8 core/32GB MacPro 1,1 with Xeon 2.66 running Fedora 25
mount/fs:
/dev/disk/by-uuid/626fca5e-587e-4c9a-8807-20b3d5ce7f59 /apps auto nosuid,nodev,nofail,x-gvfs-show,x-gvfs-name=ssd 0 0
/dev/sdc1 ext4 220G 307M 208G 1% /apps
@jakelandis I think the reason for this is the file-system capability "sparse allocation", which OSX doesn't have. => OSX has to actually allocate the disk space prior to the run; Linux doesn't have to because it has sparse allocation. A little short on time right now, but Google can fill in the blanks I just left here easily, I think :)
Now mmap forces synchronizing the whole file (doesn't mean you have to really write all sectors, but there is overhead in figuring out what to write) even though only a small section of its data was changed between synchronizations.
Can you cite where this happens "overhead in figuring out"? I'm not disputing, I'm trying to understand more what exactly happens.
I tried to look at msync vs fdatasync on Linux. Both msync and fdatasync manpages claim to flush "all modified in-core data". I went walking in the linux kernel code to see what kind of work is done but got lost in a maze of method calls -- I may revisit later. Based on my reading, it seems like both msync fdatasync have to do the same basic work (iterate over dirty pages + flush to disk), but it sounds like from your description, mmap has to iterate all pages to even find which ones are dirty?
I'm interested in studying, so I will see about writing a test that compares mmap+force vs filechannel+fsync
What's interesting is that on Linux while a much higher percentage (40% more) of the time is spent on Fsync via MMap writes, but the overall time and the total Fsync time is actually lower.
@original-brownbear - Sorry, what I meant was:
What's interesting is that on Linux, when comparing Channel IO vs. MMap, while a much higher percentage (40% more) of the time is spent on fsync with MMap writes, the overall time and the total fsync time are actually lower.
I am not concerned about OSX/MacOs.
(no one cares about OSX/HFS in production) agreed
@colinsurprenant
Also, did you compare throughput between SSD and spinning disks?
Same machine as above but with (software) RAID 1 spinny disks.
Channel Test
==============
Creating a 1GB tempfile : /storage/channel3898164774960818867.tmp
# of Writes : 3486175
All total time : 228764226834 ns
Fsync total time : 220368264387 ns
Fsync Percent of total time : 96.32986216280554 %
MMap Test
==============
Creating a 1GB tempfile : /storage/mmap8942863789224568451.tmp
# of Writes : 3486175
All total time : 255131738496 ns
Fsync total time : 254270271278 ns
Fsync Percent of total time : 99.66234415871645 %
Here Channel IO wins.
@jordansissel
but it sounds like from your description, mmap has to iterate all pages to even find which ones are dirty?
Yup, I can add a link to the source code later once I have more time (ping me if you still care and I forget), but if you think about it, this is actually self-evident and must be true for all systems because: assuming I mmap some file, now I modify the buffer but simultaneously modify the physical file as well (just open another fd on it) and then msync the buffer. The modifications done via the second fd need to be overwritten, but the only way this can happen is if the msync looks for dirty pages again :) Since the file is not synced back to the mapped buffer in any deterministic way, it simply doesn't suffice for the buffer to just track which of its memory addresses it altered; it needs to look back into the fs.
Hope this makes sense, otherwise better explanation later :)
@jordansissel
I'm interested in studying, so I will see about writing a test that compares mmap+force vs filechannel+fsync
That is what fsync_compare does, though it may be poorly named and is pretty limited in scope w/r/t permutations of test conditions.
@jakelandis I ran your test on my laptop (linux vm) with two modifications:
1) Set the java tmpdir to somewhere that hits disk (/tmp is hosted in RAM with tmpfs on this machine)
2) comment out the "." printing to clean up the output for my eyes (should not impact the results)
% java -Djava.io.tmpdir=$HOME/build/tmp/ -jar target/fsync_compare-0.1-SNAPSHOT.jar
Channel Test
==============
Creating a 1GB tempfile : /home/jls/build/tmp/channel7259058140351604702.tmp
# of Writes : 3486175
All total time : 39483991393 ns
Fsync total time : 35971409167 ns
Fsync Percent of total time : 91.10378129951995 %
MMap Test
==============
Creating a 1GB tempfile : /home/jls/build/tmp/mmap1796784660482438602.tmp
# of Writes : 3486175
All total time : 40406788454 ns
Fsync total time : 39461419475 ns
Fsync Percent of total time : 97.66037090506158 %
The resulting execution time is nearly identical (40.4 seconds MMap, 39.4 seconds FileChannel), so it is likely the "fsync percent total" (97% for MMap, 91% for FileChannel) is obscuring something (for example, a full buffer will cause a FileChannel.write() to block where an MMap.put will never block)
Running perf record (and later, perf report) on both @jakelandis' mmap and channel tests --
% sudo perf record -e 'block:*' -o mmap.perf java -Djava.io.tmpdir=$HOME/build/tmp -jar target/fsync_compare-0.1-SNAPSHOT.jar mmap
% sudo perf record -e 'block:*' -o channel.perf java -Djava.io.tmpdir=$HOME/build/tmp -jar target/fsync_compare-0.1-SNAPSHOT.jar channel
perf event | mmap | channel |
---|---|---|
block:block_touch_buffer | 0 | 0 |
block:block_dirty_buffer | 263K | 3M |
block:block_rq_abort | 0 | 0 |
block:block_rq_requeue | 0 | 0 |
block:block_rq_complete | 44 | 44 |
block:block_rq_insert | 8K | 6K |
block:block_rq_issue | 8K | 6K |
block:block_bio_bounce | 0 | 0 |
block:block_bio_complete | 63 | 12 |
block:block_bio_backmerge | 3K | 7 |
block:block_bio_frontmerge | 0 | 0 |
block:block_bio_queue | 23K | 13K |
block:block_getrq | 8K | 6K |
block:block_sleeprq | 0 | 0 |
block:block_plug | 6K | 6K |
block:block_unplug | 6K | 6K |
block:block_split | 0 | 0 |
block:block_bio_remap | 20K | 10K |
block:block_rq_remap | 0 | 0 |
perf trace -s for channel vs mmap in @jakelandis's test:
fdatasync() vs msync() performance is practically identical. Raw data in the table below. I ran this test in a Fedora 23 VM that uses xfs for the filesystem.
operation | minimum | average | max | stddev |
---|---|---|---|---|
fdatasync (channel) | 7.103ms | 11.112ms | 83.761ms | 0.56% |
msync (mmap) | 7.081ms | 11.454ms | 78.684ms | 0.59% |
Left side is FileChannel. Right side is MMap -- perf trace -s output only showing the main java thread (I think).
Notably, the test I am running assumes one write == one event (FileChannel). In some future where we support bulk writes into the queue, one FileChannel.write() is likely to cover many (10-1000) events, which would significantly reduce the number of write() syscalls.
java (8811), 20988182 events, 99.7% | java (8746), 14013259 events, 99.7%
|
syscall calls total min avg max stddev | syscall calls total min avg max stddev
(msec) (msec) (msec) (msec) (%) | (msec) (msec) (msec) (msec) (%)
--------------- -------- --------- --------- --------- --------- ------ | --------------- -------- --------- --------- --------- --------- ------
clock_gettime 7000707 59935.226 0.000 0.009 13.935 0.16% | clock_gettime 6999698 68895.173 0.000 0.010 45.412 0.22%
fdatasync 3404 37823.723 7.103 11.112 83.761 0.56% | msync 3404 38988.243 7.081 11.454 78.684 0.59%
write 3486210 34226.998 0.000 0.010 11.871 0.24% | read 1304 14.945 0.005 0.011 0.212 2.86%
read 1406 11.675 0.005 0.008 0.623 5.51% | lseek 1244 14.412 0.005 0.012 0.467 4.87%
futex 182 10.018 0.005 0.055 4.108 43.47% | futex 145 4.094 0.005 0.028 0.276 17.06%
lseek 1330 9.382 0.005 0.007 0.247 3.07% | mprotect 126 1.876 0.006 0.015 0.151 8.53%
gettimeofday 185 1.767 0.005 0.010 0.056 4.61% | gettimeofday 152 1.689 0.005 0.011 0.146 9.33%
mprotect 124 1.450 0.005 0.012 0.185 12.58% | mmap 104 1.638 0.006 0.016 0.284 16.74%
mmap 107 1.138 0.005 0.011 0.054 6.46% | open 60 1.162 0.008 0.019 0.114 11.38%
clone 11 1.128 0.000 0.103 0.179 14.53% | clone 11 1.145 0.000 0.104 0.163 14.58%
open 67 1.061 0.005 0.016 0.134 14.39% | write 35 0.973 0.006 0.028 0.121 18.67%
stat 58 0.757 0.005 0.013 0.035 6.92% | stat 54 0.815 0.007 0.015 0.057 8.82%
fstat 63 0.449 0.005 0.007 0.022 7.71% | lstat 53 0.624 0.006 0.012 0.048 9.71%
lstat 53 0.442 0.005 0.008 0.025 6.72% | munmap 27 0.449 0.010 0.017 0.062 12.00%
close 54 0.435 0.005 0.008 0.082 18.46% | fstat 61 0.438 0.005 0.007 0.033 8.95%
munmap 31 0.359 0.006 0.012 0.026 8.22% | sched_getaffinity 4 0.427 0.005 0.107 0.339 73.95%
rt_sigaction 25 0.155 0.005 0.006 0.018 9.53% | close 51 0.362 0.005 0.007 0.021 8.02%
access 7 0.119 0.008 0.017 0.031 17.26% | madvise 2 0.176 0.014 0.088 0.161 83.71%
socket 3 0.064 0.010 0.021 0.031 29.02% | unlink 2 0.161 0.037 0.081 0.124 53.93%
rt_sigprocmask 10 0.058 0.005 0.006 0.008 5.55% | rt_sigaction 25 0.146 0.005 0.006 0.010 5.47%
unlink 2 0.052 0.011 0.026 0.040 56.16% | socketpair 1 0.136 0.136 0.136 0.136 0.00%
getdents 6 0.047 0.005 0.008 0.012 16.26% | getdents 6 0.128 0.009 0.021 0.076 51.96%
socketpair 1 0.044 0.044 0.044 0.044 0.00% | socket 3 0.121 0.015 0.040 0.071 40.68%
fcntl 8 0.043 0.005 0.005 0.006 4.26% | access 7 0.109 0.011 0.016 0.032 18.15%
fchdir 4 0.036 0.005 0.009 0.018 34.77% | ftruncate 3 0.086 0.009 0.029 0.046 37.49%
getrlimit 6 0.032 0.005 0.005 0.008 10.32% | connect 2 0.077 0.017 0.038 0.060 55.41%
sched_getaffinity 4 0.031 0.005 0.008 0.012 19.76% | rt_sigprocmask 10 0.062 0.005 0.006 0.010 8.98%
getcwd 2 0.024 0.006 0.012 0.018 51.66% | mkdir 1 0.049 0.049 0.049 0.049 0.00%
connect 2 0.022 0.007 0.011 0.015 37.47% | fcntl 8 0.046 0.005 0.006 0.006 3.31%
geteuid 4 0.020 0.005 0.005 0.005 4.42% | getrlimit 6 0.031 0.005 0.005 0.005 1.97%
set_robust_list 1 0.018 0.018 0.018 0.018 0.00% | geteuid 4 0.023 0.006 0.006 0.006 3.19%
sched_yield 2 0.016 0.006 0.008 0.010 22.45% | fchdir 4 0.021 0.005 0.005 0.006 2.45%
madvise 1 0.012 0.012 0.012 0.012 0.00% | sched_yield 2 0.015 0.006 0.007 0.009 19.05%
ftruncate 2 0.011 0.005 0.006 0.006 12.29% | clock_getres 2 0.012 0.005 0.006 0.006 8.15%
gettid 2 0.011 0.005 0.006 0.006 11.64% | gettid 2 0.010 0.005 0.005 0.005 2.02%
clock_getres 2 0.010 0.005 0.005 0.005 1.04% | getcwd 1 0.009 0.009 0.009 0.009 0.00%
getuid 1 0.010 0.010 0.010 0.010 0.00% | getsockname 1 0.008 0.008 0.008 0.008 0.00%
uname 1 0.006 0.006 0.006 0.006 0.00% | getuid 1 0.006 0.006 0.006 0.006 0.00%
mkdir 1 0.006 0.006 0.006 0.006 0.00% | uname 1 0.006 0.006 0.006 0.006 0.00%
getsockname 1 0.006 0.006 0.006 0.006 0.00% | set_robust_list 1 0.005 0.005 0.005 0.005 0.00%
setrlimit 1 0.005 0.005 0.005 0.005 0.00% | setrlimit 1 0.005 0.005 0.005 0.005 0.00%
|
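The bulk-write idea noted above (one FileChannel.write() covering many events instead of one syscall per event) can be sketched with NIO's gathering write; this is a hypothetical helper, not the PQ code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.List;

public class BatchWriter {
    // Write a whole batch of events with gathering write()s and a single
    // fsync, instead of one write()/fsync pair per event.
    static long writeBatch(FileChannel ch, List<ByteBuffer> events) throws IOException {
        ByteBuffer[] srcs = events.toArray(new ByteBuffer[0]);
        long total = 0;
        for (ByteBuffer b : srcs) total += b.remaining();
        long written = 0;
        while (written < total) {
            written += ch.write(srcs); // writev(2)-style: many buffers per syscall
        }
        ch.force(false);               // one fsync for the whole batch
        return written;
    }
}
```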
Closing - no followup is planned about this in the foreseeable future. We will reboot if need be.
The current implementation in master uses mmap to write to fixed-size pages that are created and written to sequentially. This introduces a lot of overhead for the VM system, requires hacking around the GC's Cleaner behavior for the mmaped buffers and hence comes with unpredictable and suboptimal performance. https://github.com/elastic/logstash/pull/7316 contains an extremely trivial implementation illustrating the problem.
It simply writes to the channel directly and avoids mmap for writes. The durability features of the approach are exactly the same as those in the master implementation since MappedBuffer.force() is simply replaced by calling Channel.force() on the underlying FileChannel.
The branch passes all tests and the performance difference is about a factor of 3 (plus a lot more consistent throughput).
Before (master):
After:
There's probably a bit of a speedup (50%) achievable with almost no effort by buffering writes in line with durability settings instead of straight flushing the buffer on every write, but I didn't want to invest more than a few minutes just yet before we had a chance to talk about things :) Also note that the current implementation already maxes out at about twice the above throughput if no physical writes are made to the channel (just comment out the actual write), so we need to optimize the surroundings before we want more than this imo; we're far from I/O bound here even with the linked PR.
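The buffered-writes idea could look roughly like this (a hypothetical sketch, not the proposed implementation; the staging-buffer size and flush policy are made up, and events are assumed smaller than the buffer):

```java
import java.io.Closeable;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class BufferedPageWriter implements Closeable {
    private final FileChannel channel;
    // hypothetical 64 KiB off-heap staging buffer
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 16);

    public BufferedPageWriter(FileChannel channel) {
        this.channel = channel;
    }

    // Stage an event; only hit the channel when the staging buffer fills up.
    public void append(byte[] event) throws IOException {
        if (buffer.remaining() < event.length) {
            flush(false);
        }
        buffer.put(event);
    }

    // Drain the staging buffer; fsync only when the ack/durability
    // settings demand it (e.g. at checkpoint boundaries).
    public void flush(boolean sync) throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        buffer.clear();
        if (sync) {
            channel.force(false);
        }
    }

    @Override
    public void close() throws IOException {
        flush(true);
        channel.close();
    }
}
```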
So for me, the approach here seems straightforward:
What do you guys think?