Closed by asfimport 13 years ago.
Yonik Seeley (@yonik) (migrated from JIRA)
Patch for FSIndexInput to use a positional read call that doesn't use explicit synchronization. Note that the implementation of that read call may still involve some synchronization depending on the JVM and OS (notably Windows which lacks a native pread AFAIK).
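The positional read call in question is `FileChannel.read(ByteBuffer, long)`, the JDK's pread wrapper. A minimal standalone sketch of the idea (file name and offsets are illustrative, not the patch itself):

```java
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class PreadDemo {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("pread", ".bin");
        Files.write(tmp, new byte[] {10, 20, 30, 40, 50});
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
            FileChannel ch = raf.getChannel();
            // Positional read: no seek, so the channel's file pointer is never
            // moved and no user-level synchronization is needed across threads.
            ByteBuffer buf = ByteBuffer.allocate(2);
            ch.read(buf, 3); // read up to 2 bytes starting at file offset 3
            System.out.println(buf.get(0) + " " + buf.get(1)); // prints "40 50"
        }
        Files.delete(tmp);
    }
}
```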
Yonik Seeley (@yonik) (migrated from JIRA)
This change should be faster on heavily loaded multi-threaded servers using the non-compound index format. Performance tests are needed to see if there is any negative impact on single-threaded performance.
Compound index format (CSIndexInput) still does synchronization because the base IndexInput is not cloned (and hence shared by all CSIndexInput clones). It's unclear if getting rid of the synchronization is worth the cloning overhead in this case.
Doug Cutting (@cutting) (migrated from JIRA)
This patch continues to use BufferedIndexInput and allocates a new ByteBuffer for each call to read(). I wonder if it might be more efficient to instead directly extend IndexInput and always represent the buffer as a ByteBuffer?
Yonik Seeley (@yonik) (migrated from JIRA)
CSIndexInput synchronization could also be eliminated if a pread were added to IndexInput:
public abstract void readBytes(byte[] b, int offset, int len, long fileposition)
Unfortunately, that would break any custom Directory based implementations out there, and we can't provide a suitable default with seek & read because we don't know what object to synchronize on. Worth it or not???
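A sketch of what a concrete implementation of that signature could look like over FileChannel (the helper class and the short-read loop are illustrative assumptions, not the proposed patch):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class PositionalReadSketch {
    // Hypothetical positional readBytes: no seek, so there is no shared file
    // pointer and therefore no object that needs synchronizing.
    static void readBytes(FileChannel ch, byte[] b, int offset, int len, long filePosition)
            throws IOException {
        ByteBuffer bb = ByteBuffer.wrap(b, offset, len);
        while (bb.hasRemaining()) {
            // Re-derive the file offset from how much has been consumed so far.
            int n = ch.read(bb, filePosition + (bb.position() - offset));
            if (n < 0) throw new IOException("read past EOF");
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("pos", ".bin");
        Files.write(tmp, "hello world".getBytes("US-ASCII"));
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
            byte[] out = new byte[5];
            readBytes(raf.getChannel(), out, 0, 5, 6);
            System.out.println(new String(out, "US-ASCII")); // prints "world"
        }
        Files.delete(tmp);
    }
}
```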
Yonik Seeley (@yonik) (migrated from JIRA)
Here is a patch that directly extends IndexInput to make things a little easier. I started with the code for BufferedIndexInput to avoid any bugs in read(). They share enough code that a common subclass could be factored out if desired (or changes made in BufferedIndexInput to enable easier sharing).
ByteBuffer does have offset, length, etc, but I did not use them because BufferedIndexInput currently allocates the byte[] on demand, and thus would add additional checks to readByte(). Also, the NIO Buffer.get() isn't as efficient as our own array access.
Bogdan Ghidireac (migrated from JIRA)
You can find a NIO variation of IndexInput attached to this issue: #1597
I had good results on multiprocessor machines under heavy load.
Regards, Bogdan
Yonik Seeley (@yonik) (migrated from JIRA)
Thanks for the pointer Bogdan, it's interesting you use transferTo instead of read... is there any advantage to this? You still need to create a new object every read(), but at least it looks like a smaller object.
It's also been pointed out to me that #1492 has some more NIO code.
Bogdan Ghidireac (migrated from JIRA)
The Javadoc says that transferTo can be more efficient because the OS can transfer bytes directly from the filesystem cache to the target channel without actually copying them.
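For illustration, a minimal transferTo sketch (the ByteArrayOutputStream sink is an assumption for demo purposes; Bogdan's attached IndexInput is the real variation):

```java
import java.io.ByteArrayOutputStream;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class TransferToDemo {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("xfer", ".bin");
        Files.write(tmp, "abcdefgh".getBytes("US-ASCII"));
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
            FileChannel src = raf.getChannel();
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            WritableByteChannel target = Channels.newChannel(sink);
            // transferTo is positional (it doesn't move the file pointer); for
            // channel-to-channel copies the OS may skip the user-space copy.
            src.transferTo(2, 4, target); // 4 bytes starting at offset 2
            System.out.println(sink.toString("US-ASCII")); // prints "cdef"
        }
        Files.delete(tmp);
    }
}
```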
Yonik Seeley (@yonik) (migrated from JIRA)
> The Javadoc says that transferTo can be more efficient because the OS can transfer bytes directly from the filesystem cache to the target channel without actually copying them.
Unfortunately, only for DirectByteBuffers and other FileChannels, not for HeapByteBuffers. Sounds like we just need to do some benchmarking, but I have a bad feeling that all the checking overhead Sun added to NIO will cause it to be slower in the single threaded case.
Yonik Seeley (@yonik) (migrated from JIRA)
Attaching test that reads a file in different ways, either random access or serially, from a number of threads.
Yonik Seeley (@yonik) (migrated from JIRA)
Single-threaded random access performance of a fully cached 6.5 MB file on my home PC (WinXP), Java 6:
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=6518936 answer=81332126, ms=7781, MB/sec=167.5603649916463
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=6518936 answer=81332126, ms=9203, MB/sec=141.66980332500273
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=6518936 answer=81332126, ms=11672, MB/sec=111.70212474297463
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=1024 filelen=6518936 answer=81332126, ms=17328, MB/sec=75.2416435826408
Brian Pinkerton (migrated from JIRA)
Most of my workloads would benefit from removing the synchronization in FSIndexInput, so I took a closer look at this issue. I found exactly the opposite of Yonik's results on two platforms that I use frequently in production (Solaris and Linux), and by a significant margin. I even get the same behavior on the Mac, though I'm not running Java 6 there.
#1 SMP Tue Oct 18 18:39:27 EDT 2005 i686 i686 i386 GNU/Linux
config: impl=ChannelPread serial=false nThreads=200 iterations=10 bufsize=1024 filelen=10485760 answer=0, ms=88543, MB/sec=236.85124741650947
config: impl=ClassicFile serial=false nThreads=200 iterations=10 bufsize=1024 filelen=10485760 answer=0, ms=150560, MB/sec=139.29011689691816
config: impl=ChannelPread serial=false nThreads=200 iterations=10 bufsize=1024 filelen=10485760 answer=0, ms=39621, MB/sec=529.3031473208652
config: impl=ClassicFile serial=false nThreads=200 iterations=10 bufsize=1024 filelen=10485760 answer=0, ms=119057, MB/sec=176.14688762525515
Yonik Seeley (@yonik) (migrated from JIRA)
Brad, one possible difference is the number of threads we tested with. I tested single-threaded (nThreads=1) to see what kind of slowdown a single query might see.
A normal production system shouldn't see 200 concurrent running search threads unless it's just about to fall over, or unless it's one of those massive multi-core systems. After you pass a certain amount of parallelism, NIO can help.
Brian Pinkerton (migrated from JIRA)
Whoops; I should have paid more attention to the args. The results in the single-threaded case still favor pread, but by a slimmer margin:
Linux:
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760 answer=0, ms=9983, MB/sec=210.0723229490133
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760 answer=0, ms=9247, MB/sec=226.7926895209257
Solaris 10:
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760 answer=0, ms=7381, MB/sec=284.12843788104595
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760 answer=0, ms=6245, MB/sec=335.81297037630105
Mac OS X:
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760 answer=-914995, ms=19945, MB/sec=105.14675357232389
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760 answer=-914995, ms=26378, MB/sec=79.50382894836606
Doug Cutting (@cutting) (migrated from JIRA)
> Brad, [...]
That's Brian. And right, the difference in your tests is the number of threads.
Perhaps this is a case where one size will not fit all. MmapDirectory is fastest on 64-bit platforms with lots of threads, while good-old-FSDirectory has always been fastest for single-threaded access. Perhaps a PreadDirectory would be the Directory of choice for multi-threaded access of large indexes on 32-bit hardware? It would be useful to benchmark this patch against MmapDirectory, since they both remove synchronization.
Doug Cutting (@cutting) (migrated from JIRA)
My prior remarks were posted before I saw Brian's latest benchmarks.
While it would still be good to throw mmap into the mix, pread now looks like a strong contender for the one that might beat all. It works well on 32-bit hardware, it's unsynchronized, and it's fast. What's not to like?
Yonik Seeley (@yonik) (migrated from JIRA)
Weird... I'm still getting slower results from pread on WinXP. Can someone else verify on a windows box?
Yonik@spidey ~
$ c:/opt/jdk16/bin/java -server FileReadTest testfile ClassicFile false 1 200
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=9616000
answer=160759732, ms=14984, MB/sec=128.35024025627337
$ c:/opt/jdk16/bin/java -server FileReadTest testfile ClassicFile false 1 200
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=9616000
answer=160759732, ms=14640, MB/sec=131.36612021857923
$ c:/opt/jdk16/bin/java -server FileReadTest testfile ChannelPread false 1 200
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=9616000
answer=160759732, ms=21766, MB/sec=88.35798952494717
$ c:/opt/jdk16/bin/java -server FileReadTest testfile ChannelPread false 1 200
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=9616000
answer=160759732, ms=21718, MB/sec=88.55327378211622
$ c:/opt/jdk16/bin/java -version
java version "1.6.0_02"
Java(TM) SE Runtime Environment (build 1.6.0_02-b06)
Java HotSpot(TM) Client VM (build 1.6.0_02-b06, mixed mode)
robert engels (migrated from JIRA)
I sent this via email, but probably need to add to the thread...
I posted a bug on this to Sun a long while back. This is a KNOWN BUG on Windows.
NIO preads actually sync behind the scenes on some platforms. Using multiple file descriptors is much faster.
See bug http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734
Doug Cutting (@cutting) (migrated from JIRA)
So it looks like pread is ~50% slower on Windows, and ~5-25% faster on other platforms. Is that enough of a difference that it might be worth having FSDirectory use different implementations of FSIndexInput based on the value of Constants.WINDOWS (and perhaps JAVA_VERSION)?
Michael McCandless (@mikemccand) (migrated from JIRA)
> Is that enough of a difference that it might be worth having FSDirectory use different implementations of FSIndexInput based on the value of Constants.WINDOWS (and perhaps JAVA_VERSION)?
+1
I think having good out-of-the-box defaults is extremely important (most users won't tune), and given the substantial cross platform differences here I think we should conditionalize the defaults according to the platform.
robert engels (migrated from JIRA)
As an aside, if the Lucene people voted on the Java bug (and/or sent emails via the proper channels), hopefully the underlying bug can be fixed in the JVM.
In my opinion it is a serious one - ruins any performance gains of using NIO on files.
Yonik Seeley (@yonik) (migrated from JIRA)
Updated test that fixes some thread synchronization issues to ensure that the "answer" is the same for all methods.
Brian, in some of your tests the answer is "0"... is this because your test file consists of zeros (created via /dev/zero or equiv)? UNIX systems treat blocks of zeros differently than normal files (they are stored as holes). It shouldn't make too much of a difference in this case, but just to be sure, could you try with a real file?
Brian Pinkerton (migrated from JIRA)
Yeah, the file was full of zeroes. But I created the files w/o holes and was using filesystems that don't compress file contents. Just to be sure, though, I repeated the tests with a file with random contents; the results above still hold.
Brian Pinkerton (migrated from JIRA)
BTW, I think the performance win with Yonik's patch for some workloads could be far greater than what the simple benchmark illustrates. Sure, pread might be marginally faster. But the real win is avoiding synchronized access to the file.
I did some IO tracing a while back on one particular workload that is characterized by:
In this workload where each query hits each compound index, the locking in FSIndexInput means that a single rare query clobbers the response time for all queries. The requests to read cached data are serialized (fairly, even) with those that hit the disk. While we can't get rid of the rare queries, we can allow the common ones to proceed against cached data right away.
Michael McCandless (@mikemccand) (migrated from JIRA)
I ran Yonik's most recent FileReadTest.java on the platforms below, testing single-threaded random access for a fully cached 64 MB file.
I tested two Windows XP Pro machines and got opposite results from Yonik. Yonik, is your machine XP Home?
I'm showing ChannelTransfer to be much faster on all platforms except Windows Server 2003 R2 Enterprise x64 where it's about the same as ChannelPread and ChannelFile.
The ChannelTransfer test is giving the wrong checksum, but I think it's just a bug in how the checksum is computed (it's using "len", which with ChannelTransfer is just the chunk size written on each call to write). So I think the MB/sec is still correct.
Mac OS X 10.4 (Sun Java 1.5)
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=32565, MB/sec=412.15331797942576
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=19512, MB/sec=687.8727347273473
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=19492, MB/sec=688.5785347835009
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=147783, ms=16009, MB/sec=838.3892060715847
Linux 2.6.22.1 (Sun Java 1.5)
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=37879, MB/sec=354.33281765622115
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=21845, MB/sec=614.4093751430535
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=21902, MB/sec=612.8103734818737
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=147783, ms=15978, MB/sec=840.015821754913
Windows Server 2003 R2 Enterprise x64 (Sun java 1.6)
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=32703, MB/sec=410.4141149130049
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=23344, MB/sec=574.9559972583961
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=23329, MB/sec=575.3256804835183
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=147783, ms=23422, MB/sec=573.0412774314747
Windows XP Pro SP2, laptop (Sun Java 1.5)
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=71253, MB/sec=188.36782731955148
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=57463, MB/sec=233.57243443607192
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=58043, MB/sec=231.23844046655068
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=147783, ms=20039, MB/sec=669.7825640001995
Windows XP Pro SP2, older desktop (Sun Java 1.6)
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=53047, MB/sec=253.01662299470283
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=34047, MB/sec=394.2130819161747
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=-44611, ms=34078, MB/sec=393.8544750278772
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864 answer=147783, ms=18781, MB/sec=714.6463340610192
Michael McCandless (@mikemccand) (migrated from JIRA)
I also just ran a test with 4 threads, random access, on Linux 2.6.22.1:
config: impl=ClassicFile serial=false nThreads=4 iterations=200 bufsize=6518936 filelen=67108864 answer=-195110, ms=120856, MB/sec=444.22363142913883
config: impl=ChannelFile serial=false nThreads=4 iterations=200 bufsize=6518936 filelen=67108864 answer=-195110, ms=88272, MB/sec=608.2006887801341
config: impl=ChannelPread serial=false nThreads=4 iterations=200 bufsize=6518936 filelen=67108864 answer=-195110, ms=77672, MB/sec=691.2026367288084
config: impl=ChannelTransfer serial=false nThreads=4 iterations=200 bufsize=6518936 filelen=67108864 answer=594875, ms=38390, MB/sec=1398.465517061735
ChannelTransfer got even faster (scales up with added threads better).
Yonik Seeley (@yonik) (migrated from JIRA)
Mike, it looks like you are running with a bufsize of 6.5MB! Apologies for my hard-to-use benchmark program :-(
Yonik Seeley (@yonik) (migrated from JIRA)
I'll try fixing the transferTo test before anyone re-runs any tests.
Michael McCandless (@mikemccand) (migrated from JIRA)
Doh!! Woops :) I will rerun...
Yonik Seeley (@yonik) (migrated from JIRA)
OK, uploading latest version of the test that should fix ChannelTransfer (it's also slightly optimized to not create a new object per call).
Well, at least we've learned that printing out all the input params for benchmarking programs is good practice :-)
Michael McCandless (@mikemccand) (migrated from JIRA)
Thanks! I'll re-run.
> Well, at least we've learned that printing out all the input params for benchmarking programs is good practice :)
Yes indeed :)
Michael McCandless (@mikemccand) (migrated from JIRA)
OK my results on Win XP now agree with Yonik's.
On UNIX & OS X, ChannelPread is a bit (2-14%) better, but on windows it's quite a bit (31-34%) slower.
Win Server 2003 R2 Enterprise x64 (Sun Java 1.6):
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=68094, MB/sec=197.10654095808735
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=72594, MB/sec=184.88818359644048
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=98328, MB/sec=136.5000081360345
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=201563, MB/sec=66.58847506734867
Win XP Pro SP2, laptop (Sun Java 1.5):
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=47449, MB/sec=282.8673481000653
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=54899, MB/sec=244.4811890926975
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=71683, MB/sec=187.237877878995
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=149475, MB/sec=89.79275999330991
Linux 2.6.22.1 (Sun Java 1.5):
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=41162, MB/sec=326.0719304212623
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=53114, MB/sec=252.69745829724744
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=40226, MB/sec=333.65914582608264
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=59163, MB/sec=226.86092321214272
Mac OS X 10.4 (Sun Java 1.5):
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=85894, MB/sec=156.25972477705076
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=109939, MB/sec=122.08381738964336
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=75517, MB/sec=177.73180608339845
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=1024 filelen=67108864
answer=110480725, ms=130156, MB/sec=103.12066136021389
Testo Nakada (migrated from JIRA)
I think bufsize has a much bigger impact than the implementation. I found that a 64KB buffer size is at least 5-6 times faster than 1KB. Should we tune this parameter instead for maximum performance?
Jason Rutherglen (migrated from JIRA)
lucene-753.patch
Made NIOFSDirectory that inherits from FSDirectory and includes the patch.
Michael McCandless (@mikemccand) (migrated from JIRA)
Carrying forward from this thread:
Jason Rutherglen <jason.rutherglen@gmail.com> wrote:
> After thinking more about the pool of RandomAccessFiles, I think LUCENE-753 is the best solution. I am not sure how much work a pool of RandomAccessFiles would be, nor whether it creates more synchronization problems, and if it is only to benefit Windows it does not seem worthwhile.
It wasn't clear to me that pread would in fact perform better than letting each thread use its own private RandomAccessFile.
So I modified (attached) FileReadTest.java to add a new SeparateFile implementation, which opens a private RandomAccessFile per-thread and then just does "classic" seeks & reads on that file. Then I ran the test on 3 platforms (results below), using 4 threads.
The results are very interesting – using SeparateFile is always faster, especially so on WinXP Pro (115% faster than the next fastest, ClassicFile) but also surprisingly so on Linux (44% faster than the next fastest, ChannelPread). On Mac OS X it was 5% faster than ChannelPread. So on all platforms it's faster, when using multiple threads, to use separate files.
I don't have a Windows server class machine readily accessible so if someone could run on such a machine, and run on other machines (Solaris) to see if these results are reproducible, that'd be great.
This is a strong argument for some sort of pooling of RandomAccessFiles under FSDirectory, though the counterbalance is clearly added complexity. I think if we combined the two approaches (use separate RandomAccessFile objects per thread, managed by a pool, and then use the best mode: classic on Windows & channel pread on all others) we'd likely get the best performance yet.
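The SeparateFile approach boils down to giving each thread its own file descriptor. A minimal sketch using ThreadLocal (illustrative only; as noted later in the thread, it ignores the already-deleted-file problem a real Directory must handle):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class SeparateFileDemo {
    public static void main(String[] args) throws Exception {
        final Path tmp = Files.createTempFile("sep", ".bin");
        byte[] data = new byte[4096];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        Files.write(tmp, data);

        // One private RandomAccessFile per thread: seeks never contend, so no
        // lock guards the file pointer. (This sketch leaks the per-thread fds;
        // a real pool would track and close them.)
        final ThreadLocal<RandomAccessFile> files = ThreadLocal.withInitial(() -> {
            try { return new RandomAccessFile(tmp.toFile(), "r"); }
            catch (IOException e) { throw new RuntimeException(e); }
        });

        Runnable reader = () -> {
            try {
                RandomAccessFile raf = files.get();
                byte[] buf = new byte[16];
                raf.seek(100);
                raf.readFully(buf);
                if (buf[0] != 100) throw new AssertionError("bad read");
            } catch (IOException e) { throw new RuntimeException(e); }
        };
        Thread t1 = new Thread(reader), t2 = new Thread(reader);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("ok"); // both threads read correctly, unsynchronized
        Files.delete(tmp);
    }
}
```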
Mac OS X 10.5.3, single WD Velociraptor hard drive, Sun JRE 1.6.0_05
config: impl=ClassicFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
answer=-23909200, ms=151884, MB/sec=176.73715203708093
config: impl=SeparateFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
answer=-23909200, ms=97820, MB/sec=274.4177632386015
config: impl=ChannelPread serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
answer=-23909200, ms=103059, MB/sec=260.4677476008888
config: impl=ChannelFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
answer=-23909200, ms=176250, MB/sec=152.30380482269504
config: impl=ChannelTransfer serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
answer=-23909200, ms=365904, MB/sec=73.36226332589969
Linux 2.6.22.1, 6-drive RAID 5 array, Sun JRE 1.6.0_06
config: impl=ClassicFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
answer=-23909200, ms=75592, MB/sec=355.1109323737962
config: impl=SeparateFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
answer=-23909200, ms=35505, MB/sec=756.0497282072947
config: impl=ChannelPread serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
answer=-23909200, ms=51075, MB/sec=525.5711326480665
config: impl=ChannelFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
answer=-23909200, ms=95640, MB/sec=280.6727896277708
config: impl=ChannelTransfer serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
answer=-23909200, ms=93711, MB/sec=286.45031639828835
WIN XP PRO, laptop, Sun JRE 1.4.2_15:
config: impl=ClassicFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
answer=-23909200, ms=135349, MB/sec=198.32836297275932
config: impl=SeparateFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
answer=-23909200, ms=62970, MB/sec=426.2910211211688
config: impl=ChannelPread serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
answer=-23909200, ms=174606, MB/sec=153.73781886074937
config: impl=ChannelFile serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
answer=-23909200, ms=152171, MB/sec=176.4038193873997
config: impl=ChannelTransfer serial=true nThreads=4 iterations=100 bufsize=1024 filelen=67108864
answer=-23909200, ms=275603, MB/sec=97.39932293915524
Jason Rutherglen (migrated from JIRA)
Interesting results. The question would be, what would the algorithm for allocating RandomAccessFiles to which file look like? When would a new file open, when would a file be closed? If it is based on usage would it be based on the rate of calls to readInternal? This seems like an OS filesystem topic that maybe there is some standard algorithm for. How would the pool avoid the same synchronization issue given the default small buffer size of 1024? If there are 30 threads executing searches, there will not be 30 RandomAccessFiles per file so there is still contention over the limited number of RandomAccessFiles allocated.
Yonik Seeley (@yonik) (migrated from JIRA)
Added a PooledPread impl to FileReadTest, but at least on Windows it always seems slower than non-pooled. I suppose it might be because of the extra synchronization.
Michael McCandless (@mikemccand) (migrated from JIRA)
I think you have a small bug – minCount is initialized to 0 but should be something effectively infinite instead?
Yonik Seeley (@yonik) (migrated from JIRA)
Thanks Mike, after the bug is fixed, PooledPread is now faster on Windows when more than 1 thread is used.
Michael McCandless (@mikemccand) (migrated from JIRA)
OK I re-ran only PooledPread, SeparateFile and ChannelPread since they are the "leading contenders" on all platforms.
Also, I changed to serial=false.
Now the results are very close on all but windows, but on windows I'm seeing the opposite of what Yonik saw: PooledPread is slowest, and SeparateFile is fastest. But this is a laptop (Win XP Pro), and it is JRE 1.4. Also I ran with pool size == number of threads == 4.
Mac OS X 10.5.3, single WD Velociraptor hard drive, Sun JRE 1.6.0_05
config: impl=PooledPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
answer=-23830370, ms=120807, MB/sec=222.20190551871994
config: impl=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
answer=-23830326, ms=119641, MB/sec=224.36744594244448
config: impl=ChannelPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
answer=-23830370, ms=119217, MB/sec=225.1654176837196
Linux 2.6.22.1, 6-drive RAID 5 array, Sun JRE 1.6.0_06
config: impl=PooledPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
answer=-23830370, ms=52613, MB/sec=510.2074696367818
config: impl=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
answer=-23830370, ms=52715, MB/sec=509.22025230010433
config: impl=ChannelPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
answer=-23830370, ms=53792, MB/sec=499.0248661511005
WIN XP PRO, laptop, Sun JRE 1.4.2_15:
config: impl=PooledPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
answer=-23830370, ms=209956, MB/sec=127.85319590771401
config: impl=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
answer=-23830370, ms=89101, MB/sec=301.27098012367986
config: impl=ChannelPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=67108864
answer=-23830370, ms=184087, MB/sec=145.81988733587923
Brian Pinkerton (migrated from JIRA)
I was curious about the discrepancy between the ChannelPread implementation and the SeparateFile implementation that Yonik saw. At least on Mac OS X, the kernel implementation of read is virtually the same as pread, so there shouldn't be any appreciable performance difference unless the VM is doing something funny. Sure enough, the implementations of read() under RandomAccessFile and read() under FileChannel are totally different. The former relies on a buffer allocated either on the stack or by malloc, while the latter allocates a native buffer and copies the results to the original array.
Switching to a native buffer in the benchmark yields identical results for ChannelPread and SeparateFile on 1.5 and 1.6 on OS X. I'm attaching an implementation of ChannelPreadDirect that uses a native buffer.
This may be a moot point because any implementation inside Lucene needs to consume a byte[] and not a ByteBuffer, but at least it's informative.
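Brian's observation can be illustrated by reading into a direct buffer, which avoids the channel's internal native-buffer copy, at the cost of a final copy into the byte[] Lucene consumes (a sketch of the idea, not the attached ChannelPreadDirect):

```java
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class DirectBufferDemo {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("direct", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3, 4});
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
            FileChannel ch = raf.getChannel();
            // Direct buffer: the channel reads straight into native memory,
            // avoiding the extra internal copy it makes for heap buffers.
            ByteBuffer direct = ByteBuffer.allocateDirect(4);
            ch.read(direct, 0);
            direct.flip();
            // But filling the byte[] Lucene actually consumes still costs
            // one copy out of native memory.
            byte[] out = new byte[4];
            direct.get(out);
            System.out.println(Arrays.toString(out)); // prints "[1, 2, 3, 4]"
        }
        Files.delete(tmp);
    }
}
```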
Yonik Seeley (@yonik) (migrated from JIRA)
Here are some of my results with 4 threads and a pool size of 4 fds per file. Win XP on a Pentium4 w/ Java5_0_11 -server
config: impl=PooledPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=9616000
answer=322211190, ms=51891, MB/sec=74.12460735002217
config: impl=ChannelPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=9616000
answer=322211190, ms=71175, MB/sec=54.04144713733755
config: impl=ClassicFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=9616000
answer=322211190, ms=62699, MB/sec=61.34707092617107
config: impl=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=4 filelen=9616000
answer=322211410, ms=21324, MB/sec=180.37891577565185
Michael McCandless (@mikemccand) (migrated from JIRA)
OK, it's looking like SeparateFile is the best overall choice... it matches the best performance on Unix platforms and is very much in the lead on Windows.
It's somewhat surprising to me that after all this time, with these new IO APIs, the most naive approach (using a separate RandomAccessFile per thread) still yields the best performance. In fact, opening multiple IndexReaders to gain concurrency is doing just this.
Of course this is a synthetic benchmark. Actual IO with Lucene is somewhat different. EG it's a mix of serial (when iterating through a term's docs with no skipping) and somewhat random access (when retrieving term vectors or stored fields), and presumably a mix of hits & misses to the OS's IO cache. So until we try this out with a real index and real queries we won't know for sure.
> The question would be, what would the algorithm for allocating RandomAccessFiles to which file look like?
Ideally it would be based roughly on contention. EG a massive CFS file in your index should have a separate file per-thread, if there are not too many threads, whereas tiny CFS files in the index likely could share / synchronize on a single file.
I think it would have thread affinity (if the same thread wants the same file we give back the same RandomAccessFile that thread last used, if it's available).
> When would a new file open, when would a file be closed?
I think this should use reference counting. The first time Lucene calls FSDirectory.openInput on a given name, we must for-real open the file (Lucene relies on the OS protecting open files). Further opens on that file incRef it. Closes decRef it, and when the reference count gets to 0 we close it for real.
> If it is based on usage would it be based on the rate of calls to readInternal?
Fortunately, Lucene tends to call IndexInput.clone() when it wants to actively make use of a file.
So I think the pool could work something like this: FSIndexInput.clone would "check out" a file from the pool. The pool decides at that point to either return a SharedFile (which has locking per-read, like we do now), or a PrivateFile (which has no locking because you are the only thread currently using that file), based on some measure of contention plus some configuration of the limit of allowed open files.
One problem with this approach is I'm not sure clones are always closed, since they are currently very lightweight and can rely on GC to reclaim them.
An alternative approach would be to sync() on every block read (1024 bytes by default now), find a file to use, and use it, but I fear that would have poor performance.
In fact, if we build this pool, we can again try all these alternative IO APIs, maybe even leaving that choice to the Lucene user as "advanced tuning".
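The clone-time checkout described above might look something like this. Purely a sketch under the assumptions in this comment: DescriptorPool, SharedFile and PrivateFile are illustrative names, and the "contention measure" is reduced to a simple open-file budget.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

interface FileAccess {
    int read(long pos, byte[] buf, int off, int len) throws IOException;
}

// Per-read locking, as FSIndexInput does today: safe to share across threads.
class SharedFile implements FileAccess {
    private final RandomAccessFile raf;
    SharedFile(RandomAccessFile raf) { this.raf = raf; }
    public synchronized int read(long pos, byte[] buf, int off, int len) throws IOException {
        raf.seek(pos);
        return raf.read(buf, off, len);
    }
}

// No locking: the pool guarantees only one thread uses this descriptor.
class PrivateFile implements FileAccess {
    private final RandomAccessFile raf;
    PrivateFile(RandomAccessFile raf) { this.raf = raf; }
    public int read(long pos, byte[] buf, int off, int len) throws IOException {
        raf.seek(pos);
        return raf.read(buf, off, len);
    }
}

class DescriptorPool {
    private final String name;
    private final RandomAccessFile shared;
    private final int maxOpenFiles;  // configured limit of allowed open files
    private int openFiles = 1;       // the shared descriptor counts as one

    DescriptorPool(String name, int maxOpenFiles) throws IOException {
        this.name = name;
        this.maxOpenFiles = maxOpenFiles;
        this.shared = new RandomAccessFile(name, "r");
    }

    // Called from a hypothetical FSIndexInput.clone(): hand out a private
    // descriptor while the budget allows, otherwise fall back to sharing.
    synchronized FileAccess checkout() throws IOException {
        if (openFiles < maxOpenFiles) {
            openFiles++;
            return new PrivateFile(new RandomAccessFile(name, "r"));
        }
        return new SharedFile(shared);
    }
}
```

A real implementation would also need check-in on close (and some answer for clones that are never closed), plus the deleted-file fallback discussed later in this thread.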
robert engels (migrated from JIRA)
As I stated quite a while ago, this has been a long accepted bug in the JDK.
See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734
It was filed and accepted over 3 years ago.
The problem is that the pread performs an unnecessary lock on the file descriptor, instead of using Windows "overlapped" reads.
robert engels (migrated from JIRA)
The point being - please vote for this issue so it can be fixed properly. It is really a trivial fix, but it needs to be done by SUN.
Yonik Seeley (@yonik) (migrated from JIRA)
OK it's looking like SeparateFile is the best overall choice... it matches the best performance on Unix platforms and is very much the lead on Windows.
The other implementations are fully-featured though (they could be used in lucene w/ extra synchronization, etc). SeparateFile (opening a new file descriptor per reader) is not a real implementation that could be used... it's more of a theoretical maximum IMO. Also remember that you can't open a new fd on demand since the file might already be deleted. We would need a real PooledClassicFile implementation (like PooledPread).
On non-Windows platforms it looks like ChannelPread is probably the right choice: near-maximum performance and minimal fd usage.
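For reference, the "ChannelPread" approach boils down to FileChannel's positional read, which takes an explicit file position and so needs neither a seek nor external synchronization. A minimal sketch (the helper name is made up; the loop handles short reads, which FileChannel is allowed to return):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

class ChannelPreadDemo {
    // Read up to len bytes at fileposition without touching the channel's
    // own position; returns the number of bytes actually read.
    static int pread(FileChannel ch, byte[] b, int offset, int len, long fileposition)
            throws IOException {
        ByteBuffer bb = ByteBuffer.wrap(b, offset, len);
        int total = 0;
        while (bb.hasRemaining()) {
            int n = ch.read(bb, fileposition + total);
            if (n < 0) break;  // hit EOF
            total += n;
        }
        return total;
    }
}
```

Note that, as discussed above, on Windows this positional read may still lock the file descriptor internally, so it is not contention-free there.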
Jason Rutherglen (migrated from JIRA)
Core2Duo, Windows XP, JDK 1.5.15. With PooledPread at 4 threads and a pool size of 2, performance does not compare well to SeparateFile. PooledPread at 30 threads does not improve appreciably over ClassicFile. If there were 30 threads, how many RandomAccessFiles would there need to be to make a noticeable impact? The problem I see with the pooled implementation is setting the global file descriptor limit properly; will the user set this? There would almost need to be a native check to see whether what the user is trying to do is possible.
The results indicate there is significant contention in the pool code. The previous tests used a pool size equal to the number of threads, which is probably not how most production systems look, at least not the SOLR installations I've worked on. In SOLR the web request thread is the thread that executes the search, so the number of threads is determined by the J2EE server and can be quite high, unless the assumption is that the system is configured for an unusually high number of file descriptors.
There should probably be a MMapDirectory test as well.
config: impl=PooledPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=18110448
answer=53797223, ms=32715, MB/sec=221.4329573590096
config: impl=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=18110448
answer=53797223, ms=18687, MB/sec=387.6587574249478
config: impl=SeparateFile serial=false nThreads=30 iterations=100 bufsize=1024 poolsize=2 filelen=18110448
answer=403087371, ms=137871, MB/sec=394.0737646060448
config: impl=PooledPread serial=false nThreads=30 iterations=100 bufsize=1024 poolsize=2 filelen=18110448
answer=403087487, ms=526504, MB/sec=103.19265190767781
config: impl=ChannelPread serial=false nThreads=30 iterations=100 bufsize=1024 poolsize=2 filelen=18110448
answer=403087487, ms=624291, MB/sec=87.02887595688549
config: impl=ClassicFile serial=false nThreads=30 iterations=100 bufsize=1024 poolsize=2 filelen=18110448
answer=403087487, ms=587430, MB/sec=92.48990347786119
config: impl=PooledPread serial=false nThreads=30 iterations=100 bufsize=1024 poolsize=4 filelen=18110448
answer=403087487, ms=552971, MB/sec=98.25351419875544
Michael McCandless (@mikemccand) (migrated from JIRA)
SeparateFile (opening a new file descriptor per reader) is not a real implementation that could be used... it's more of a theoretical maximum IMO. Also remember that you can't open a new fd on demand since the file might already be deleted. We would need a real PooledClassicFile implementation (like PooledPread).
True, we'd have to make a real pool, but I'd think we want the sync() to be on cloning and not on every read. I think Lucene's usage of the open files (clones are made & used up quickly and closed) would work well with that approach. I think at this point we should build out an underlying pool and then test all of these different approaches under the pool.
And yes, we cannot just open a new fd on demand if the file has been deleted. But I'm thinking that may not matter in practice: if the pool wants to open a new fd, it can attempt to do so, and if the file was deleted it must then return a shared-access wrapper to the fd it already has open. Large segments are where the contention will be, and large segments are not often deleted. Plus, people tend to open new readers when such a large change has taken place in the index.
On non-windows it looks like ChannelPread is probably the right choice.. near max performance and min fd usage
But on Linux I saw 44% speedup for serial=true case with 4 threads using SeparateFile vs ChannelPread, which I was very surprised by. But then again it's synthetic so it may not matter in real Lucene searches.
Jason Rutherglen (migrated from JIRA)
lucene-753.patch
Added javadoc and removed unnecessary NIOFSIndexOutput class.
Yonik Seeley (@yonik) (migrated from JIRA)
(clones are made & used up quickly and closed)
IIRC, clones are often not closed at all. And for term expanding queries, you can get a lot of them all at once.
And yes we cannot just open a new fd on demand if the file has been deleted. But I'm thinking that may not matter in practice. Ie if the pool wants to open a new fd, it can attempt to do so, and if the file was deleted it must then return a shared access wrapper to the fd it already has open.
At first blush, sounds a bit too complex for the benefits.
But on Linux I saw 44% speedup for serial=true case with 4 threads using SeparateFile vs ChannelPread, which I was very surprised by.
In the serial case, there are half the system calls (no seek). When both implementations make a single system call, all the extra code and complexity that Sun threw into FileChannel comes into play. Compare that with RandomAccessFile.read(), which drops down to a native call and presumably does just the read with little overhead. I wish Sun would just add a RandomAccessFile.read with a file position.
If access will be truly serial sometimes, larger buffers would help with that larger read() setup cost.
As suggested by Doug, we could use NIO pread to avoid synchronization on the underlying file. This could mitigate any MT performance drop caused by reducing the number of files in the index format.
Migrated from LUCENE-753 by Yonik Seeley (@yonik), 5 votes, resolved Jan 25 2011 Attachments: FileReadTest.java (versions: 8), FSDirectoryPool.patch, FSIndexInput.patch (versions: 2), lucene-753.patch (versions: 2), LUCENE-753.patch (versions: 5)