Closed asfimport closed 13 years ago
Michael McCandless (@mikemccand) (migrated from JIRA)
I'm hitting this when trying to convert the 20090306 Wikipedia export to a line file:
Exception in thread "Thread-0" java.lang.ArrayIndexOutOfBoundsException: 2048
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker$Parser.run(EnwikiDocMaker.java:77)
at java.lang.Thread.run(Thread.java:619)
From this:
http://marc.info/?l=xerces-j-user&m=120452263925040&w=2
It sounds likely that an upgrade to xerces 2.9.1 will fix it. I'm testing it now... if it fixes the issue, I'll commit the upgrade to contrib/benchmark.
Michael McCandless (@mikemccand) (migrated from JIRA)
So, after upgrading to xerces 2.9.1, I then hit this error:
Exception in thread "Thread-0" java.lang.RuntimeException: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
at org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker$Parser.run(EnwikiDocMaker.java:101)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker$Parser.run(EnwikiDocMaker.java:77)
... 1 more
It appears I'm hitting XERCESJ-1257 which, hideously, is still open. Worse, we are already doing the suggested workaround at the bottom of the issue. Hmm.
Michael McCandless (@mikemccand) (migrated from JIRA)
After some iterations on XERCESJ-1257, I managed to apply the original patch on that issue (thank you Robert!), which indeed allows me to process all of Wikipedia's XML export. I'll commit a recompiled xerces 2.9.1 jar with that patch shortly.
Shai Erera (@shaie) (migrated from JIRA)
I wonder why EnwikiDocMaker extends LineDocMaker. The latter assumes the input is given in lines, while the former assumes an XML format ... so why the inheritance?
This affects EnwikiDocMaker today: LDM.openFile() instantiates a BufferedReader which EDM never uses. Is it because of DocState? Perhaps some of the logic in LDM can be pulled up to BasicDocMaker, or into a new abstract DocStateDocMaker? If there is a good reason for the inheritance, then maybe introduce a protected member useReader and set it to false in EDM? Or override openFile() in EDM and not instantiate the reader, as in the sketch below?
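A minimal sketch of that last option, assuming openFile() is overridable and that fileIS/fileName members exist (the names here are illustrative, not from an actual patch):

protected synchronized void openFile() {
  try {
    // EnwikiDocMaker hands the raw InputStream to the SAX parser,
    // so LineDocMaker's BufferedReader is never needed here.
    fileIS = new FileInputStream(fileName);
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
}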
Also, somewhat unrelated to this issue, but I found two issues in LDM:
Michael McCandless (@mikemccand) (migrated from JIRA)
I wonder why EnwikiDocMaker extends LineDocMaker?
I'm not sure... I agree it'd be cleaner to not subclass LineDocMaker, and factor out DocState into BasicDocMaker.
Should we first check forever, and only if it's true call openFile()?
Yes, let's fix that!
resetInputs() reads the docs.file property and throws an exception if it's not set. Shouldn't this code belong to setConfig?
I think it should, but I vaguely remember some odd reason why I put it in resetInputs... try moving it and see?
Shai Erera (@shaie) (migrated from JIRA)
resetInputs() is called from PerfRunData's ctor (as is setConfig), but also from ResetInputsTask. Unless it is possible to change the file name in the middle of execution, I see no reason not to move it to setConfig.
I'll move it to setConfig and also switch to throwing IllegalArgumentException instead of RuntimeException.
Another change I'd like to make is removing the while(true) in makeDoc. All it does is read one line and break, unless that line is null, in which case it reopens the file and reads a line again. In that case, which happens only after all docs have been consumed, and if forever is set to true, we can just call makeDoc again and avoid the one-instruction loop on every makeDoc call (see the sketch below).
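A minimal sketch of that change, with names loosely based on LineDocMaker (createDocFromLine is a hypothetical helper, not the actual code):

public Document makeDocument() throws Exception {
  String line = reader.readLine();
  if (line == null) {
    if (!forever) {
      throw new NoMoreDataException();
    }
    openFile();            // reopen at the start of the file
    return makeDocument(); // read again instead of spinning in while(true)
  }
  // Hypothetical helper that splits the tab-separated line into
  // title/date/body and fills the reusable Document.
  return createDocFromLine(line);
}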
Michael McCandless (@mikemccand) (migrated from JIRA)
OK sounds good!
Shai Erera (@shaie) (migrated from JIRA)
Before I post a patch, I wanted to test reading the 20090306 enwiki dump and writing it as a one-line file, all using the bz2 in/out streams. After 9 hours and 2881000 documents (!!!), I hit the following exception:
Exception in thread "Thread-1" java.lang.ArrayIndexOutOfBoundsException
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker$Parser.run(EnwikiDocMaker.java:76)
at java.lang.Thread.run(Thread.java:810)
Same exception as Mike hit, only from a different method. I'm using the latest xerces jar Mike committed. I'm beginning to think this enwiki dump is jinxed :)
Anyway, I'll post the patch shortly and run on the 20070527 version to verify.
Shai Erera (@shaie) (migrated from JIRA)
The patch touches LineDocMaker, EnwikiDocMaker and WriteLineDocTask. It also puts ant-1.7.1 in benchmark/lib.
Uwe Schindler (@uschindler) (migrated from JIRA)
Do you know http://commons.apache.org/compress/ ?
It is a commons project that replicates the internals of Ant and other projects for general usage. It is not yet released, but is available as snapshot jars. TIKA uses it, too. It also contains a BZIPInputStream. I would prefer this instead of polluting the classpath with a full ant distribution.
Michael McCandless (@mikemccand) (migrated from JIRA)
Odd – with that patched xerces JAR I was able to parse the full XML. Is it possible your bunzipping code is messing up the XML?
Shai, why did it take 9 hours to get to that exception? Is bunzip that slow? That seems crazy. (Or are you running tests on a snail-of-a-machine? ;) )
Can you run only your bunzip code and confirm it produces an XML file that's identical to what bunzip2 from the command line produces? (And measure how long it takes vs the command line).
Shai Erera (@shaie) (migrated from JIRA)
This is how I wrap the FileInputStream for bzip:
if (doBzipCompression) {
  // According to CBZip2InputStream's documentation, we should first
  // consume the first two file header chars ('B' and 'Z'), as well as
  // wrap the underlying stream with a BufferedInputStream, since CBZip2IS
  // uses the read() method exclusively.
  fileIS = new BufferedInputStream(fileIS, READER_BUFFER_BYTES);
  fileIS.read(); fileIS.read();
  fileIS = new CBZip2InputStream(fileIS);
}
Is it possible your bunzipping code is messing up the XML?
I successfully read the file and compressed it with Java's GZIP classes; however, I did not attempt to parse the XML itself. Did you run EnwikiDocMaker on the actual XML or on the bz2 archive? The 20070527 run should end soon (I hope) - it has reached 2.2M documents, so if it doesn't fail, I guess the bzip wrapping is very unlikely to affect the XML parsing.
Shai, why did it take 9 hours to get to that exception? Is bunzip that slow? That seems crazy.
I ran the test on my TP 60, which is not a snail-of-a-machine, but definitely not a strong server either. You can download the patch and the jar and try it out on your machine. But yes, I did notice bzip is very slow compared to gzip; however, it has a better compression ratio. I do want to measure the times, though, to give more accurate numbers, but in order to do that I need to finish a successful run first.
Can you run only your bunzip code and confirm it ...
I would have done that, but the output XML is 17GB, and doing it twice is not an option on my TP. That's why I wanted this bzip thing in the first place :) I'll try to do that with the 20070527 version, which hopefully will be ~half the size ...
Michael McCandless (@mikemccand) (migrated from JIRA)
Did you run EnwikiDocMaker on the actual XML or the bz2 archive?
I downloaded the bz2 2008036 Wikipedia export, ran bunzip2 on the command line, then had to patch Xerces JAR to get it to parse the XML successfully.
I run the test on my TP 60, which is not a snail-of-a-machine, but definitely not a strong server.
Hmm – I wonder how long bunzip2 would take on the TP 60. Time to upgrade ;) Get yourself an X25 SSD!
I would have done that, but the output XML is 17GB, and doing it twice is not an option on my TP. That's why I wanted this bzip thing in the first place
Ahh OK :)
Shai Erera (@shaie) (migrated from JIRA)
I downloaded the bz2 2008036
I'm almost sure it's a typo, but just to verify - did you download the 20090306 (enwiki-20090306-pages-articles.xml.bz2), or 2008036?
Anyway, I think I've found a problem. The javadocs document that the InputStream version uses read() exclusively, but say nothing about the OutputStream version. I read the code and noticed it always calls write() and never uses the array version. So I wrapped the FOS with a BOS (bufSize=64k) and then with the BZOS (see the sketch at the end of this comment). I did a short test, reading 2000 records from the 20070527 file, before and after the change:
Num Docs | Before | After | % improvement
---|---|---|---
2000 | 106s | 30s | 72
I think that if that improvement is stable, then the 9-hour run should drop to ~3 hours, which seems right. I didn't measure the time to unzip the file using WinRAR (the first time I tried it), but it took a couple of hours.
Once the current run completes, I'll kick off a new one with that code change and note the time difference. I'm eager to see it speed up, but I want to complete a successful run first :)
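For reference, a minimal sketch of the wrapping just described, using Ant's bzip2 classes (the buffer-size constant is illustrative):

OutputStream fileOS = new FileOutputStream(fileName);
// Buffer the target stream, since CBZip2OutputStream calls write(int)
// for every single byte.
fileOS = new BufferedOutputStream(fileOS, 64 * 1024);
// Ant's CBZip2OutputStream does not emit the two-char file header
// itself; the caller must write it.
fileOS.write('B');
fileOS.write('Z');
fileOS = new CBZip2OutputStream(fileOS);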
Shai Erera (@shaie) (migrated from JIRA)
Another thing I noticed is that WriteLineDocTask calls flush() after every document it writes. Is there any reason to do that? We use a BufferedWriter, and calling flush() after every document is a bit expensive, I think. I quickly measured the same 2000-document run and it finished in 28 seconds - a 7% improvement over the 'after' run and 74% over the 'before'. So if there's a good reason we can keep it (the performance gain is not that high), but otherwise I think we should remove it and count on PerfTask.close() being called at the end of the run (perhaps the absence of close() was the reason to call flush() in the first place?).
Michael McCandless (@mikemccand) (migrated from JIRA)
I'm almost sure it's a typo, but just to verify - did you download the 20090306 (enwiki-20090306-pages-articles.xml.bz2), or 2008036?
Sorry I meant 20090306.
I did a short test, reading 2000 records from the 20070527 file, before and after the change:
Excellent!
Another thing I noticed is that WriteLineDocTask calls flush() after every document it writes. Any reason to do it?
Hmm that should not be needed; I'd say remove it? But, implement close() to actually close the stream?
Shai Erera (@shaie) (migrated from JIRA)
But, implement close() to actually close the stream?
Already did. I had to, because otherwise the bzip file wasn't sealed properly (that's why I started the other thread about tracking task resources). It already exists in the attached patch.
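For context, roughly what such a close() looks like (assuming a lineFileOut writer member and that PerfTask declares close() throws Exception; the actual patch may differ):

public void close() throws Exception {
  // Closing (rather than just flushing) the writer finishes the
  // underlying bzip stream, writing its end-of-stream data.
  lineFileOut.close();
  super.close();
}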
I'm finishing a run with the updated code (wrapping w/ BOS), so once that finishes, I'll post an updated patch and some numbers.
Michael McCandless (@mikemccand) (migrated from JIRA)
Already did, I had to because otherwise the bzip file wasn't sealed properly (that's why I started the other thread about tracking task resources). It already exists in the attached patch.
Oh yeah right, I already forgot. Feels so long ago ;)
Shai Erera (@shaie) (migrated from JIRA)
Here are some numbers:
Overall, I think the performance of the BZIP classes is reasonable. Most of the time spent in the algorithm goes to compressing the data, which is usually done only once. The result is a 2.5GB enwiki file compressed to a 2.31GB one-line file (8.5GB of uncompressed content).
I compared the time it takes to read 100k lines from the compressed and un-compressed one-line file: compressed - 2.26m, un-compressed - 1.36m (i.e., reading compressed is ~66% slower). The difference is significant; however, I'm not sure how much it matters for the overall process (i.e., reading the documents and indexing them). On my machine it would take 1.1 hours just to read the data, but I'm sure indexing will take longer, and the indexing time is the same whether we read the data from a bzip archive or not.
I'll attach the patch shortly; I think overall this is a good addition. It is off by default and configurable, so if someone doesn't care about disk space, they can always run the indexing algorithm on an un-compressed one-line file.
Shai Erera (@shaie) (migrated from JIRA)
Patch includes:
There is an open question regarding the ant-1.7.1.jar dependency. Uwe mentioned the commons Compress project, which handles the bzip format (as well as others). I took a look and found no place to download a jar, and this looks like a 'young' project with very little documentation. This is not to say the code is of low quality or not to be trusted; it's just that I prefer the ant dependency, at least until this project matures. And anyway, I guess everyone who uses Lucene has Ant on their system, so this doesn't look like a major dependency.
However, if you think otherwise, then we should get a jar from there (checking out the code and building it manually is the only way I see, but please correct me if I'm wrong), adapt the code to use it, do the perf measurements again, etc.
Shai Erera (@shaie) (migrated from JIRA)
BTW, the enhancements to EnwikiDocMaker yielded another 2% improvement in the process of converting the enwiki file to a one-line file. Just an FYI. I'm basically waiting with 1595 (refactoring of benchmark) until this one is committed, so the sooner the better ;)
Michael McCandless (@mikemccand) (migrated from JIRA)
Should we consider using Compress from Apache Commons (from Uwe's comment above) instead of the full ant jar?
I'm basically waiting with 1595 (refactoring of benchmark) until this one is committed, so the sooner the better
Does this issue depend on #2669?
Uwe Schindler (@uschindler) (migrated from JIRA)
The problem is that the project is currently moving to a commons top-level project. The SVN paths changed, but the website was not updated, and so on. The snapshot jars are not accessible at the moment. I could quickly build a JAR and attach it here. To get the code running, you only have to change the package imports. Ideally one would use the factory to create the decompressor (and then one does not need to skip the 2 bytes with "BZ"). Uwe
Shai Erera (@shaie) (migrated from JIRA)
Does this issue depend on #2669?
No, the other way around. Well ... it's not an actual dependency, just that 1595 will touch a lot of files, and I want to minimize the noise of working on two issues that touch the same files simultaneously (1595 will touch all the files this one touches). It's just a matter of convenience ...
Besides, I don't see what else can be done as part of this issue. The performance is reasonable, and the code is quite simple. The patch includes some more enhancements to those files that are unrelated to bzip per se, but are still required.
BTW, I successfully executed indexLineFile.alg on the 20070527 one-line bz2 file and the overall indexing process ended in 1h, which seems reasonable to me.
Regarding Apache Compress, I asked the same question, so it's not fair to return it with a question ;). I don't think we should decide that now. It can be changed in 1595 if we think Compress is the better approach. Personally I prefer the ant jar, even though I realize it's adding a large dependency for just 3-4 classes ...
Shai Erera (@shaie) (migrated from JIRA)
Uwe, if you can attach the jar here, I can make the necessary code changes and run some tests again. We can then decide based on whether it works with the Compress classes or not.
Uwe Schindler (@uschindler) (migrated from JIRA)
Here is the latest snapshot build of commons compress. All tests passed through an "mvn install" run. About the initial "BZh" bytes: the javadocs still say they should be read before opening the stream, but the examples on the website and the BZip2Decompressor code do this:
private void init() throws IOException {
    if (null == in) {
        throw new IOException("No InputStream");
    }
    if (in.available() == 0) {
        throw new IOException("Empty InputStream");
    }
    checkMagicChar('B', "first");
    checkMagicChar('Z', "second");
    checkMagicChar('h', "third");
}
So I think the reading of the initial two bytes can be left out. If something is wrong, this class will throw an IOException.
Here is some usage: http://wiki.apache.org/commons/Compress (it shows that decompressing a bzip2 file does not need to skip the header), and here are the javadocs: http://commons.apache.org/compress/apidocs/index.html
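For illustration, a minimal sketch of the factory approach Uwe mentions, assuming the commons-compress snapshot exposes org.apache.commons.compress.compressors.CompressorStreamFactory (names not taken from the actual patch):

InputStream fileIS = new BufferedInputStream(new FileInputStream(fileName));
// The stream verifies the magic bytes itself, so there is no need to
// consume the leading 'B' and 'Z' chars as with Ant's CBZip2InputStream.
fileIS = new CompressorStreamFactory().createCompressorInputStream("bzip2", fileIS);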
Shai Erera (@shaie) (migrated from JIRA)
OK, I'm convinced. I moved to commons-compress and it works great. The jar is smaller, and it is the more logical dependency. Since this project is still young we should expect changes, which is good, since it means the In/Out compressing streams can actually be improved to use more efficient methods, such as read(byte[]) and write(byte[]).
Michael McCandless (@mikemccand) (migrated from JIRA)
Patch looks good!
Could you add a test case that, e.g., writes a bzip'd line file, then reads it back & indexes it, or something along those lines?
Also: should we make "use bzip" pay attention to the file suffix when defaulting itself? I.e., if I explicitly specify "bzip.compression" then listen to me, but if I didn't specify it and my line file source ends with .bz2, default it to true? (And likewise for WriteLineDocTask.)
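Something like this minimal sketch, assuming Config has a boolean-valued get(name, default) (the property names are the ones discussed in this issue):

String fileName = config.get("docs.file", null);
// An explicit "bzip.compression" setting wins; otherwise default from
// the file suffix.
boolean bzipDefault = fileName != null && fileName.endsWith(".bz2");
boolean doBzip = config.get("bzip.compression", bzipDefault);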
Uwe Schindler (@uschindler) (migrated from JIRA)
I created my first bug report for Compress handling the inconsistency in javadocs and the compressor part with the Bzip2 header (compression does not add header, decompression needs header): COMPRESS-69
Shai Erera (@shaie) (migrated from JIRA)
argh, you beat me to it - I planned to do so myself :) For some reason their OutputStream has the file headers commented out, with a comment saying "this is added by the caller", yet their InputStream reads them ... strange. Anyway, once that's fixed and we upgrade to a proper jar, the unit test I am working on now will fail, and it will remind us to remove writing the headers in WriteLineDocTask.
Mike - I am working on the unit test as well as defaulting by extension. I hope a patch will be available sometime later today.
Uwe Schindler (@uschindler) (migrated from JIRA)
It is fixed now, including the JavaDocs: COMPRESS-69
Shai Erera (@shaie) (migrated from JIRA)
I updated the code from SVN, but I still see wrong javadocs: in the class javadocs, for both classes, the first line still says "(without file headers)". Also, Bzip2TestCase has an xtestBzipCreation() - the 'x' prevents this test from running under JUnit - is that intentional? I removed the 'x' and the test passes.
Uwe Schindler (@uschindler) (migrated from JIRA)
I added a comment to COMPRESS-69:
you forgot to enable the test again...
He had disabled the test (he added the encode/decode test right after opening the issue, because of my comment about the missing test) because it failed until he had a solution.
Uwe Schindler (@uschindler) (migrated from JIRA)
Now it's really fixed: compression and decompression work the same way, the test case is enabled, and the javadocs are fixed. That was really fast issue fixing - congratulations to COMPRESS :-)
Shai Erera (@shaie) (migrated from JIRA)
Great! Uwe, can you please update the jar in this issue? I will make sure the test passes with it.
Uwe Schindler (@uschindler) (migrated from JIRA)
Here it is. I thought you had checked it out, too, and created a JAR yourself. I have not done anything else - it's the (renamed) JAR file from the "target" dir after "mvn install".
Shai Erera (@shaie) (migrated from JIRA)
Sorry about that. I didn't know what to do with the pom.xml. Given your comment above, I'll install maven and use it next time :)
Shai Erera (@shaie) (migrated from JIRA)
Patch includes:
Michael McCandless (@mikemccand) (migrated from JIRA)
I had some trouble w/ the patch...
First, I had to edit contrib/benchmark's build.xml to add the compress JAR onto the classpath (things wouldn't compile otherwise).
Then I see failures in TestPerfTasksParse, eg:
[junit] java.lang.Exception: Error: cannot understand algorithm!
[junit] at org.apache.lucene.benchmark.byTask.Benchmark.<init>(Benchmark.java:63)
[junit] at org.apache.lucene.benchmark.byTask.TestPerfTasksParse.doTestAllTasksSimpleParse(TestPerfTasksParse.java:171)
[junit] at org.apache.lucene.benchmark.byTask.TestPerfTasksParse.testAllTasksSimpleParse(TestPerfTasksParse.java:140)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
[junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at junit.framework.TestCase.runTest(TestCase.java:164)
[junit] at junit.framework.TestCase.runBare(TestCase.java:130)
[junit] at junit.framework.TestResult$1.protect(TestResult.java:106)
[junit] at junit.framework.TestResult.runProtected(TestResult.java:124)
[junit] at junit.framework.TestResult.run(TestResult.java:109)
[junit] at junit.framework.TestCase.run(TestCase.java:120)
[junit] at junit.framework.TestSuite.runTest(TestSuite.java:230)
[junit] at junit.framework.TestSuite.run(TestSuite.java:225)
[junit] at org.junit.internal.runners.OldTestClassRunner.run(OldTestClassRunner.java:35)
[junit] at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:32)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:421)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:912)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:766)
[junit] Caused by: java.lang.reflect.InvocationTargetException
[junit] at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[junit] at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
[junit] at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
[junit] at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
[junit] at org.apache.lucene.benchmark.byTask.utils.Algorithm.<init>(Algorithm.java:69)
[junit] at org.apache.lucene.benchmark.byTask.Benchmark.<init>(Benchmark.java:61)
[junit] ... 19 more
[junit] Caused by: java.lang.IllegalArgumentException: line.file.out must be set
[junit] at org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTask.<init>(WriteLineDocTask.java:73)
[junit] ... 25 more
And the new LineDocMakerTest fails with this:
[junit] Testcase: testBZip2WithBzipCompressionDisabled(org.apache.lucene.benchmark.byTask.feeds.LineDocMakerTest): FAILED
[junit] expected:<1> but was:<0>
[junit] junit.framework.AssertionFailedError: expected:<1> but was:<0>
[junit] at org.apache.lucene.benchmark.byTask.feeds.LineDocMakerTest.doIndexAndSearchTest(LineDocMakerTest.java:96)
[junit] at org.apache.lucene.benchmark.byTask.feeds.LineDocMakerTest.testBZip2WithBzipCompressionDisabled(LineDocMakerTest.java:119)
WriteLineDocTest shows a similar failure. Not sure what's up...
Shai Erera (@shaie) (migrated from JIRA)
That's strange ... About build.xml, I think the problem lies in line 110, where the classpath lists explicit jars. I changed it to:
<path id="classpath">
  <pathelement path="${common.dir}/build/classes/java"/>
  <pathelement path="${common.dir}/build/classes/demo"/>
  <pathelement path="${common.dir}/build/contrib/highlighter/classes/java"/>
  <fileset dir="lib">
    <include name="**/*.jar"/>
  </fileset>
</path>
and it compiled successfully. I think this change is good, since it will prevent such problems in the future (in case more dependencies are added).
About the test failures - they pass for me in eclipse but fail in Ant. I believe I know the reason: previously, WriteLineDocTask's ctor logic was in its setUp method. I moved it to the ctor, since setUp is called for every document, and the initialization there did not seem right to me. The "line.file.out" property is indeed mandatory, hence the exception. The reason it doesn't fail in eclipse is that this task is not explicitly defined in findTasks(), and I don't have the "tasks.dir" env variable defined. As soon as I add this line:
tsks.add("WriteLineDoc");
to findTasks(), the test fails.
I see several ways to solve it:
Michael McCandless (@mikemccand) (migrated from JIRA)
I think this change is good, since it will prevent such problems in the future (in case more dependencies are added).
That sounds good.
I agree WriteLineDocTask should pull its config in ctor, not setUp.
But: I don't think WriteLineDocTask should be created when it's not going to be used, i.e. the TestPerfTasksParse.doTestAllTasksSimpleParse seems wrong?
Shai Erera (@shaie) (migrated from JIRA)
Not sure what you mean. The test doesn't actually use any Task; it just attempts to parse algorithm texts with those tasks defined. Do you suggest we exclude WriteLineDocTask from the test? Perhaps we can wrap the new Benchmark() call with a try-catch on IAE and log such tests but not fail? That way, if a certain task has mandatory properties, it wouldn't fail the test ... Another option is to define, for each tested task, mandatory properties in addition to the common ones used for all tasks ...
Unless I misunderstand you, I don't see why this test is wrong.
Michael McCandless (@mikemccand) (migrated from JIRA)
The test seems to assume you can take any Task in the source tree, and make an alg that simply creates that task.
I think that assumption is in fact wrong, because tasks like WriteLineDocTask indeed require certain configuration (line.file.out) to be set, and the test can't know that. Other tasks in the future will presumably hit the same issue.
Also, thinking about the test, I think it doesn't add much value? Elsewhere we heavily test that the .alg parser works properly. And all this test does is take every task, and stick it in either "XXX", "[ XXX ] : 2" or "{ XXX } : 3", parse it, and verify it parsed properly.
I think we should simply turn those three tests off? Or, if that seems too drastic, simply skipping WriteLineDocTask seems OK too?
Shai Erera (@shaie) (migrated from JIRA)
We can turn them off, or wrap new Benchmark with a try-catch on Exception, logging the failed task. Alternatively, we can add an 'exclude' list defining tasks the test should discard, and add WriteLineDocTask to it.
However, if you think they are useless, i.e. we test .alg parsing elsewhere (and I agree these tests don't add much value), then I agree we should remove them, rather than work hard to mask the test's limitations.
Michael McCandless (@mikemccand) (migrated from JIRA)
OK, let's just remove them. Can you post new patch? Thanks.
Shai Erera (@shaie) (migrated from JIRA)
All benchmark tests pass. Note: when you apply the patch, make sure you include the latest commons-compress jar Uwe uploaded.
Michael McCandless (@mikemccand) (migrated from JIRA)
Hmm I'm still hitting some errors, eg:
[junit] Testcase: testRegularFileWithBZipCompressionEnabled(org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest): FAILED
[junit] expected:<3> but was:<1>
[junit] junit.framework.AssertionFailedError: expected:<3> but was:<1>
[junit] at org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest.doReadTest(WriteLineDocTaskTest.java:87)
[junit] at org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest.testRegularFileWithBZipCompressionEnabled(WriteLineDocTaskTest.java:144)
[junit]
and
[junit] Testcase: testBZip2WithBzipCompressionDisabled(org.apache.lucene.benchmark.byTask.feeds.LineDocMakerTest): FAILED
[junit] expected:<1> but was:<0>
[junit] junit.framework.AssertionFailedError: expected:<1> but was:<0>
[junit] at org.apache.lucene.benchmark.byTask.feeds.LineDocMakerTest.doIndexAndSearchTest(LineDocMakerTest.java:96)
[junit] at org.apache.lucene.benchmark.byTask.feeds.LineDocMakerTest.testBZip2WithBzipCompressionDisabled(LineDocMakerTest.java:119)
Shai Erera (@shaie) (migrated from JIRA)
That's strange ... I did the following:
testRegularFileWithBZipCompressionEnabled simulates an attempt to read a bz2 file as a regular file. The very first readLine() should throw a MalformedException or something ... that's what the test counts on. It seems that in your case this line succeeds, reads something, and then fails on String.split(), since it probably didn't read anything meaningful. I don't understand why that would happen, though ... Can you run this test alone, w/o the rest? Perhaps debug-trace it? The test does not delete the in/output file before and after the test, but relies on the FileOutputStream(String/File) ctor, which is supposed to re-create the file even if it exists. Could it be that in your case that doesn't happen?
I assume the second exception is thrown for the same reason. Following the steps above for applying the patch, I don't understand why the test fails on your machine ...
Michael McCandless (@mikemccand) (migrated from JIRA)
So, for LineDocMakerTest.testBZip2WithBzipCompressionDisabled, indeed LineDocMaker opens the binary file, but then no exception is hit: it looks for a tab delimiter, and when it can't find one, sets body/title/date to "" and adds the doc anyway.
In your case you hit some exception – can you e.printStackTrace(System.out) and post back what exception that is? Maybe somehow your bzip2 is putting a tab in the binary but mine's not?
Shai Erera (@shaie) (migrated from JIRA)
sun.io.MalformedInputException
at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:262)
at sun.nio.cs.StreamDecoder$ConverterSD.convertInto(StreamDecoder.java:314)
at sun.nio.cs.StreamDecoder$ConverterSD.implRead(StreamDecoder.java:364)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:250)
at java.io.InputStreamReader.read(InputStreamReader.java:212)
at java.io.BufferedReader.fill(BufferedReader.java:157)
at java.io.BufferedReader.readLine(BufferedReader.java:320)
at java.io.BufferedReader.readLine(BufferedReader.java:383)
at org.apache.lucene.benchmark.byTask.feeds.LineDocMaker.makeDocument(LineDocMaker.java:187)
at org.apache.lucene.benchmark.byTask.tasks.AddDocTask.setup(AddDocTask.java:61)
at org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTask.java:92)
at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doSerialTasks(TaskSequence.java:148)
at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doLogic(TaskSequence.java:129)
at org.apache.lucene.benchmark.byTask.feeds.LineDocMakerTest.doIndexAndSearchTest(LineDocMakerTest.java:92)
at org.apache.lucene.benchmark.byTask.feeds.LineDocMakerTest.testBZip2WithBzipCompressionDisabled(LineDocMakerTest.java:119)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:79)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:618)
at junit.framework.TestCase.runTest(TestCase.java:164)
at junit.framework.TestCase.runBare(TestCase.java:130)
at junit.framework.TestResult$1.protect(TestResult.java:106)
at junit.framework.TestResult.runProtected(TestResult.java:124)
at junit.framework.TestResult.run(TestResult.java:109)
at junit.framework.TestCase.run(TestCase.java:120)
at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
bzip compression can aid the benchmark package by not requiring extracting bzip files (such as enwiki) in order to index them. The plan is to add a config parameter bzip.compression=true/false, and in the relevant tasks either decompress the input file or compress the output file using the bzip streams. It will add a dependency on ant.jar, which contains two classes similar to GZIPOutputStream and GZIPInputStream that compress/decompress files using the bzip algorithm.
bzip is known to be superior to gzip in compression (~20% better compression), although it compresses/decompresses a bit more slowly.
I will post a patch which adds this parameter and implements it in LineDocMaker, EnwikiDocMaker and WriteLineDocTask. Maybe even add the capability to DocMaker or some of the super classes, so it can be inherited by all sub-classes.
Migrated from LUCENE-1591 by Shai Erera (@shaie), resolved Jan 31 2011 Attachments: commons-compress-dev20090413.jar (versions: 2), LUCENE-1591.patch (versions: 8)