Lucene benchmark: objective performance test for Lucene [LUCENE-675]

asfimport commented 18 years ago

We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.

Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

Migrated from LUCENE-675 by Andrzej Bialecki (@sigram), 3 votes, resolved Jan 13 2007 Attachments: benchmark.byTask.patch, benchmark.patch, byTask.2.patch.txt, byTask.jre1.4.patch.txt, extract_reuters.plx, LuceneBenchmark.java, taskBenchmark.zip, timedata.zip, tiny.alg, tiny.properties

asfimport commented 18 years ago

Andrzej Bialecki (@sigram) (migrated from JIRA)

This is just a starting point for discussion - it's a pretty old file I found lying around, so it may not even compile with modern Lucene. Requires commons-compress.

asfimport commented 18 years ago

Paul Smith (migrated from JIRA)

If you're looking for freely available text in bulk, what about:

http://www.gutenberg.org/wiki/Main_Page

asfimport commented 18 years ago

Andrzej Bialecki (@sigram) (migrated from JIRA)

Yes, that could be a good additional source. However, IMHO the primary corpus should be widely known and standardized, hence my proposal of the Reuters.

(I mistakenly copy&paste-d the urls in the comment above - of course the corpus they're pointing at is the "20 Newsgroups", not the Reuters one. Correct url for the Reuters corpus is http://www.daviddlewis.com/resources/testcollections/reuters21578/ ).

asfimport commented 18 years ago

Paul Smith (migrated from JIRA)

From a strict performance point of view, a standard set of important, but don't forget other languages.

From a tokenization point of view (seperate to this issues), perhaps the Gutenberg project would be useful to test correctness of the analysis phase.

asfimport commented 18 years ago

Karl Wettin (migrated from JIRA)

It is also interesting to know how much time is consumed to assemble an instance of Document from the storage. According to my own tests this is the major reason to why InstantiatedIndex is so much faster than a FS/RAMDirectory. I also presume it to be the bottleneck of any RDBMS-, RMI- or any other "proxy"-based storage.

asfimport commented 18 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Since this has dependencies, do you think we should put it under contrib? I would be for a Performance directory and we could then organize it from there. Perhaps into packages for quantitative and qualitative performance.

asfimport commented 18 years ago

Andrzej Bialecki (@sigram) (migrated from JIRA)

The dependency on commons-compress could be avoided - I used this just to be able to unpack tar.gz files, we can use Ant for that. If you meant the dependency on the corpus - can't Ant download this too as a dependency?

Re: Project Gutenberg - good point, this is a good source for multi-lingual documents. The "Europarl" collection is another, although a bit more hefty, so that could be suitable for running large-scale benchmarks, and texts from Project Gutenberg for running small-scale tests.

asfimport commented 18 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Yeah, ANT can do this, I think. Take a look at the DB contrib package, it downloads. I think I can setup the necessary stuff in contrib, if people think that is a good idea. First contribution will be this file and then we can go from there. I think Otis has run some perf. stuff too, but I am not sure if it can be contributed. I think someone else has really studied query perf. so it would be cool if that was added too.

asfimport commented 18 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

I still haven't gotten my employer to sign and fax the CCLA, so I'm stuck and can't contribute my search benchmark.

I have a suggestion for a name for this - Lube, for Lucene Benchmark - contrib/lube.

asfimport commented 18 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think this is an incredibly important initiative: with every non-trivial change to Lucene (eg lock-less commits) we must verify performance did not get worse. But, as things stand now, it's an ad-hoc thing that each developer needs to do.

So (as a consumer of this), I would love to have a ready-to-use standard test that I could run to check if I've slowed things down with lock-less commits.

In the mean time I've been using Europarl for my testing.

Also important to realize is there are many dimensions to test. With lock-less I'm focusing entirely on "wall clock time to open readers and writers" in different use cases like pure indexing, pure searching, highly interactive mixed indexing/searching, etc. And this is actually hard to test cleanly because in certain cases (highly interactive case, or many readers case), the current Lucene hits many "commit lock" retries and/or timeouts (whereas lock-less doesn't). So what's a "fair" comparison in this case?

In addition to standardizing on the corpus I think we ideallly need standardized hardware / OS / software configuration as well, so the numbers are easily comparable across time. Even the test process itself is important, eg details like "you should reboot the box before each run" and "discard results from first run then take average of next 3 runs as your result", are important. It would be wonderful if we could get this into a nightly automated regression test so we could track over time how the performance has changed (and, for example, quickly detect accidental regressions). We should probably open this as a separate issue which depends first on this issue being complete.

asfimport commented 18 years ago

Mike Klaas (migrated from JIRA)

A few notes on benchmarks:

First, it is important to realize that no benchmark will ever fully-capture all aspects of lucene performance, particularly since so many real-world data distributions are so varied. That said, they are useful tools, especially if they are componentized to measure various aspects of lucene performance (the narrower the goal of the benchmark it, the better a benchmark can be created).

It is rather unrealistic to expect to standardize hardware / os ... better to compare before/after numbers on a single configuration, rather than comparing the numbers among configurations. The test process is important, but anything crucial should be built into the test (like the number of iterations; taking the average, etc). Concerning the specifics of this: Requiring reboots is onerous and not an important criterion (at least for unix systems-~~I'm not sufficiently familiar with windows to comment). Better to stipulate a relatively quiscient machine. Or perhaps not~~-it might be useful to see how the machine load affects lucene performance. Also, the arithmetic mean is a terrible way of combining results due to its emphasis on outliers. Better is the average over minimum times of small sets of runs.

Of course, any scheme has its problems. In general, the most important thing when using benchmarks is being aware of the limitations of the benchmark and methodology used.

asfimport commented 18 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

My comments are marked by GSI

In the mean time I've been using Europarl for my testing.

GSI: perhaps you can contribute once this is setup

Also important to realize is there are many dimensions to test. With lock-less I'm focusing entirely on "wall clock time to open readers and writers" in different use cases like pure indexing, pure searching, highly interactive mixed indexing/searching, etc. And this is actually hard to test cleanly because in certain cases (highly interactive case, or many readers case), the current Lucene hits many "commit lock" retries and/or timeouts (whereas lock-less doesn't). So what's a "fair" comparison in this case?

GSI: I am planning on taking Andrzej contribution and refactoring it into components that can be reused, as well as creating a "standard" benchmark which will be easy to run through a simple ant task, i.e. ant run-baseline

GSI: From here, anybody can contribute their own (I will provide interfaces to facilitate this) benchmarks which others can choose to run.

In addition to standardizing on the corpus I think we ideallly need standardized hardware / OS / software configuration as well, so the numbers are easily comparable across time.

GSI: Not really feasible unless you are proposing to buy us machines :-) I think more important is the ability to do a before and after evaluation (that runs each test several times) as you make changes. Anybody should be able to do the same. Run the benchmark, apply the patch and then rerun the benchmark.

asfimport commented 18 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

First – I think it's a good initiative. Grant, when you're thinking about the infrastructure, it would be pretty neat to have a way of logging performance in a way so that one could draw charts from them. You know, for the visual folks :)

Anyway, my other idea is that benchmarking Lucene can be performed on two levels: one is the user level, where the entire operation counts (such as indexing, searching etc). Another aspect is measurement of atomic parts within the big operation so that you know how much of the whole thing each subpart takes. I wrote an interesting piece of code once that allows measuring times for named operation (per-thread) in a recursive way. Looks something like this:

perfLogger.start("indexing"); try { .. code (with recursion etc) ... perfLogger.start("subpart"); try {

} finally { perfLogger.stop(); } } finally { perfLogger.stop(); }

in the output you get something like this:

indexing: 5 seconds; ->subpart: : 2 seconds; -> ...

Of course everything comes at a price and the above logging costs some CPU cycles (my implementation stored a nesting stack in ThreadLocals).

One can always put that code in 'if' clauses attached to final variables and enable logging only for benchmarking targets (the compiler will get rid of logging statements then).

If folks are interested I can dig out that performance logger and maybe adopt it to what Grant comes up with.

asfimport commented 18 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I agree: a simple ant-accessible benchmark to enable "before and after" runs is an awesome step forward. And that a standardized HW/SW testing environment is not really realistic now.

> GSI: perhaps you can contribute once this is setup

I will try!

asfimport commented 18 years ago

Doron Cohen (migrated from JIRA)

Few things that would be nice to have in this performance package/framework -

() indexing only overall time. () indexing only time changes as the index grows (might be the case that indexing performance starts to misbehave from a certain size or so). () search single user while indexing () search only single user () search only concurrent users () short queries () long queries () wild card queries () range queries () queries with rare words () queries with common words () tokenization/analysis only (above indexing measurements include tokenization, but it would be important to be able to "prove" to oneself that tokenization/analysis time is not hurt by a recent change).

() parametric control over: () () location of test input data. () () location of output index. () () location of output log/results. () () total collection size (total number of bytes/characters read from collection) () () document (average) size (bytes/chars) - test can break input data and recompose it into documents of desired size. () () "implicit iteration size" - merge-factor, max-buffered-docs () () "explicit iteration size" - how often the perf test calls () () long queries text () () short queries text () () which parts of the test framework capabilities to run () () number of users / threads. () () queries pace - how many queries are fired in, say, a minute.

Additional points: () Would help if all test run parameters are maintained in a properties (or xml config) file, so one can easily modify the test input/output without having to recompile the code. () Output to allow easy creation of graphs or so - perhaps best would be to have an result object, so others can easily extend with additional output formats. () index size as part of output. () number of index files as part of output ? () indexing input module that can loop over the input collection. This allows to test indexing of a collection larger than the actual input collection being used.

asfimport commented 18 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

OK, I have a preliminary implementation based on adapting Andrzej's approach. The interesting thing about this approach, is it is easy to adapt to be more or less exhaustive (i.e. how many of the parameters does one wish to have the system alter as it runs) Thus, you can have it change the merge factors, max buffered docs, number of documents indexed, number of different queries run, etc. The tradeoff, of course, is the length of time it takes to run these.

So my question to those interested, is what is a good baseline running time for testing in a standard way? My initial thought is to have something that takes between 15-30 minutes to run, but I am not sure on this. Another approach would be to have three "baselines": 1. quick validation (5 minutes to run...) 2. standard (15-45) 3. exhaustive (1-10 hours).

I know several others have built benchmarking suites for their internal use, what has been your strategy?

Thoughts, ideas, insights?

Thanks, Grant

asfimport commented 18 years ago

Marvin Humphrey (migrated from JIRA)

The indexing benchmarking apps I wrote take command line arguments for how many docs and how many reps. My standard test is to do 1000 docs and 6 reps. Within a couple seconds the first rep is done and the app is printing out results. For rapid development, having something that speedy is really handy.

asfimport commented 18 years ago

Doug Cutting (@cutting) (migrated from JIRA)

As Marvin points out, quick micro-benchmarks are great to have. But other effects only show up when things get very large. So I think we need at least two baselines: micro and macro.

asfimport commented 18 years ago

Marvin Humphrey (migrated from JIRA)

Grant had asked me if he could reuse some code from the indexer benchmarks I wrote. Here are the relevant files, contributed with the expectation they will be cannibalized, not included verbatim.

asfimport commented 18 years ago

Marvin Humphrey (migrated from JIRA)

One more file...

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

OK, here is a first crack at a standard benchmark contribution based on Andrzej original contribution and some updates/changes by me. I wasn't nearly as ambitious as some of the comments attached here, but I think most of them are good things to strive for and will greatly benefit Lucene.

I checked in the basic contrib directory structure, plus some library dependencies, as I wasn't sure how svn diff handles those. I am posting this in patch format to solicit comments first instead of just committing and accepting patches. My thoughts are I'll take a round of comments and make updates as warranted and then make an initial commit.

I am particularly interested in the interface/Driver specification and whether people think this approach is useful or not. My thoughts behind it were it might be nice to have a standard way of creating/running benchmarks that could be driven by XML configuration files (some examples are in the conf directory). I am not 100% sold on this and am open to compelling arguments why we should just have each benchmark have it's own main() method.

As for the actual Benchmarker, I have created a "standard" version, which runs off the Reuters collection that is downloaded automatically by the ANT task. There are two ANT targets for the two benchmarks: run-micro-standard and run-standard. The micro version takes a few minutes to run on my machine (it indexes 2000 docs), the other one takes a lot longer.

There are several support classes in the stats and util packages. The stats package supports building and maintaining information about benchmarks. The utils package contains one class for extracting information out of the Reuters documents for indexing.

The ReutersQueries class contains a set of Queries I created by looking at some of the docs in the collection and are a myriad of term, phrase, span, wildcard and other types of queries. They aren't exhaustive by any means.

It should be stressed that these benchmarks are best used in gathering before and after numbers. Furthermore, these aren't the be all end all of benchmarking for Lucene. I hope the interface nature will encourage others to submit benchmarks for specific areas of Lucene not covered by this version.

Thanks to all who contributed their code/thoughts. Patch to follow

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Initial Benchmark code based on Andrzej original contribution plus some changes by me to use the Reuters "standard" collection maintained at http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

To run, checkout contrib/benchmark and then apply the benchmark.patch in the contrib/benchmark directory.

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

I tried it and it is working nice! - 1st run downloaded the documents from the Web before starting to index. 2nd run started right off - as input docs are already in place - great.

Seems the only output is what is printed to stdout, right?

I got something like this:

 [echo] Working Directory: work
 [java] Testing 4 different permutations.
 [java] #-- ID: td-00_10_10, Sun Nov 05 22:40:49 PST 2006, heap=1065484288 –
 [java] # source=work\reuters-out, directory=org.apache.lucene.store.FSDirectory@D:\devoss\lucene\java\trunk\contrib\benchmark\work\index
 [java] # maxBufferedDocs=10, mergeFactor=10, compound=true, optimize=true
 [java] # Query data: R-reopen, W-warmup, T-retrieve, N-no
 [java] # qd-0110 R W NT [body:salomon]
 [java] # qd-0111 R W T [body:salomon]
 [java] # qd-0100 R NW NT [body:salomon]

... [java] # qd-14011 NR W T [body:fo*] [java] # qd-14000 NR NW NT [body:fo*] [java] # qd-14001 NR NW T [body:fo*]

 [java] Start Time: Sun Nov 05 22:41:38 PST 2006
 [java]  - processed 500, run id=0
 [java]  - processed 1000, run id=0
 [java]  - processed 1500, run id=0
 [java]  - processed 2000, run id=0
 [java] End Time: Sun Nov 05 22:41:48 PST 2006
 [java] warm = Warm Index Reader
 [java] srch = Search Index
 [java] trav = Traverse Hits list, optionally retrieving document

 [java] # testData id   operation   runCnt  recCnt  rec/s   avgFreeMem  avgTotalMem
 [java] td-00_100_100   addDocument 1   2000    472.0321    4493681 22611558
 [java] td-00_100_100   optimize    1   1   2.857143    4229488 22716416
 [java] td-00_100_100   qd-0110-warm    1   2000    40000.0 4250992 22716416
 [java] td-00_100_100   qd-0110-srch    1   1   Infinity    4221288 22716416

... [java] td-00_100_100 qd-4110-srch 1 1 Infinity 3993624 22716416 [java] td-00_100_100 qd-4110-trav 1 0 NaN 3993624 22716416 [java] td-00_100_100 qd-4111-warm 1 2000 50000.0 3853192 22716416 ... BUILD SUCCESSFUL Total time: 1 minute 0 seconds

I think the "infinity" and "NAN" are caused by op time too short for divide-by-sec. This can be avoided by modifying getRate() in TimeData: public double getRate() { double rps = (double) count * 1000.0 / (double) (elapsed>0 ? elapsed : 1); return rps; }

I like much the logic of loading test data from the Web, and the scaleUp and maximumDocumentsToIndex params are handy.

It seems that all the test logic and some of its data (queries) are java coded. I initially thought of a setting where we define tasks/jobs that are parameterized, like:

createIndex(params)
writeToIndex(params):
addDocs()
optimize()
readFromIndex(params):
searchIndex()
fetchData()

..and compose a test by an XML that says which of these simple jobs to run, with what params, in which order, serial/parallel, how long/often etc. Then creating a different test is as easy as creating a different XML that configures that test.

On the other hand, chances are, I know, that most useful cases would be those already defined here - standard and micro-standard, so can ask "why bothering changing to define these building blocks". I am not sure here, but thought I'll bring it up.

About Using the driver - seems nice and clean to me. I don't know the Digester but it seems to read the config from the XML correctly.

Other comments:

I think there is a redundant call to params.showRunData(params.getId()) in runBenchmark(File,Options);
Seems that rec/sec would be a bit more accurately computed by aggregating elapsed times (instead of rate) in showRunData()
If TimeData not found (only memData) I think additional 0.0 should be printed
columns allignments with tabs and floats is imperfect.:-)
It would be nice I think to also get a summary of the results by "task" - e.g. srch, optimize, something like: [java] # testData id operation runCnt recCnt rec/s avgFreeMem avgTotalMem [java] warm 60 2000 42,628.8 8,235,758 23,048,192 [java] srch 120 1 571.4 8,300,613 23,048,192 [java] optimize 1 1 2.9 9,375,732 23,048,192 [java] trav 120 107 30,517.8 8,326,046 23,048,192 [java] addDocument 1 2000 441.8 7,310,929 22,206,872

Attached timedata.zip has modifies TimeData.java and TestData.java for [1 to 5] above, and for the NAN/inifinite.

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

1st run downloaded the documents from the Web before starting to index. 2nd run started right off - as input docs are already in place - great.

Seems the only output is what is printed to stdout, right?

GSI: The Benchmarker interface does return the TimeData, so other implementations, etc. could use the results programmatically.

I like much the logic of loading test data from the Web, and the scaleUp and maximumDocumentsToIndex params are handy.

It seems that all the test logic and some of its data (queries) are java coded. I initially thought of a setting where we define tasks/jobs that are parameterized, like:

createIndex(params)
writeToIndex(params):
addDocs()
optimize()
readFromIndex(params):
searchIndex()
fetchData()

GSI: I definitely agree that we want a more flexible one to meet people's benchmarking needs. I wanted at least one test that is "standard" in that you can't change the parameters and test cases, so that we can all be on the same page on a run. Then, when people are having discussions on performance they can say "I ran the standard benchmark before and after and here are the results" and we all know what they are talking about. I think all the components are there for a parameterized version, all it takes is someone to extend the Standard one or implement there own that reads in a config file. I will try to put in a fully parameterized version soon.

GSI: Thanks for the fixes, I will incorporate into my version and post another patch soon.

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

I looked at extending the benchmark with:

different test "scenarios", i.e. other sequences of operations.
multithreaded tests, e.g. several queries in parallel.
rate of events, e.g. "2 queries arriving per second", or "one query per second in parallel with 20 new documents in a minute".
different data sources (input documents, queries).

For this I made lots of changes to the benchmark code, using parts of it and rewriting other parts. I would like to submit this code in a few days - it is running already but some functionality is missing.

I would like to describe how it works to hopefully get early feedback.

There are several "basic tasks" defined - all extending an (abstract) class PerfTask:

AddDocTask
OptimizeTask
CreateIndexTask etc.

To further extend the benchmark 'framework', new tasks can be added. Each task must implement the abstract method: doLogic(). For instance, in AddDocTask this method (doLogic) would call indexWriter.addDocument(). There are also setup() and tearDown() methods for performing work that should not be timed for that task.

A special TaskSequence task contains other tasks. It is either parallel or sequential, which tells if it executes its child tasks serially or in parallel. TaskSequence also supports "rate": the pace in which its child tasks are "fired" can be controlled.

With these tasks, it is possible to describe a performance test 'algorithm' in a simple syntax. ('algorithm' may be too big a word for this...?)

A test invocation takes two parameters:

test.properties - file with various config properties.
test.alg - file with the algorithm.

By convention, for each task class "OpNameTask", the command "OpName" is valid in test.alg.

Adding a single document is done by: AddDoc

Adding 3 documents: AddDoc AddDoc AddDoc

Or, alternatively: { AddDoc } : 3

So, '{' and '}' indicate a serial sequence of (child) tasks.

To fire 100 queries in a row: { Search } : 100

To fire 100 queries in parallel: [ Search ] : 100

So, '[' and ']' indicate a parallel group of tasks.

To fire 100 queries in a row, 2 queries per second (120 per minute): { Search } : 100 : 120

Similar, but in parallel: [ Search ] : 100 : 120

A sequence task can be named for identifying it in reports: { "QueriesA" Search } : 100 : 120

And there are tasks that create reports.

There are more tasks, and more to tell on the alg syntax, but this post is already long..

I find this quite powerful for perf testing. What do you (and you) think?

Doron

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

I am attaching a sample tiny.* - the .alg and .properties files I currently use - I think they may help to understand how this works.

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

OK, how about I commit my changes, then you can add a patch that shows your ideas?

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

Sounds good.

In this case I will add my stuff under a new package: org.apache.lucene.benchmark2. (this package would have no dependencies in org.apache.lucene.benchmark.). I will also add tarkets in buid.xml, and add .alg, and .alg files under conf. Makes sense?

Do you already know when you are going to commit it?

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

I'm not a big fan of tacking a number on to the end of Java names, as it doesn't let you know much about what's in the file or package. How about ConfigurableBenchmarker or PropertyBasedBenchmarker or something along those lines, since what you are proposing is a property based one. I think it can just go in the benchmark package or you could make a sub package under there that is more descriptive.

I will try to commit tonight or tomorrow morning.

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

Good point on names with numbers - I'm renaming the package to taskBenchmark, as I think of it as "task sequence" based, more than as propetries based.

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

Would be nice to get some feedback on what I already have at this point for the "task based benchmark framework for Lucene".

So I am packing it as a zip file. I would probably resubmit as a patch when Grant commits the current benchmark code. See attached taskBenchmark.zip.

To try out taskBenchmark, unzip under contrib/benchmark, on top of Grant's benchmark.patch. This would do 3 changes:

replace build.xml - only change there is adding two targets: run-task-standard and run-task-micro-standard.
add 4 new files under conf:
- task-standard.properties
- task-standard.alg
- task-micro-standard.properties
- task-micro-standard.alg
add a src package 'taskBenchmark' side by side with current 'benchmark' package.

To try it out, go to contrib/benchmark and try 'ant run-task-standard' or 'ant run-task-micro-standard'.

See inside the .alg files for how a test is specified.

The algorithm syntax and the entire package is documented in the package javadoc for taskBenchmark (package.html).

Regards, Doron

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

Attached taskBenchmark.zip as described earlier.

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Committed the benchmark patch plus Doron's update to TestData and TimeData

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

I am attaching benchmark.byTask.patch - to be applied in the contrib/benchmark directory.

Root package of byTask classes was modified to org.apache.lucene.benchmark.byTask, in the lines of Grant's suggestion - seems better cause it keeps all benchmark classes under lucene.benchmark.

I added one a sample .alg under conf and added some documentation.

Entry point - documentation wise - is the package doc for org.apache.lucene.benchmark.byTask.

Thanks for any comments on this!

PS. Before submitting the patch file, I tried to apply it myself on a clean version of the code, just to make sure that it works. But I got errors like this – Could not retrieve revision 0 of "...\byTask.." – for every file under a new folder. So I am not sure if it is just my (Windows) svn patch applying utility, or is it really impossible to apply a patch that creates files in (yet) nonexistent directories. I searched Lucene mailing lists and SVN mailing lists and went again through the SVN book again but nowhere could I find what is the expected behavior for applying a patch containing new directories. In fact, "svn diff" would not even show you files that are new (again, this is the Windows svn 1.4.2 version). (I used Tortoise SVN to create the patch). This is rather annoying and I might be misunderstanding something basic about SVN, but I thought it'd be better to share this experience here - might save some time for others trying to apply this patch or other patches...

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Doron,

When I apply your patch, I am getting strange errors. It seems to go through cleanly, but then the new files (for instance, byTask.stats.Report.java) has the whole file occurring twice in each file, thus causing duplicate class exceptions. This happens for all the files in the byTask package. The changes in the other files apply cleanly.

I applied the patch as: patch -p0 -i <patch file> as I always do on a clean version.

I suspect that your last comment may be at the root of the issue. Can you try applying this again to a clean version and see if you still have issues or whether it is something I am missing? Can you regenerate this patch, perhaps using a command line tool? Looking at the patch file, I am not sure what the issue is.

Otherwise, based on the documentation, this sounds really interesting and useful. Based on some of your other patches, I assume you are using this to do benchmarking, no?

Thanks, Grant

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

Grant, thanks for trying this out - I will update the patch shortly. I am using this for benchmarking - quite easy to add new stuff - and in fact I added some stuff lately but did not update here because wasn't sure if others are interested. I will verify what I have with svn head and pack it here as an updated patch. Regards, Doron

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

This update of the byTask package includes:

allowing to tailor a perf test "programmically" (without an .alg file).
maintaining both the "algorithm" and the run-properties in a single .alg file - this is easier to maintain in my opinion.
some code cleanup.
build.xml has a single "task related" target now: run-task. an ant property is used to invoke other .alg files.
documentation updated (package docs under byTask).

To apply the patch from the trunk dir: patch -p0 -i <byTask.2.patch.txt> To test it, cd to contrib/benchmark and type: ant run-task

Grant, I noticed that the patch file contains EOL characters - Unix/DOS thing I guess. But 'patch' works cleanly for me either with these characters or without them, so I am leaving these characters there. I hope this patch applies cleanly for you.

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Hey Doron,

Your patch uses JDK 1.5. I am assuming it is safe to use Class.getName in place of Class.getSimpleName, right? I think once I do that plus change the String.contains calls to String.indexOf it should all be fine, right? I have it compiling and running, so that is a good sign. I will look to commit soon.

-Grant

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

Oops... I had the impression that compiling with compliance level 1.4 is sufficient to prevent this, but guess I need to read again what that compliance level setting guarantees exactly.

Anyhow there are a 3 things that require 1.5:

Boolean.parseBoolean() --> Boolean.valueOf().booleanValue()
String.contains() --> indexOf()
Class.getSimpleName() --> ?

Modifying Class.getSimpleName() to Class.getName() would not be very nice - queries prints and task names prints would be quite ugly. To fix that I added a method simpleName(Class) to byTask.util.Format. I am attaching an updated patch - byTask.jre1.4.patch.txt - that includes this method and removes the Java 1.5 dependency.

Thanks for catching this! Doron

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Doron,

I have committed your additions. This truly is great stuff. Thank you so much for contributing. The documentation (code and package level) is well done, the output is very readable. The alg language is a bit cryptic and takes a little deciphering, but you do document it quite nicely. I like the extendability factor and I think it will make it easier for people to contribute benchmarking capabilities.

I would love to see someone mod the reporting mechanism in the future to allow for printing info to something other than System.out, as I know people have expressed interest in being able to slurp the output into Excel or similar number crunching tools. This could also lead to the possibility of running some of the algorithms nightly and then integrating with JUnitPerf or some other performance unit testing approach.

We may want to consider deprecating the other benchmarking stuff, although, I suppose it can't hurt to have multiple opinions in this area.

At any rate, this is very much appreciated. I would encourage everyone who is interested in benchmarking to take a look and provide feedback. I'm going to mark this bug as finished for now as I think we have a good baseline for benchmarking at this point.

Thanks again, Grant

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Have committed a baseline benchmarking suite thanks to Doron and Andrzej. Bugs can now be opened specific to the code in the contrib area.

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

This has been committed and is available for use. New issues can be opened on specific problems.

asfimport commented 14 years ago

Marvin Humphrey (migrated from JIRA)

During the course of a recent IP audit, I determined that two out of three files I contributed to LUCENE-675 back in 2006 were in fact based on an original written by Murray Walker: LuceneIndexer.java and BenchmarkingIndexer.pm. (The third file, "extract_reuters.plx", was my own work as advertised.)

Murray has graciously expressed a willingness to license his work to Apache, but since the files in question were not used, the consensus opinion is that it would be best to delete them. For further reference, see the legal-discuss@a.o archives: <http://markmail.org/message/4esu3owjxft5n2f7>.

I feel very fortunate that the problematic contributions were not integrated into Lucene and that it was the work of an eminently reasonable solo author whose work was inadvertently contributed without permission. I apologize to Murray and to the Lucene community for my errors.

apache / lucene

Lucene benchmark: objective performance test for Lucene [LUCENE-675] #1750