apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Add supported for Wikipedia English as a corpus in the benchmarker stuff [LUCENE-848] #1923

Closed asfimport closed 17 years ago

asfimport commented 17 years ago

Add support for using Wikipedia for benchmarking.


Migrated from LUCENE-848 by Steven Parkes, 1 vote, resolved Jul 08 2007
Attachments: LUCENE-848.txt (versions: 7), LUCENE-848-build.patch, WikipediaHarvester.java, xerces.jar (versions: 2), xml-apis.jar
Linked issues:

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Sorry; it's not a major thing.

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

There is some code in #1901. Here is a newer version.

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Can't leave the typo in the title. It's bugging me.

Karl, it looks like your stuff grabs individual articles, right? I'm going to have it download the bzip2 snapshots they provide (and that they prefer you use, if you're getting much).

Question (for Doron and anyone else): the file is XML and it's big, so DOM isn't going to work. I could still use something SAX-based, but since the format is so tightly controlled, I'm thinking regular expressions would be sufficient and have fewer dependencies. Anyone have opinions on this?
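For reference, a minimal sketch of the SAX route (not taken from the patch): the parser streams the dump, so memory stays bounded no matter how large the file is. Element names follow the Wikipedia export schema; the class and its behavior are illustrative only.

    import java.io.FileInputStream;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.helpers.DefaultHandler;

    // Illustrative SAX pass over the dump: counts <page> elements and
    // collects titles without ever holding the whole document in memory.
    public class PageCounter extends DefaultHandler {
      private int pages = 0;
      private boolean inTitle = false;
      private StringBuffer title = new StringBuffer();

      public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("title".equals(qName)) { inTitle = true; title.setLength(0); }
      }

      public void characters(char[] ch, int start, int length) {
        if (inTitle) title.append(ch, start, length);
      }

      public void endElement(String uri, String local, String qName) {
        if ("title".equals(qName)) inTitle = false;   // title text is now complete
        if ("page".equals(qName)) pages++;            // one <page> per article
      }

      public static void main(String[] args) throws Exception {
        PageCounter handler = new PageCounter();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new FileInputStream(args[0])), handler);
        System.out.println(handler.pages + " pages");
      }
    }

A regex-based scanner would avoid the parser dependency, but SAX stays robust if the export format ever grows new attributes or entities.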

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

> Karl, it looks like your stuff grabs individual articles, right? I'm going to have it download the bzip2 snapshots they provide (and that they prefer you use, if you're getting much).

They also supply the rendered HTML every now and then. It should be enough to change the URL pattern to file:///tmp/wikipedia/. I was considering porting the MediaWiki BNF as a tokenizer, but found it much simpler to just parse the HTML.

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

This patch is a first cut at Wikipedia benchmark support. It downloads the current English pages from the Wikipedia download site ... which, of course, is actually not there right now. I'm not quite sure what's up, but you can find the files at http://download.wikimedia.org/enwiki/20070402/ right now if you want to play.

It adds ExtractWikipedia.java, which uses Xerces-J to grab the individual articles. It writes the articles in the same format as the Reuters stuff, so a genericised ReutersDocMaker, DirDocMaker, works.

The current size of the download file is 2.1G bzip2'd. It's supposed to contain about 1.2M documents but I came out with 2 or 3, I think, so there may be "extra" files in there. (Some entries are links and I tried to get rid of those, but I may have missed a particular coding or case).

For the first pass, I copied the Reuters steps of decompressing and parsing. This creates big temporary files. Moreover, it creates a big directory tree in the end. (The extractor uses a fixed number of documents per directory and grows the depth of the tree logarithmically, a lot like Lucene segments).
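As a rough illustration of that kind of layout (not necessarily what the extractor actually does): cap the number of files per directory and peel off base-N "digits" of the running document counter to name the nested directories, so depth grows with log_N of the document count.

    import java.io.File;

    // Illustrative only: maps a running document counter to a nested
    // directory so no directory accumulates more than DOCS_PER_DIR leaf
    // files; depth grows logarithmically with the total count. The real
    // extractor's scheme may differ. Caller is expected to mkdirs().
    public class DocPath {
      private static final int DOCS_PER_DIR = 1000;

      public static File pathFor(File root, int docId) {
        File dir = root;
        for (int bucket = docId / DOCS_PER_DIR; bucket > 0; bucket /= DOCS_PER_DIR) {
          // each level consumes another base-1000 "digit" of the counter
          dir = new File(dir, String.valueOf(bucket % DOCS_PER_DIR));
        }
        return new File(dir, docId + ".txt");
      }
    }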

It's not clear how this preprocessing-to-a-directory-tree compares to on-the-fly decompression, which would require fewer disk seeks on the input during indexing. May try that at some point ...

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

By the way, that's a rough patch. I'm cleaning it up as I use it to test 847.

Also, I was going to add support to the algorithm format for setting max field length ...

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

> Also, I was going to add support to the algorithm format for setting max field length ...

If this means extending the algorithm language, it would be simpler to just base it on a property here: in the alg file set that property - "max.field.length=20000" - and then in OpenIndexTask read that new property (see how the merge.factor property is read) and set it on the index.
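A hedged sketch of that suggestion, with names recalled from the 2.x-era benchmark code rather than taken from the patch (the property name and the 10000 default, IndexWriter's own default, are assumptions):

    import org.apache.lucene.benchmark.byTask.utils.Config;
    import org.apache.lucene.index.IndexWriter;

    // Sketch: read the property from the .alg file's Config the same way
    // merge.factor is read in OpenIndexTask, then apply it to the writer.
    public class MaxFieldLengthSupport {
      static void applyMaxFieldLength(Config config, IndexWriter writer) {
        int maxFieldLength = config.get("max.field.length", 10000); // 10000 = IndexWriter default
        writer.setMaxFieldLength(maxFieldLength);
      }
    }

The corresponding .alg line would then just be the property assignment, e.g. max.field.length=20000.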

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

That's what I meant (and did).

If it's okay, I'll bundle it into 848.

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

Seems okay to me (since it's all in the benchmark).

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Update of the previous patch. Used Doron's suggestion for the variable name. Cleaned up a little (reverted the eol style on build.txt so the diff makes sense; see #1939 for fixing the eol-styles in contrib/benchmark).

Right now the test algorithm is wikipedia.alg, but I think the idea is to create specific benchmarks, so maybe this should be something like ingest-enwiki, meaning a test of ingest rate against Wikipedia.

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Blah. This patch doesn't work quite right with 1.4. My intention was/is to use Xerces to do the XML parsing, but the setup doesn't work quite right under 1.4, which has some Crimson stuff in rt.jar that I don't (yet) understand.
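For context, JDK 1.4 bundles Crimson as its default JAXP parser inside rt.jar, so the bundled parser wins unless Xerces is selected explicitly. A minimal sketch of two common workarounds (class names are Xerces 2's; this is not code from the patch):

    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.XMLReaderFactory;

    // Sketch: make sure Xerces, not the JDK's bundled Crimson, does the parsing.
    public class XercesBootstrap {
      public static XMLReader newXercesReader() throws Exception {
        // Option 1: point JAXP at the Xerces factory instead of Crimson's.
        System.setProperty("javax.xml.parsers.SAXParserFactory",
                           "org.apache.xerces.jaxp.SAXParserFactoryImpl");
        // Option 2: bypass JAXP and load the Xerces SAX2 driver by name.
        return XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
      }
    }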

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Okay, I've tested this patch against 1.4, 1.5, and 1.6. I've added the xerces lib since we're including other required support jars in lib.

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Here's the version of xerces that I used, to go in contrib/benchmark/lib (svn diff seems to eat binary files).

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Upgrade to Xerces 2. Xerces 1 passes the sanity check, but fails for wikipedia, evidently because of >2G files.

In addition to patch, requires xerces.jar and xml-apis.jar.

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Now I see the button for attach multiple files. Oh, well.

Anyway, both jars go in contrib/benchmark/lib.

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Steven, is this ready to go in your opinion? If so, I will take a look at it and try to add it this week.

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

yeah; think so; it worked for my benchmarking stuff on a couple of systems; might have some things others discover, but that's always true

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Hi Steven,

Do you know what version of Xerces and xml-apis these are? I can add the version onto them when I check them in.

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

I'm getting:

Getting: http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
[get] To: /Users/grantingersoll/projects/lucene/Lucene-Trunk/contrib/benchmark/temp/enwiki-latest-pages-articles.xml.bz2
[get] Error opening connection java.io.FileNotFoundException: http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
[get] Error opening connection java.io.FileNotFoundException: http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
[get] Error opening connection java.io.FileNotFoundException: http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
[get] Can't get http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 to <mydir>/contrib/benchmark/temp/enwiki-latest-pages-articles.xml.bz2

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Both jars are from xerces-2.9.0.

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

I haven't tried this patch yet - hesitated/thinking it must take very long to download the huge start-up data (is this correct?)... anyhow I was wondering about the new jars - whether we should try to make the xerces and xml-apis jars "ext-jars", i.e. downloaded from somewhere (where?) only when attempting to use this package. Otherwise this is adding ~2.5MB to the checkout/dev-pack - do others consider this an issue at all?

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Yeah, it takes a while to download.

I added the jars since that's what we've been doing elsewhere. In fact, xerces is in gdata-server too. Personally, the size isn't an issue for me; don't know about others. What might be difficult, though, is trying to share the two, since that would mean coordinating contrib projects, and I don't know anything about the gdata server. I can tell you that if you want to support both 1.4 and 1.5 on something as big as wikipedia, there is sensitivity to the xerces revision.

Sorry about the download problem, Grant. I actually documented that in a readme ... that I can no longer find. I would swear I put it in the patch but obviously I didn't because it's not there. Now I have to go find it.

The short answer is you want to download http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2. The wikipedia download site isn't always clean and doesn't have files where they "should" be. It was when I first started this, but isn't now.

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

+1 Not a big deal to go get the files via an ANT task. Of course, this could stir the whole maven/ivy debate once again :-)

The other question is whether there are common libraries sprinkled throughout contrib such that it might make sense to create a contrib/lib. Of course, then you would have to figure out what versions to support, etc. Aaah, Maven, just put it in the POM... :-)



asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Here's the patch with the README.

By the way, there's also a .rsync-filter in the patch. I never described that. If you use rsync, there's an option where it will look for these filter files and not rsync files/directories as spec'd in the file.

Since I sometimes rsync working copies around to test on different machines, and since I don't want to try to copy around wikipedia (or the other datasets), I "spec" those out.

Without the appropriate rsync option, the files are ignored, so I would think this would be a good thing to have ...

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Friendly reminder: the latest patch looks like it still has some cancerous whitespace in it!

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Well, here's a version with less whitespace.

But, I have to admit, removing it turned out to be more difficult than I thought it would be. I may have gone too far. It's hard for me to judge "benign" ("as long as it doesn't hurt readability") for obvious reasons.

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Alas I fear you did not go quite far enough; there's still lots of extra whitespace around ()'s and []'s.

For example I think source like this:

if ( qualified.equals( "title" ) ) {

should look like this instead:

if (qualified.equals("title")) {

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Ah. That would be because I was thinking vertically, not horizontally.

Would this be reasonably normative? http://java.sun.com/docs/codeconv/html/CodeConventions.doc7.html#475

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Close to http://java.sun.com/docs/codeconv/html/CodeConventions.doc7.html#475. Within normal Lucene differences, I believe.

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Ahhh, that looks great Steve. Thanks.

asfimport commented 17 years ago

Doug Cutting (@cutting) (migrated from JIRA)

Yes, the standard for Lucene Java (as specified in http://wiki.apache.org/jakarta-lucene/HowToContribute) is Sun's except 2-space indentation.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

I'm not familiar with this patch but looking at the recent comments it looks ready to commit?

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Grant was looking at hosting a copy of the dataset on zones so that we'd have a fixed dataset which would enable more repeatable experiments. If that happened, we could update code/readme to point there rather than fetching things from wikipedia, where things are always changing (and not always there).

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

OK I see, that makes sense. I think we can clear the fix version here?

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

is there any reason not to host these on lucene.apache.org instead of the zone?

I ask this assuming the barrier is setting up a webserver on the zone to host the file and not any remaining legal issues, since those seem to have been ok'd...

http://www.nabble.com/Fwd%3A-Wikipedia-content%2C-GNU-Free-Documentation-License-and-Apache-p10182964.html

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

I'll leave the hosting site to others; I don't know enough about apache infra.

If the hosting got decided before 2.2 got cut, that'd be great, but I certainly don't think it's worth holding up the release for.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

Alright. Clearing the fix version to not block 2.2.

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

I think the zone is the preferred place, since that is for developer resources. Since this isn't in the main line of testing, it probably won't be downloaded all that much.

Steven, do we have a final version that you can point me at that you want hosted?



asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

It looks like the latest successful dump is http://download.wikimedia.org/enwiki/20070527/enwiki-20070527-pages-articles.xml.bz2

If you copy it wherever, I'll fetch it from there and test it.

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

OK, I applied the patch and am testing this. I updated the build file to point to http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

I am getting the following when I apply this patch:

[java] Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException: Element type "Pat" must be followed by either attribute specifications, ">" or "/>".
[java]     at org.apache.lucene.benchmark.utils.ExtractWikipedia.extract(ExtractWikipedia.java:184)
[java]     at org.apache.lucene.benchmark.utils.ExtractWikipedia.main(ExtractWikipedia.java:199)
[java] Caused by: org.xml.sax.SAXParseException: Element type "Pat" must be followed by either attribute specifications, ">" or "/>".
[java]     at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:1213)
[java]     at org.apache.xerces.framework.XMLDocumentScanner.reportFatalXMLError(XMLDocumentScanner.java:579)
[java]     at org.apache.xerces.framework.XMLDocumentScanner.abortMarkup(XMLDocumentScanner.java:628)
[java]     at org.apache.xerces.framework.XMLDocumentScanner.scanElement(XMLDocumentScanner.java:1800)
[java]     at org.apache.xerces.framework.XMLDocumentScanner$ContentDispatcher.dispatch(XMLDocumentScanner.java:1182)
[java]     at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
[java]     at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:1098)
[java]     at org.apache.lucene.benchmark.utils.ExtractWikipedia.extract(ExtractWikipedia.java:181)

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Let me see if I can replicate.

Can you do a sha1sum on your enwiki-20070527-pages-articles.xml.bz2 so I can be sure my copy is valid?

Mine's 263f94e857882e4a379ac60372201467e343db50

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

OK, I downloaded a fresh copy. sha1sum on the bz2 file yields 76402fed3b6f6583aa283db5dbbba83abbf65d74 when downloaded from http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2

ls -l yields:
... 1778897799 ... enwiki-20070527-pages-articles.xml
... 477278208 ... enwiki-20070527-pages-articles.xml.bz2

Now my error is:

[java] Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException: Attribute name "Td" must be followed by the '=' character.
[java]     at org.apache.lucene.benchmark.utils.ExtractWikipedia.extract(ExtractWikipedia.java:184)
[java]     at org.apache.lucene.benchmark.utils.ExtractWikipedia.main(ExtractWikipedia.java:199)
[java] Caused by: org.xml.sax.SAXParseException: Attribute name "Td" must be followed by the '=' character.
[java]     at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:1213)
[java]     at org.apache.xerces.framework.XMLDocumentScanner.reportFatalXMLError(XMLDocumentScanner.java:598)
[java]     at org.apache.xerces.framework.XMLDocumentScanner.abortMarkup(XMLDocumentScanner.java:636)
[java]     at org.apache.xerces.framework.XMLDocumentScanner.scanElement(XMLDocumentScanner.java:1761)
[java]     at org.apache.xerces.framework.XMLDocumentScanner$ContentDispatcher.dispatch(XMLDocumentScanner.java:1182)
[java]     at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
[java]     at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:1098)
[java]     at org.apache.lucene.benchmark.utils.ExtractWikipedia.extract(ExtractWikipedia.java:181)
[java]     ... 1 more

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Actually, I just noticed wikimedia provides the md5 hashes. I was able to validate my copy.

I don't actually remember if I got my copy from wikimedia or from p.a.o.

The copy in your ls -l looks bad, both from the sha1sum and from the size. Looks like your file is truncated: the file length is 455M (if 477278208 is the size in bytes) and the real file is 2686431976 (2.6G) bytes.

Can you check the file on p.a.o, both the size and the md5 hash? The latter should be fc24229da9af033cbb55b9867a950431 (http://download.wikimedia.org/enwiki/20070527/enwiki-20070527-md5sums.txt)

I should be able to launch a test of the unzip/extract tonight. It takes a while.

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Weird, this is the info on p.a.o: ... 2686431976 May 30 02:17 enwiki-20070527-pages-articles.xml.bz2

So, I don't know what is up w/ my download. I am surprised it uncompressed. p.a.o. doesn't have sha1sum.

Anyway, I am trying to download using wget and it lists the file size at 2.5G, so hopefully this will download.



asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

OK, looks like that one went through, using wget. I think I will commit as there must have been something screwed up on my network side.

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

I take back my promise to commit, I am getting (after processing 189500 docs):

[java] Error: cannot execute the algorithm! term out of order ("docid:disrs".compareTo("docname:disregardle &*Ar") <= 0)
[java] org.apache.lucene.index.CorruptIndexException: term out of order ("docid:disrs".compareTo("docname:disregardle &*Ar") <= 0)
[java]     at org.apache.lucene.index.TermInfosWriter.add(TermInfosWriter.java:102)
[java]     at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:332)
[java]     at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:297)
[java]     at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:261)
[java]     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
[java]     at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1883)
[java]     at org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:1811)
[java]     at org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:1742)
[java]     at org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:1733)
[java]     at org.apache.lucene.index.IndexWriter.maybeFlushRamSegments(IndexWriter.java:1727)
[java]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1004)
[java]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:983)
[java]     at org.apache.lucene.benchmark.byTask.tasks.AddDocTask.doLogic(AddDocTask.java:74)
[java]     at org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTask.java:83)
[java]     at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doSerialTasks(TaskSequence.java:107)
[java]     at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doLogic(TaskSequence.java:93)
[java]     at org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTask.java:90)
[java]     at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doSerialTasks(TaskSequence.java:107)
[java]     at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doLogic(TaskSequence.java:93)
[java]     at org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTask.java:90)
[java]     at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doSerialTasks(TaskSequence.java:107)
[java]     at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doLogic(TaskSequence.java:93)
[java]     at org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTask.java:90)
[java]     at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doSerialTasks(TaskSequence.java:107)
[java]     at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doLogic(TaskSequence.java:93)
[java]     at org.apache.lucene.benchmark.byTask.utils.Algorithm.execute(Algorithm.java:228)
[java]     at org.apache.lucene.benchmark.byTask.Benchmark.execute(Benchmark.java:72)
[java]     at org.apache.lucene.benchmark.byTask.Benchmark.main(Benchmark.java:108)
[java] ####################
[java] ### D O N E !!! ###
[java] ####################

Can you reproduce this? It seems like an actual issue with core.

asfimport commented 17 years ago

Steven Parkes (migrated from JIRA)

Trying to reproduce now.

Something that came up while restarting the fetch/decompress/etc. was the number of files this procedure creates. It's a lot: one for each article. I used the existing benchmark code for doing this stuff but perhaps it's not a good idea on this scale? For one thing, it kinda kills ant since ant wants to do a walk of subtrees for some of its tasks. Either we need to exclude the work and temp directories from ant's walks and/or we should come up with something better than one file per article.

I think Mike mentioned not doing the one file per article. I'll try to look at that ...

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

Steven wrote: > I think Mike mentioned not doing the one file per article. I'll try to look at that ...

Perhaps also (re)consider the "decompress and add on-the-fly" approach, similar to what TrecDocMaker is doing?

Grant wrote:
> I take back my promise to commit, I am getting (after processing 189500 docs):
> [java] Error: cannot execute the algorithm! term out of order ("docid:disrs".compareTo("docname:disregardle &*Ar") <= 0)
> [java] org.apache.lucene.index.CorruptIndexException: term out of order ("docid:disrs".compareTo("docname:disregardle &*Ar") <= 0)

Just to verify that it is not a benchmark issue, could you also post here the executed algorithm (as printed, or, if not printed, the actual file)...?

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I think Mike mentioned not doing the one file per article. I'll try to look at that ...

I'm actually [slowly] working through a patch to contrib/benchmark that adds a LineDocMaker that will open a single file and make one document per line. (This is the follow-through on #1918 to merge the benchmarking tool that I used there into contrib/benchmark.) This is in order to do tests that aren't affected by the time to decompress files/walk trees/open new files/etc. to build their documents.

I will also include in the patch some way to run an existing DocMaker, pull its documents, and store them into a single line file. It's probably still worthwhile to have a DocMaker that can read the single wikipedia XML file and produce documents directly from that, to avoid creating a file per document in a large dir tree.
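To make the one-document-per-line idea concrete, here is a minimal sketch (not Mike's actual LineDocMaker; the tab separator and field names are assumptions): producing the next document is just a readLine() plus a split, with no per-document file opens or directory walks.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Illustrative reader: one article per line, title and body separated
    // by a tab. Assumes well-formed lines; returns null at end of file.
    public class LineDocReader {
      private BufferedReader in;

      public LineDocReader(String path) throws Exception {
        in = new BufferedReader(new FileReader(path));
      }

      public Document next() throws Exception {
        String line = in.readLine();
        if (line == null) return null;            // end of corpus
        int tab = line.indexOf('\t');             // assumed field separator
        Document doc = new Document();
        doc.add(new Field("title", line.substring(0, tab),
                          Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("body", line.substring(tab + 1),
                          Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
      }
    }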