Hi,
Just want to check a couple of things. How long did you wait for the tokenizer threads to finish? It might take them a while to finish the job even after all the articles have been added to the queue. Could you please try waiting 15+ minutes? Please let me know if that does not work.
Volodymyr
Original comment by halle...@gmail.com on 10 Nov 2008 at 11:55
Hi,
I have waited for 3+ hours, but it didn't work.
chongyc27
Original comment by chongyc27@gmail.com on 10 Nov 2008 at 12:21
OK, I've found the issue with jawikibooks and jawikinews, but I'm not knowledgeable enough to fix it.
The issue lies in the CJK tokenizer implemented in Lucene. Chinese, Japanese, and Korean text needs special tokenizing rules because of the structure of those languages: a single token is usually a single character rather than a whitespace-delimited word. As a result, indexing becomes much slower and the index becomes much bigger.
Unfortunately I don't know Chinese, Japanese, or Korean, so I can't really see how to optimize the indexing process.
Original comment by halle...@gmail.com on 29 Nov 2008 at 10:18
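To make the tokenization difference concrete, here is a minimal sketch of how Lucene's CJKAnalyzer splits Japanese text into many short overlapping tokens instead of whitespace-delimited words, which is what drives the token count, and hence the indexing time and index size, up. It assumes a recent Java Lucene with the analyzers-common module; BzReader itself uses Lucene.NET, whose CJK analyzer behaves the same way.

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class CjkTokenDemo {
        public static void main(String[] args) throws IOException {
            // CJKAnalyzer emits overlapping character bigrams for CJK runs,
            // so even a short phrase produces several index terms.
            Analyzer analyzer = new CJKAnalyzer();
            try (TokenStream ts = analyzer.tokenStream("text", new StringReader("日本語の文章"))) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Prints something like: 日本, 本語, 語の, の文, 文章
                    System.out.println(term.toString());
                }
                ts.end();
            }
            analyzer.close();
        }
    }

Because every adjacent pair of characters becomes a term, a CJK article yields far more postings than a comparable English article tokenized on whitespace, which matches the slowdown described above.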
I don't know any CJKV language either, and can't help with the tokenizing itself, but I'm working on adding an "ETA" feature to the lengthy indexing process, to ease the frustration. More on this when I've tested it some more.
Original comment by asaf.bartov on 3 Dec 2008 at 7:46
The ETA and progress bar update is committed, so large Wikipedia dumps should now be less frustrating during the initial indexing. Note that an official binary is not yet available.
Original comment by asaf.bartov on 7 Dec 2008 at 12:48
If BzReader used the C# bindings of Xapian instead of Lucene.NET as the indexer backend, it would work well for Korean and Japanese. Xapian is also faster than Lucene.
Original comment by chongyc27@gmail.com on 16 Jan 2009 at 9:54
Original issue reported on code.google.com by chongyc27@gmail.com on 10 Nov 2008 at 10:35