Hi,
Just want to check a couple of things. How long did you wait for the tokenizer threads to finish? It might take them a while to finish the job even after all the articles have been added to the queue. Could you please try waiting 15+ minutes? Please let me know if that does not work.
Volodymyr
Original comment by halle...@gmail.com on 10 Nov 2008 at 11:55
Hi,
I have waited for 3+ hours, but it didn't work.
chongyc27
Original comment by chongyc27@gmail.com on 10 Nov 2008 at 12:21
OK, I've found the issue with jawikibooks and jawikinews, but I'm not knowledgeable enough to fix it.
The issue lies in the CJK tokenizer implemented in Lucene. Chinese, Japanese, and Korean text needs special tokenizing rules because of the structure of those languages: a single token is usually a single character rather than a whitespace-delimited word. As a result, indexing becomes much slower and the index becomes much bigger.
Unfortunately I don't know Chinese, Japanese, or Korean, so I can't really see how to optimize the indexing process.
Original comment by halle...@gmail.com on 29 Nov 2008 at 10:18
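To make the tokenization difference concrete, here is a minimal sketch of how Lucene's CJKAnalyzer splits Japanese text into many short overlapping tokens instead of whitespace-delimited words, which is what drives the token count, and hence the indexing time and index size, up. It assumes a recent Java Lucene with the analyzers-common module; BzReader itself uses Lucene.NET, whose CJK analyzer behaves the same way.

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class CjkTokenDemo {
        public static void main(String[] args) throws IOException {
            // CJKAnalyzer emits overlapping character bigrams for CJK runs,
            // so even a short phrase produces several index terms.
            Analyzer analyzer = new CJKAnalyzer();
            try (TokenStream ts = analyzer.tokenStream("text", new StringReader("日本語の文章"))) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Prints something like: 日本, 本語, 語の, の文, 文章
                    System.out.println(term.toString());
                }
                ts.end();
            }
            analyzer.close();
        }
    }

Because every adjacent pair of characters becomes a term, a CJK article yields far more postings than a comparable English article tokenized on whitespace, which matches the slowdown described above.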
I don't know any CJKV language either, and can't help with the tokenizing itself, but I'm working on adding an "ETA" feature to the lengthy indexing process, to ease the frustration. More on this when I've tested it some more.
Original comment by asaf.bartov on 3 Dec 2008 at 7:46
The ETA and progress bar update is committed, so large Wikipedia dumps should now be less frustrating during the initial indexing. Note that an official binary is not yet available.
Original comment by asaf.bartov on 7 Dec 2008 at 12:48
If BzReader used the C# bindings of Xapian instead of Lucene.NET as the indexer backend, it would work well for Korean and Japanese. Xapian is also faster than Lucene.
Original comment by chongyc27@gmail.com on 16 Jan 2009 at 9:54
Original issue reported on code.google.com by chongyc27@gmail.com on 10 Nov 2008 at 10:35