-
Note: This only affects you if you use Java 5 on 3.x, and only if you want to download/rebuild the dictionary.
The analyzer itself works fine on 3.x with Java 5.
With Java 6, building…
-
Kuromoji has a segmentation mode for search that uses a heuristic to promote additional segmentation of long candidate tokens to get a decompounding effect. This heuristic has been improved. Patch i…
-
This is an alternative dictionary, somewhat larger (~25%).
We can support it in build.xml so that if a user wants to build with it, they can (the resulting jar file will be 500KB larger).
-
Reading through the IPADIC documentation, I realized we are storing a lot of redundant information.
For example, the connection costs for bigram weights are based on POS+inflection data, so it's redundant …
-
Lucene's common-build.xml 'validate' depends on compile-tools, but some
modules like icu, kuromoji, etc. have their own compile-tools target (for other reasons).
I think it should explicitly depend on common.…
-
The FSTs produced by Builder can be further shrunk if you are willing
to spend highish transient RAM to do so... our Builder today tries
hard not to use much RAM (and has options to tweak down the RAM…
-
Many Japanese katakana words end in a long sound that is sometimes optional.
For example, パーティー and パーティ are both perfectly valid for "party". Similarly we have センター and センタ that are variants of "ce…
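
To make the idea concrete, here is a rough sketch of one way to conflate such variants: drop a trailing prolonged sound mark (ー, U+30FC) from all-katakana terms. The class name `KatakanaStemSketch` and the minimum length of 4 are assumptions made for illustration only; this is not taken from any patch.

```java
public class KatakanaStemSketch {
    // KATAKANA-HIRAGANA PROLONGED SOUND MARK
    private static final char PROLONGED_SOUND_MARK = '\u30FC';
    // Assumed threshold: shorter terms are left alone so very short words are not over-stemmed.
    private static final int MIN_LENGTH = 4;

    /** Removes one trailing prolonged sound mark from an all-katakana term. */
    public static String stem(String term) {
        if (term.length() >= MIN_LENGTH
                && term.charAt(term.length() - 1) == PROLONGED_SOUND_MARK
                && isKatakana(term)) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    /** True if every char is in the Unicode Katakana block (which includes the prolonged sound mark). */
    private static boolean isKatakana(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.UnicodeBlock.of(s.charAt(i)) != Character.UnicodeBlock.KATAKANA) {
                return false;
            }
        }
        return true;
    }
}
```

With this sketch, `stem("パーティー")` and `stem("パーティ")` both yield パーティ, so the two spellings would match at search time, while a three-character word such as コピー is left unchanged.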
-
Not sure we should do this; it costs 5-10% performance for WFSTSuggester.
But maybe we can optimize something here, or maybe it's just no big deal to us.
Because in general, this could be pretty power…
-
What would you think of making CompressingStoredFieldsFormat the new default StoredFieldsFormat?
Stored fields compression has many benefits:
- it makes the I/O cache work for us,
- file-based index…
-
I modified BaseTokenStreamTestCase to assert that the start/end
offsets match for graph (posLen > 1) tokens, and this caught a bug in
Kuromoji when the decompounding of a compound token has a punctuat…
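
For illustration, a check of this kind can be sketched as follows. This is not the actual BaseTokenStreamTestCase code; `GraphOffsetCheck` and `Tok` are hypothetical stand-ins, and the invariant assumed here is that all tokens leaving a given position share a start offset and all tokens arriving at a given position share an end offset.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GraphOffsetCheck {
    /** Hypothetical stand-in for a token: position increment, position length, offsets. */
    public static final class Tok {
        final int posInc, posLen, start, end;

        public Tok(int posInc, int posLen, int start, int end) {
            this.posInc = posInc;
            this.posLen = posLen;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns false as soon as two tokens disagree on the offset of a shared position. */
    public static boolean offsetsConsistent(List<Tok> toks) {
        Map<Integer, Integer> startAt = new HashMap<>(); // position -> start offset
        Map<Integer, Integer> endAt = new HashMap<>();   // position -> end offset
        int pos = -1;
        for (Tok t : toks) {
            pos += t.posInc;
            Integer s = startAt.putIfAbsent(pos, t.start);
            if (s != null && s != t.start) {
                return false;
            }
            // A token with posLen > 1 arrives at a later position than its parts start from.
            Integer e = endAt.putIfAbsent(pos + t.posLen, t.end);
            if (e != null && e != t.end) {
                return false;
            }
        }
        return true;
    }
}
```

A compound token with posLen 2 covering offsets 0-4, followed by its two parts covering 0-2 and 2-4, passes this check. If the compound's end offset were 5 while its last part ended at 4, the check would fail.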