apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Building a Kuromoji dictionary is very slow and eventually fails if you use Java 5 [LUCENE-3696] #4770

Closed (asfimport closed this issue 12 years ago)

asfimport commented 12 years ago

Note: This only affects you if you use Java 5 on 3.x and want to download or rebuild the dictionary. The analyzer itself works fine on 3.x with Java 5.

With Java 6, building a Kuromoji dictionary is quite fast:

     [java] building tokeninfo dict...
     [java]   parse...
     [java]   sort...
     [java]   encode...
     [java]   53645 nodes, 253185 arcs, 1954817 bytes...   done
     [java] done
     [java] building unknown word dict...done
     [java] building connection costs...done

BUILD SUCCESSFUL
Total time: 6 seconds

However, with Java 5 the build runs far longer and eventually runs out of memory in the CSV parsing phase. So we might need to optimize the CSV parser (for example, by precompiling its regex patterns).

     [java] building tokeninfo dict...
     [java]   parse...
     [java] Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
     [java]     at java.util.regex.Pattern.newSlice(Pattern.java:2909)
     [java]     at java.util.regex.Pattern.atom(Pattern.java:1898)
     [java]     at java.util.regex.Pattern.sequence(Pattern.java:1794)
     [java]     at java.util.regex.Pattern.expr(Pattern.java:1687)
     [java]     at java.util.regex.Pattern.compile(Pattern.java:1397)
     [java]     at java.util.regex.Pattern.<init>(Pattern.java:1124)
     [java]     at java.util.regex.Pattern.compile(Pattern.java:817)
     [java]     at java.lang.String.replaceAll(String.java:2000)
     [java]     at org.apache.lucene.analysis.kuromoji.util.CSVUtil.unQuoteUnEscape(CSVUtil.java:84)
     [java]     at org.apache.lucene.analysis.kuromoji.util.CSVUtil.parse(CSVUtil.java:55)
     [java]     at org.apache.lucene.analysis.kuromoji.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:96)
     [java]     at org.apache.lucene.analysis.kuromoji.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
     [java]     at org.apache.lucene.analysis.kuromoji.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
     [java]     at org.apache.lucene.analysis.kuromoji.util.DictionaryBuilder.main(DictionaryBuilder.java:82)

BUILD FAILED
/home/rmuir/workspace/lucene-branch3x2/lucene/contrib/analyzers/kuromoji/build.xml:75: Java returned: 1

Total time: 2 minutes 4 seconds
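
For illustration: the stack trace above shows String.replaceAll() calling Pattern.compile() from inside CSVUtil.unQuoteUnEscape(), so a pattern is compiled for every call. A minimal sketch of the "precompile its patterns" idea follows. The class name, method names, and the doubled-quote pattern are assumptions made for this example, not the actual CSVUtil code.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the real CSVUtil: compares per-call compilation
// with a pattern that is compiled once and reused.
final class PrecompiledPatternSketch {
    // Assumed pattern: a doubled quote, the usual CSV escape for a quote.
    private static final Pattern ESCAPED_QUOTE = Pattern.compile("\"\"");

    // String.replaceAll compiles the regex again on every invocation.
    static String unescapePerCall(String field) {
        return field.replaceAll("\"\"", "\"");
    }

    // Reuses the pattern compiled once at class initialization.
    static String unescapePrecompiled(String field) {
        return ESCAPED_QUOTE.matcher(field).replaceAll("\"");
    }

    public static void main(String[] args) {
        String field = "say \"\"hi\"\"";
        // Both variants produce the same text; only the compile cost differs.
        System.out.println(unescapePerCall(field).equals(unescapePrecompiled(field)));
    }
}
```

Pattern instances are immutable and safe to share, so a static final field is the usual place to keep one.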

Migrated from LUCENE-3696 by Robert Muir (@rmuir), resolved Jan 16 2012. Attachments: LUCENE-3696.patch (versions: 2)

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here's a quick fix: use replace() instead of replaceAll(), and raise -Xmx from 512MB to 1GB.

Now it builds correctly on Java 5. Using 1GB is not ideal, but I think it is necessary if you are using a 64-bit Java 5 like me?

We could later try to optimize the dictionary construction to use less RAM so we can lower this (I have some ideas)
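
As a rough illustration of the replace() versus replaceAll() difference: String.replace(CharSequence, CharSequence) treats its target as literal text, while String.replaceAll() interprets its first argument as a regular expression. The sketch below is an assumption about what such a change could look like, with hypothetical class and method names. It is not the contents of LUCENE-3696.patch, and how much work replace() saves internally depends on the JDK version.

```java
// Hypothetical sketch: literal replacement instead of regex replacement.
final class LiteralReplaceSketch {
    // Assumed escape sequence: a doubled quote inside a CSV field.
    static final String ESCAPED_QUOTE = "\"\"";

    // replace() takes the target literally, so the caller passes no regex.
    static String unescape(String field) {
        return field.replace(ESCAPED_QUOTE, "\"");
    }

    public static void main(String[] args) {
        System.out.println(unescape("a\"\"b"));
        // The two methods differ when the target contains regex metacharacters:
        System.out.println("a.b".replace(".", "-"));    // literal dot only
        System.out.println("a.b".replaceAll(".", "-")); // "." matches any char
    }
}
```

For a fixed string such as a doubled quote, both calls return the same text, so the swap changes the cost and not the result.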

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

With the patch:

     [java] building tokeninfo dict...
     [java]   parse...
     [java]   sort...
     [java]   encode...
     [java]   53645 nodes, 253185 arcs, 1954817 bytes...   done
     [java] done
     [java] building unknown word dict...done
     [java] building connection costs...done

BUILD SUCCESSFUL
Total time: 10 seconds

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Updated patch: it just optimizes the CSV handling to generate less garbage.

I will commit this soon (bumping to Xmx756m in case someone uses Java 5).
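
One common way to "make less garbage" in a parser like this is to check cheaply whether a field needs any work and, if not, return the original string untouched. The sketch below shows that general technique only. The class, the method, and the exact unquoting rules are assumptions for illustration, not what the committed patch does.

```java
// Hypothetical sketch of a fast path that avoids allocating new strings
// for fields that contain no quote characters at all.
final class UnquoteFastPathSketch {
    static String unquoteUnescape(String field) {
        // Fast path: no quote anywhere, so return the same instance.
        if (field.indexOf('"') < 0) {
            return field;
        }
        String result = field;
        int len = result.length();
        // Strip one pair of surrounding quotes, if present.
        if (len >= 2 && result.charAt(0) == '"' && result.charAt(len - 1) == '"') {
            result = result.substring(1, len - 1);
        }
        // Collapse doubled quotes into single quotes (assumed CSV escaping).
        return result.replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        String plain = "abc";
        // The unquoted field comes back as the very same object.
        System.out.println(unquoteUnescape(plain) == plain);
        System.out.println(unquoteUnescape("\"a\"\"b\""));
    }
}
```

If most dictionary fields are unquoted, nearly every call takes the first branch and allocates nothing. Whether that holds for the kuromoji source files is not stated in this thread.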