atilika / kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Apache License 2.0
950 stars 131 forks source link

Android runtime exception when creating new Tokenizer using kuromoji-ipadic #98

Closed hk0i closed 8 years ago

hk0i commented 8 years ago

I implemented some test code on Android but I get a runtime exception when trying to create a new Tokenizer:

 java.lang.RuntimeException: Could not load dictionaries.

I'm using the kuromoji-ipadic package (com.atilika.kuromoji:kuromoji-ipadic:0.9.0) I've traced through the stack a bit to find this:

Caused by: java.lang.IllegalArgumentException: capacity < 0: -4
     at java.nio.ByteBuffer.allocate(ByteBuffer.java:54)
     at com.atilika.kuromoji.io.IntegerArrayIO.readArray(IntegerArrayIO.java:38)
     at com.atilika.kuromoji.buffer.WordIdMap.<init>(WordIdMap.java:35)
     at com.atilika.kuromoji.dict.TokenInfoDictionary.setup(TokenInfoDictionary.java:168)
     at com.atilika.kuromoji.dict.TokenInfoDictionary.newInstance(TokenInfoDictionary.java:160)
     at com.atilika.kuromoji.ipadic.Tokenizer$Builder.loadDictionaries(Tokenizer.java:219)
     at com.atilika.kuromoji.TokenizerBase.configure(TokenizerBase.java:77) 
     at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:74) 
     at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:59) 
     at com.atilika.kuromoji.ipadic.Tokenizer$Builder.build(Tokenizer.java:203) 

Which is crashing in IntegerArrayIO at:

public class IntegerArrayIO {

    private static final int INT_BYTES = Integer.SIZE / Byte.SIZE;

    public static int[] readArray(InputStream input) throws IOException {
        DataInputStream dataInput = new DataInputStream(input);
        int length = dataInput.readInt(); // length is returning -1

        ByteBuffer tmpBuffer = ByteBuffer.allocate(length * INT_BYTES); // -1 * 4 = -4, crashes with "capacity < 0:-4"
        ReadableByteChannel channel = Channels.newChannel(dataInput);
        channel.read(tmpBuffer);

I realize the library probably wasn't originally intended for Android, but I'd really like to try it and do some testing to see if it's viable for use on mobile as well.

The code that triggers the original exception is just the construction of the Tokenizer via the builder:

        Tokenizer tokenizer = new Tokenizer.Builder().build();

The IntegerArrayIO class's readArray() method seems to handle fine until trying to load the word ID map:

    public WordIdMap(InputStream input) throws IOException {
        indices = IntegerArrayIO.readArray(input);
        wordIds = IntegerArrayIO.readArray(input); // crashes here
    }

The above code is triggered by TokenInfoDictionary#setup():

        wordIdMap = new WordIdMap(resolver.resolve(TARGETMAP_FILENAME));

Here is the full stack trace (minus the parts that reference my code, which I think are irrelevant):

java.lang.RuntimeException: Could not load dictionaries.
   at com.atilika.kuromoji.ipadic.Tokenizer$Builder.loadDictionaries(Tokenizer.java:231)
   at com.atilika.kuromoji.TokenizerBase.configure(TokenizerBase.java:77)
   at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:74)
   at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:59)
   at com.atilika.kuromoji.ipadic.Tokenizer$Builder.build(Tokenizer.java:203)

(... my sources omitted for brevity ...)

 Caused by: java.lang.IllegalArgumentException: capacity < 0: -4
   at java.nio.ByteBuffer.allocate(ByteBuffer.java:54)
   at com.atilika.kuromoji.io.IntegerArrayIO.readArray(IntegerArrayIO.java:38)
   at com.atilika.kuromoji.buffer.WordIdMap.<init>(WordIdMap.java:35)
   at com.atilika.kuromoji.dict.TokenInfoDictionary.setup(TokenInfoDictionary.java:168)
   at com.atilika.kuromoji.dict.TokenInfoDictionary.newInstance(TokenInfoDictionary.java:160)
   at com.atilika.kuromoji.ipadic.Tokenizer$Builder.loadDictionaries(Tokenizer.java:219)
   at com.atilika.kuromoji.TokenizerBase.configure(TokenizerBase.java:77) 
   at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:74) 
   at com.atilika.kuromoji.ipadic.Tokenizer.<init>(Tokenizer.java:59) 
   at com.atilika.kuromoji.ipadic.Tokenizer$Builder.build(Tokenizer.java:203)

and the sample code I was trying to test with:

        StringBuilder stringBuilder = new StringBuilder();
        Tokenizer tokenizer = new Tokenizer.Builder().build();
        for (Token token : tokenizer.tokenize(text)) {
            stringBuilder.append(token.getSurface());
            stringBuilder.append(" ");
        }

When I test kuromoji on my desktop it seems to work fine on the local JVM, but on Android it crashes. I'll continue to dig deeper and see if I can find out what's going on. If there's anything else you need let me know.

あざーす!

hk0i commented 8 years ago

Update: I get the same exception using the jumandic version, so it's probably more related to kuromoji-core, which is where the IntegerArrayIO.readArray() is being handled

gerryhocks commented 8 years ago

Hi hk0i,

I'm having a look at this issue for you.

If you are testing using an Android Virtual Device, could you please give me an overview of its AVD configuration (version and memory size etc). If you are testing on a real device, a broad overview of that device would be very helpful.

gerryhocks commented 8 years ago

I managed to reproduce the problem - it exists in Kuromoji 1.0 too.

It is caused by the nio wrappers around an InputStream altering its offset in a manner that differs from the regular JVM's nio implementation, presumably due to buffering and readaheads. When loading the WordIdMap, this causes the reading of the second of two arrays read from the same InputStream to fail.

I've have a workaround and have managed to tokenize a sentence successfully in an Android app. I'll improve the performance of this workaround and submit a patch shortly.

cmoen commented 8 years ago

Thanks for the detailed analysis @hk0i and for the fix @gerryhocks. The fix has been merged to master, but I think this fix could warrant a 0.9.1 release to make Android users happy.

hk0i commented 8 years ago

Sorry I couldn't get back to you sooner @gerryhocks. Looks like you figured it out either way.

For the record I was testing on an LG Nexus 5X, but as you said the issue was more generic than device-specific. Thanks for all the hard work everyone.

I was able to create a similar patch on my own as a quick temporary work around but the performance wasn't so great so I'd like to see what you came up with. Tokenization on my 5X takes around 2-3 seconds so I have to do it asynchronously.

I'll dig through the revision history on master to see if I can find it.

gerryhocks commented 8 years ago

No problem @hk0i - the problem manifested on my first attempt, even on an x86 image.. perhaps I was premature asking for extra details.

The changes are not overly extensive and in the same flavour as the earlier version so should be easy to spot.

Out if interest, how much text are you tokenizing when you experience 2-3 seconds of processing?

hk0i commented 8 years ago

I just took a look at your changes and—from memory—the hack that I put in to get things "working" was very different.

As for the amount of text to tokenize—not much. From my experience short sentences and long paragraphs were taking close enough to the same amount of time to tokenize.

The following text (on my implementation, not the fix that was just added) takes 2.52 seconds:

新しい言葉使いましょう

However, something longer—an excerpt from the Wikipedia entry on 日本語—takes 2.40 seconds:

日本語(にほんご、にっぽんご)は、主に日本国内や日本人同士の間で使われている言語である。日本は法令によって

Again this is not the official change from upstream but a different quick fix I found for this in another ticket.

I'll be able to pull the official kuromoji change from master and test the performance of that change later today within the next 5 hours or so.

gerryhocks commented 8 years ago

Understood. Thanks!

Would I be right in assuming that the timing includes the creation of a Tokenizer, followed by the tokenize() call itself?

hk0i commented 8 years ago

That's right, I'm also iterating over the tokens with a foreach loop, something like this:

        Tokenizer tokenizer = new Tokenizer.Builder().build();
        for (Token token : tokenizer.tokenize(mInputText)) {
            /* store tokens in a list... */
        }
hk0i commented 8 years ago

I just gave it a new shot using the code from your fix, @gerryhocks. It actually seems to be a few hundred milliseconds slower than the hot-fix I was using prior.

Testing with the same two sentences this is what I get...

Around 2.7–2.8 seconds for the first sentence:

新しい言葉使いましょう

and around roughly the same amount of time for the second sentence (2.75–2.82s 2.75–2.90s):

日本語(にほんご、にっぽんご)は、主に日本国内や日本人同士の間で使われている言語である。日本は法令によって

The hot fix I was using was less centalized than what you did and more focused on IntegerArrayIO.java specifically:

diff --git a/kuromoji-core/src/main/java/com/atilika/kuromoji/io/IntegerArrayIO.java b/kuromoji-core/src/main/java/com/atilika/kuromoji/io/IntegerArrayIO.java
index 55ab53e..862222d 100644
--- a/kuromoji-core/src/main/java/com/atilika/kuromoji/io/IntegerArrayIO.java
+++ b/kuromoji-core/src/main/java/com/atilika/kuromoji/io/IntegerArrayIO.java
@@ -36,8 +36,9 @@ public class IntegerArrayIO {
         int length = dataInput.readInt();

         ByteBuffer tmpBuffer = ByteBuffer.allocate(length * INT_BYTES);
-        ReadableByteChannel channel = Channels.newChannel(dataInput);
-        channel.read(tmpBuffer);
+        byte[] buffer = new byte[tmpBuffer.remaining()];
+        dataInput.readFully(buffer);
+        tmpBuffer.put(buffer, 0, buffer.length);

         tmpBuffer.rewind();
         IntBuffer intBuffer = tmpBuffer.asIntBuffer();

I'm guessing that because I'm trying to use this code on mobile devices that these things become more time critical because working with my laptop the time taken for tokenization is not even noticeable.

Edit: Updated times for second sentence.. Could it be possible object creation overhead? I think your solution also added some more loops maybe, I'm not sure how the internals of readFully() work.

gerryhocks commented 8 years ago

Thanks again for the detailed info @hk0i.

If you were using the complete master source (1.0), it's worth noting that it uses a different underlying data structure (an FST), whereas the 0.9 release uses a Double Array Trie.

The new structure has a very small performance difference when loading the default dictionaries, but is significantly faster when loading user dictionaries.

These changes could account for the difference in times you are experiencing.

The instantiation of the Tokenizer() is going to be taking most of the time in your example. Depending on how you use the Tokenizer, it might be worth considering using a static instance or a singleton so the Tokenizer is only created once within the app.. naturally this will prevent the tokenizer from being garbage collected, but it might be a reasonable trade-off for you.

hk0i commented 8 years ago

Thanks again @gerryhocks.

I gave it another shot with a static instance and the performance improves significantly, it went from 2-3 for the first instantiation to seconds to 0-100 ms when reusing the same instance. I'll probably go forward with this.

gerryhocks commented 8 years ago

Glad to hear you've found a workable solution, @hk0i. Best of luck with your project.

cmoen commented 8 years ago

Thanks a lot for sorting this out. I'll close the issue.