atilika / kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
Apache License 2.0

Kuromoji on Android #96

Closed jendib closed 8 years ago

jendib commented 8 years ago

Hello and thank you for your library.

I tried to use Kuromoji on Android (it's actually a bit overkill for me; I'm only trying to convert Japanese text to romaji for pronunciation). I encountered this error:

java.lang.IllegalArgumentException: capacity < 0: -4
  at java.nio.ByteBuffer.allocate(ByteBuffer.java:54)
  at com.atilika.kuromoji.io.IntegerArrayIO.readArray(IntegerArrayIO.java:38)
  at com.atilika.kuromoji.buffer.WordIdMap.<init>(WordIdMap.java:35)
  at com.atilika.kuromoji.dict.TokenInfoDictionary.setup(TokenInfoDictionary.java:168)
  at com.atilika.kuromoji.dict.TokenInfoDictionary.newInstance(TokenInfoDictionary.java:160)
  at com.atilika.kuromoji.ipadic.Tokenizer$Builder.loadDictionaries(Tokenizer.java:219)
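
For context, the crash happens while the bundled dictionaries are loaded, i.e. as soon as the tokenizer is constructed. A minimal sketch of roughly the kind of usage being attempted (standard kuromoji-ipadic API; the final katakana-to-romaji step is not shown):

    import java.util.List;

    import com.atilika.kuromoji.ipadic.Token;
    import com.atilika.kuromoji.ipadic.Tokenizer;

    public class RomajiDemo {
        public static void main(String[] args) {
            // Constructing the tokenizer loads the bundled dictionaries;
            // the stack trace above is thrown from this step on Android.
            Tokenizer tokenizer = new Tokenizer();

            List<Token> tokens = tokenizer.tokenize("日本語の文章");
            for (Token token : tokens) {
                // getReading() returns the katakana reading, which would
                // still need a separate katakana-to-romaji conversion.
                System.out.println(token.getSurface() + "\t" + token.getReading());
            }
        }
    }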

I guess I'm hitting a memory limit and this is not the library I'm looking for.

Can you confirm? And do you have a better idea to extract romaji from Japanese text?

Thanks a lot.

Lakedaemon commented 8 years ago

On 01/02/2016 12:45 AM, Benjamin Gamard wrote:

> Hello and thank you for your library.
>
> I tried to use Kuromoji on Android (it's actually a bit overkill for me; I'm only trying to convert Japanese text to romaji for pronunciation).

I'm using the JapaneseTokenizer from Lucene 4.7.2 (the last Lucene release that is usable on Android, because starting with Lucene 4.8 they started using try-with-resources). It comes from an earlier version of Kuromoji and works just fine (it's awesome, actually).

It's useful for converting Japanese text to romaji (I'm using it in the Tenjin Dictionary), but also for glossing, grammar parsing, etc. (really awesome).

You'll need around 10-16 MB of mmapped ByteBuffer (native memory, which doesn't count against your app's heap limit) to use it.

> I encountered this error:
>
>     java.lang.IllegalArgumentException: capacity < 0: -4
>       at java.nio.ByteBuffer.allocate(ByteBuffer.java:54)
>       at com.atilika.kuromoji.io.IntegerArrayIO.readArray(IntegerArrayIO.java:38)
>       at com.atilika.kuromoji.buffer.WordIdMap.<init>(WordIdMap.java:35)
>       at com.atilika.kuromoji.dict.TokenInfoDictionary.setup(TokenInfoDictionary.java:168)
>       at com.atilika.kuromoji.dict.TokenInfoDictionary.newInstance(TokenInfoDictionary.java:160)
>       at com.atilika.kuromoji.ipadic.Tokenizer$Builder.loadDictionaries(Tokenizer.java:219)
>
> I guess I'm hitting a memory limit and this is not the library I'm looking for.
>
> Can you confirm? And do you have a better idea to extract romaji from Japanese text?

If your text has kanji, Kuromoji is your best tool for the job. If your text only has kana, you can write a simple class that converts kana to romaji (see the sketch after this comment).

> Thanks a lot.

— Reply to this email directly or view it on GitHub https://github.com/atilika/kuromoji/issues/96.
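
For the kana-only case mentioned above, a lookup table is enough. A minimal sketch of such a converter (Hepburn-style; the class name is made up for illustration and the table is deliberately truncated, so unmapped characters are simply passed through):

    import java.util.LinkedHashMap;
    import java.util.Map;

    public final class KanaToRomaji {

        private static final Map<String, String> TABLE = new LinkedHashMap<>();
        static {
            // digraphs (two-character units)
            TABLE.put("きゃ", "kya");
            TABLE.put("しゃ", "sha");
            // single kana
            TABLE.put("あ", "a");
            TABLE.put("か", "ka");
            TABLE.put("き", "ki");
            TABLE.put("し", "shi");
            TABLE.put("す", "su");
            TABLE.put("ん", "n");
            // ... remaining kana omitted for brevity; katakana works the same way
        }

        public static String convert(String kana) {
            StringBuilder out = new StringBuilder();
            int i = 0;
            while (i < kana.length()) {
                // Try the longest match first, so digraphs win over single kana.
                String two = i + 2 <= kana.length() ? kana.substring(i, i + 2) : null;
                if (two != null && TABLE.containsKey(two)) {
                    out.append(TABLE.get(two));
                    i += 2;
                } else {
                    String one = kana.substring(i, i + 1);
                    out.append(TABLE.getOrDefault(one, one)); // pass unknown chars through
                    i += 1;
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(convert("すし")); // -> sushi
        }
    }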

jendib commented 8 years ago

Great answer, thanks a lot! I will try using the Lucene package. I'm closing this non-issue.

KonsomeJona commented 8 years ago

Hello, I got the same problem here on Android.

java.lang.IllegalArgumentException: capacity < 0: -4

Using JapaneseTokenizer from Lucene 4.7.2 is a workaround, not a fix, so this issue shouldn't be closed. I guess the library can't load its dictionaries (probably because of the try-with-resources issue mentioned above?). Does anyone have a solution or a fix?

Lakedaemon commented 8 years ago

I was using the JapaneseTokenizer on Android Gingerbread (which was severely limited in memory). The way I got it to work was:

1) use ipadic (it's the most compact dictionary for Kuromoji)
2) load the ConnectionCost matrix and the TokenInfo data through mmap and a direct ByteBuffer (native memory doesn't count against the heap), not an allocating ByteBuffer
3) modify the JapaneseTokenizer classes (with a custom IntSlice) so they work with a ByteBuffer instead of loading resource files
4) provide the ByteBuffers to the singletons before you use them
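
For illustration, a minimal Java sketch of step 2, assuming the dictionary data has already been extracted to a plain file on local storage (the class name and file handling are hypothetical): a READ_ONLY mapping gives a direct MappedByteBuffer backed by native memory, so it does not count against the Java heap.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public final class DictionaryMapper {

        // Map a dictionary file into native memory instead of reading it onto the heap.
        public static MappedByteBuffer map(File dictFile) throws IOException {
            FileInputStream in = new FileInputStream(dictFile);
            try {
                FileChannel channel = in.getChannel();
                // The mapping stays valid after the stream is closed.
                return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            } finally {
                in.close();
            }
        }
    }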

dict.zip

Here are some of my modifications (based on Lucene 4.7.0)

I use this (in Kotlin, not Java; Java sucks) like this:

    fun ZipsFolder.buildDictionaries3() {
        val readOnlyZipEntry = readOnlyZipEntry(TOKEN_INFO_BUFFER) ?: throw Exception("missing File $TOKEN_INFO_BUFFER")
        val readOnlyZipEntryB = readOnlyZipEntry(TOKEN_INFO_FST) ?: throw Exception("missing File $TOKEN_INFO_FST")

        // mmap the token info buffer directly from its offset in the zip file
        val byteBuffer = (readOnlyZipEntry.zipFile.inputStream() as FileInputStream).channel.use {
            it.map(FileChannel.MapMode.READ_ONLY, readOnlyZipEntry.offset, readOnlyZipEntry.mUncompressedLength)
        }

        // the FST data is gzipped, so stream it instead of mapping it
        GZIPInputStream(readOnlyZipEntryB.zipFile.inputStream().apply { skip(readOnlyZipEntryB.offset) }).use {
            TokenInfoDictionary.build(it, readOnlyZipEntry.extra, byteBuffer)
        }
        UnknownDictionary.build(null as InputStream?)
    }

    fun ZipsFolder.buildConnectionCosts3() {
        val readOnlyZipEntry = readOnlyZipEntry(CONNECTION_COSTS) ?: throw Exception("missing File $CONNECTION_COSTS")

        // mmap the connection cost matrix as a direct, read-only ByteBuffer
        val byteBuffer = (readOnlyZipEntry.zipFile.inputStream() as FileInputStream).channel.use {
            it.map(FileChannel.MapMode.READ_ONLY, readOnlyZipEntry.offset, readOnlyZipEntry.mUncompressedLength)
        }
        SpartanConnectionCosts.build(readOnlyZipEntry.extra, byteBuffer)
    }

You'll probably have to adapt my code a bit.

KonsomeJona commented 8 years ago

Thank you for your help! You should release an example project on GitHub that makes JapaneseTokenizer work on Android! I'm sure it would be helpful for a lot of people!

Lakedaemon commented 8 years ago

On 02/10/2016 10:11 AM, Jonathan Odul wrote:

> Thank you for your help! You should release an example project on GitHub that makes JapaneseTokenizer work on Android! I'm sure it would be helpful for a lot of people!

Sure, but I don't have the time to do that. This was the best I could do.

— Reply to this email directly or view it on GitHub https://github.com/atilika/kuromoji/issues/96#issuecomment-182267283.

sdcr commented 8 years ago

I had the same problem as the OP, and found another workaround.

The reason the capacity argument is < 0 is that WordIdMap uses IntegerArrayIO#readArray, which reads the capacity from the data stream (see IntegerArrayIO.java#L36). Digging deeper, WordIdMap calls readArray twice on the same stream, and it turns out that during the first call, the part

ByteBuffer tmpBuffer = ByteBuffer.allocate(length * INT_BYTES);
ReadableByteChannel channel = Channels.newChannel(dataInput);
channel.read(tmpBuffer);

in readArray reads fewer than length * INT_BYTES bytes from the stream. On the second call, the capacity is therefore read from the wrong location and arbitrarily becomes negative (-1 in my case, -4 in the original report). Going even further, the Channel instance I get is an InputStreamChannel, which uses

    byte[] buffer = new byte[dst.remaining()];
    int readBytes = in.read(buffer);

to read from the stream. According to the Javadoc, read may read fewer bytes than the array's full length. So instead of using a Channel, I directly filled a byte array using readFully on the DataInputStream, which fixed the problem. See commit c496e0 in my fork.
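
For readers who just want the gist of the change, here is a rough sketch in the same spirit (the class name is made up and the real signature of IntegerArrayIO.readArray may differ; the authoritative fix is the commit linked above):

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.ByteBuffer;

    public final class IntegerArrayReader {

        // Read a length-prefixed int[]: readFully() blocks until the whole
        // byte array is filled, so the stream can never be left misaligned
        // the way a short Channel.read() leaves it.
        public static int[] readArray(InputStream input) throws IOException {
            DataInputStream dataInput = new DataInputStream(input);
            int length = dataInput.readInt();

            byte[] raw = new byte[length * 4]; // 4 bytes per int
            dataInput.readFully(raw);

            ByteBuffer buffer = ByteBuffer.wrap(raw);
            int[] array = new int[length];
            for (int i = 0; i < length; i++) {
                array[i] = buffer.getInt();
            }
            return array;
        }
    }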

KonsomeJona commented 8 years ago

@sdcr Thank you for the fix, works like a charm.

weituotian commented 7 years ago

So what can we do to solve this problem?