Tokenizer is not serializable for Apache Spark

atilika / kuromoji

Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search

Apache License 2.0


Tokenizer is not serializable for Apache Spark #85

Open lamrongol opened 8 years ago

lamrongol commented 8 years ago

On Apache Spark, instances must be serializable for parallel processing, but Kuromoji tokenizers are not, so they must be initialized each time they are used. If tokenizers were serializable, we could reduce processing time.
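As an illustration of the cost being described, one common way to avoid rebuilding an expensive object for every record is to hold it in a static field that each JVM initializes lazily, once. The sketch below uses only the Java standard library. `FakeTokenizer`, `Holder`, and `countTokens` are hypothetical stand-ins, not Kuromoji or Spark APIs, and the approach assumes the real tokenizer is safe to share between threads.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class TokenizerHolderDemo {
    // Counts how many times the (hypothetical) tokenizer is constructed.
    static final AtomicInteger CONSTRUCTIONS = new AtomicInteger();

    // Hypothetical stand-in for an expensive, non-serializable tokenizer.
    static class FakeTokenizer {
        FakeTokenizer() {
            CONSTRUCTIONS.incrementAndGet();
        }

        List<String> tokenize(String text) {
            return Arrays.asList(text.split("\\s+"));
        }
    }

    // Initialization-on-demand holder: the JVM creates INSTANCE once,
    // the first time Holder is touched, and never serializes it.
    static class Holder {
        static final FakeTokenizer INSTANCE = new FakeTokenizer();
    }

    // What a per-record function would call instead of "new FakeTokenizer()".
    static int countTokens(String record) {
        return Holder.INSTANCE.tokenize(record).size();
    }

    public static void main(String[] args) {
        int total = 0;
        for (String record : Arrays.asList("a b", "c d e", "f")) {
            total += countTokens(record);
        }
        // prints "6 tokens, 1 construction(s)"
        System.out.println(total + " tokens, " + CONSTRUCTIONS.get() + " construction(s)");
    }
}
```

Because the instance lives in a static field, it is never part of the object graph that gets serialized; each worker JVM would pay the construction cost once rather than once per record.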

cmoen commented 8 years ago

Thanks a lot, Fujikawa-san.

Instantiating Kuromoji takes a bit of time since it reads fairly large dictionaries into memory. Could you clarify how making the tokenizers serializable would help with this in the context of Spark?

I just don't know the detailed mechanisms, and I'd appreciate it if you could explain. Thanks!

lamrongol commented 8 years ago

Spark serializes the whole class at the beginning and then each machine processes it in parallel. Therefore, if the class contains a non-serializable instance, Spark throws an error, and you must initialize the tokenizer each time, as described at the following link: http://www.intellilink.co.jp/article/column/bigdata-kk01.html
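The failure mode described here can be reproduced with plain Java serialization, without Spark. The sketch below is a minimal illustration under stated assumptions: `FakeTokenizer`, `EagerFunction`, and `LazyFunction` are hypothetical names, not Kuromoji or Spark classes. It shows that an object holding a non-serializable field fails to serialize, and that marking the field `transient` and rebuilding it lazily after deserialization is one possible workaround.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

class TransientTokenizerDemo {
    // Hypothetical stand-in for a tokenizer that does not implement Serializable.
    static class FakeTokenizer {
        List<String> tokenize(String text) {
            return Arrays.asList(text.split("\\s+"));
        }
    }

    // Holds the tokenizer in an ordinary field, so serializing this object fails.
    static class EagerFunction implements Serializable {
        private static final long serialVersionUID = 1L;
        private final FakeTokenizer tokenizer = new FakeTokenizer();

        int countTokens(String text) {
            return tokenizer.tokenize(text).size();
        }
    }

    // Skips the tokenizer during serialization and rebuilds it on first use.
    static class LazyFunction implements Serializable {
        private static final long serialVersionUID = 1L;
        private transient FakeTokenizer tokenizer; // null after deserialization

        int countTokens(String text) {
            if (tokenizer == null) {
                tokenizer = new FakeTokenizer();
            }
            return tokenizer.tokenize(text).size();
        }
    }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    static boolean isSerializable(Object o) {
        try {
            serialize(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Serializes and deserializes an object, as shipping it to another JVM would.
    @SuppressWarnings("unchecked")
    static <T> T roundTrip(T o) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(serialize(o)))) {
            return (T) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isSerializable(new EagerFunction())); // prints "false"
        LazyFunction copy = roundTrip(new LazyFunction());
        System.out.println(copy.countTokens("a b c"));           // prints "3"
    }
}
```

With the `transient` approach the tokenizer is still constructed on each receiving side, but only once per deserialized function object rather than once per record.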

lamrongol commented 8 years ago

I've tried to make the kuromoji-core classes Serializable, but I have not been able to serialize Tokenizer because java.nio.HeapByteBuffer is not serializable. This work may take a lot of effort.
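For context, `java.nio.ByteBuffer` and its subclasses do not implement `Serializable`, so a class that holds one cannot rely on default serialization. One possible approach is to mark the buffer `transient` and copy its bytes by hand in `writeObject` and `readObject`. The sketch below is an assumption-laden illustration: `BufferBackedDictionary` is a hypothetical class, not part of kuromoji-core, and the approach only restores the buffer contents, not its position or limit.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;

class ByteBufferHolderDemo {
    // Hypothetical stand-in for a dictionary class that keeps its data in a ByteBuffer.
    static class BufferBackedDictionary implements Serializable {
        private static final long serialVersionUID = 1L;
        private transient ByteBuffer buffer; // skipped by default serialization

        BufferBackedDictionary(byte[] data) {
            this.buffer = ByteBuffer.wrap(data.clone());
        }

        byte get(int index) {
            return buffer.get(index);
        }

        int size() {
            return buffer.capacity();
        }

        // Writes the buffer contents as a length-prefixed byte array.
        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            byte[] copy = new byte[buffer.capacity()];
            ByteBuffer view = buffer.duplicate(); // leaves the original position untouched
            view.clear();                         // position = 0, limit = capacity
            view.get(copy);
            out.writeInt(copy.length);
            out.write(copy);
        }

        // Rebuilds the buffer from the length-prefixed byte array.
        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            byte[] data = new byte[in.readInt()];
            in.readFully(data);
            buffer = ByteBuffer.wrap(data);
        }
    }

    // Serializes and deserializes an object in memory.
    @SuppressWarnings("unchecked")
    static <T> T roundTrip(T o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Object raw = ByteBuffer.allocate(1);
        System.out.println(raw instanceof Serializable); // prints "false"

        BufferBackedDictionary copy =
            roundTrip(new BufferBackedDictionary(new byte[] {10, 20, 30}));
        System.out.println(copy.size() + " bytes, second byte = " + copy.get(1));
        // prints "3 bytes, second byte = 20"
    }
}
```

Note the trade-off: serializing this way means the full buffer contents travel with the object, so for large dictionaries it is not obvious that this would be faster than reloading them on each machine.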

lamrongol commented 8 years ago

These are the changes I made (sorry, unnecessary whitespace diffs are included): https://github.com/lamrongol/kuromoji/commit/415e0fbc242d891e0708aaeacbb7a18ed478fee9

I made them using my tool: https://github.com/lamrongol/MakeJavaClassSerializable

akkikiki commented 8 years ago

I was looking into the "Tuning Spark" document for Spark 1.2.0, and there is a section mentioning that storing data in serialized form helps reduce memory usage on Spark. Perhaps Fujikawa-san is trying to do something similar?

It is interesting that there is also a downside to this:

The only downside of storing data in serialized form is slower access times, due to having to deserialize each object on the fly.

lamrongol commented 8 years ago

@akkikiki If the instance is not serializable, Spark doesn't work at all: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html

By the way, I think everyone writing here understands Japanese, so wouldn't it be fine to write in Japanese?

lamrongol commented 8 years ago

Sorry, I'm not familiar with Kuromoji, but I think it reads the dictionary file during processing, which makes it poorly suited to being Serializable. If Kuromoji had a new mode that kept all data in memory, I think it could become Serializable.