deeplearning4j / deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math code and a java based math library on top of the core c++ library. Also includes samediff: a pytorch/tensorflow like library for running deep learn...
http://deeplearning4j.konduit.ai
Apache License 2.0

Support for Keras-like text tokenizer #2844

Closed. ConnorBarnhill closed this issue 7 years ago.

ConnorBarnhill commented 7 years ago

Trying to import an LSTM for text classification written in Keras (in the spirit of this article). The model import from Keras to dl4j is straightforward, but it isn't clear what the best way is to transfer the tokenizer (docs) to dl4j. Docs or code would help.
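
For reference, the model-import side looks roughly like this (a minimal sketch assuming the Keras model was saved to a single HDF5 file with model.save(...) in Keras; the filename below is a placeholder):

```java
import org.deeplearning4j.nn.modelimport.keras.KerasModelImport;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;

public class ImportLstmExample {
    public static void main(String[] args) throws Exception {
        // Path to the HDF5 file produced by model.save(...) in Keras (placeholder name).
        String modelPath = "lstm_text_classifier.h5";

        // Import a Keras Sequential model (architecture + weights) as a MultiLayerNetwork.
        MultiLayerNetwork network = KerasModelImport.importKerasSequentialModelAndWeights(modelPath);
        System.out.println(network.summary());
    }
}
```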

raver119 commented 7 years ago

Isn't that just a BoW vectorizer?

agibsonccc commented 7 years ago

@raver119 read gitter first please :D. This stemmed from a discussion that ranged from auto-import all the way to just documenting what the dl4j equivalent of the Keras tokenizer is. @cb1542 to be clear, we already have these features. I'd like to figure out what we want to do here. Will let @turambar comment.

ConnorBarnhill commented 7 years ago

@raver119 It keeps the sequence. Each word is mapped to an integer representing the word's frequency rank in the corpus.

tokenizer.fit("I really really really really like it", "like it", "like")
tokenizer.tokenize("I like it") => [4, 2, 3]
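
To make that behaviour concrete, here is a minimal standalone Java sketch of the same frequency-rank mapping (a hypothetical helper, not an existing Keras or DL4J class; it assumes lowercasing and whitespace splitting and skips Keras' punctuation filtering):

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical illustration of Keras-style frequency-rank tokenization.
public class FrequencyRankTokenizer {
    private final Map<String, Integer> wordIndex = new HashMap<>();

    // Count word frequencies over the corpus, then assign rank 1 to the most
    // frequent word, 2 to the next, and so on.
    public void fit(List<String> texts) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String text : texts) {
            for (String token : text.toLowerCase().split("\\s+")) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<String> sorted = counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
        for (int i = 0; i < sorted.size(); i++) {
            wordIndex.put(sorted.get(i), i + 1); // ranks start at 1
        }
    }

    // Map a sentence to the ranks of its known words; unknown words are skipped.
    public List<Integer> tokenize(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .filter(wordIndex::containsKey)
                .map(wordIndex::get)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        FrequencyRankTokenizer t = new FrequencyRankTokenizer();
        t.fit(Arrays.asList("I really really really really like it", "like it", "like"));
        System.out.println(t.tokenize("I like it")); // prints [4, 2, 3]
    }
}
```

Running main prints [4, 2, 3], matching the example above.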

turambar commented 7 years ago

@cb1542 @raver119 TL;DR Concur with @agibsonccc and @raver119: Between DataVec and deeplearning4j-nlp, I'm pretty sure this is supported in some form. The question is how to facilitate "import" or replication of a Python pipeline using Keras' tokenizer in the DL4J ecosystem.

In the short term, I think the best we can do is provide documentation on the Keras model import page, or examples illustrating how to replicate it. I'll invite @tomthetrainer to comment. Here's the link to Keras' text preprocessing utilities; they have utilities for images and sequences as well.

We can consider other options for the long run.

raver119 commented 7 years ago

This particular tokenizer isn't supported. By "word rank" it means "word index in a list sorted by frequency", I guess, so it should be very easy to add such a tokenizer, because the w2v routines require the same sorting for the Huffman tree.

P.S. It's a vectorizer in terms of dl4j/datavec; the tokenizer itself is still whitespace-based as far as I can see. And at the moment we have three vectorizers: BoW, TF-IDF, and w2v. This one could be the fourth.
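
For illustration, a rough sketch of what such a fourth vectorizer could emit (hypothetical code, not an existing DataVec/DL4J class; assumes only ND4J on the classpath): instead of a BoW/TF-IDF count vector per document, it would produce a zero-padded sequence of frequency ranks that an LSTM can consume.

```java
import java.util.HashMap;
import java.util.Map;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

// Hypothetical "sequence" vectorizer: instead of BoW/TF-IDF counts, it emits a
// zero-padded vector of frequency ranks (0 = padding or unknown word).
public class RankSequenceVectorizer {
    private final Map<String, Integer> wordIndex; // word -> frequency rank (1-based)
    private final int maxLen;

    public RankSequenceVectorizer(Map<String, Integer> wordIndex, int maxLen) {
        this.wordIndex = wordIndex;
        this.maxLen = maxLen;
    }

    public INDArray transform(String text) {
        float[] padded = new float[maxLen];
        int i = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            Integer rank = wordIndex.get(token);
            if (rank != null && i < maxLen) {
                padded[i++] = rank;
            }
        }
        return Nd4j.create(padded); // e.g. ranks [4, 2, 3, 0, 0] for "I like it" with maxLen = 5
    }

    public static void main(String[] args) {
        // Word index from the example earlier in the thread.
        Map<String, Integer> index = new HashMap<>();
        index.put("really", 1);
        index.put("like", 2);
        index.put("it", 3);
        index.put("i", 4);
        System.out.println(new RankSequenceVectorizer(index, 5).transform("I like it"));
    }
}
```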

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.