Isn't that just a BoW vectorizer?
@raver119 read the Gitter discussion first please :D. This stemmed from a discussion that ranged from auto-importing the tokenizer all the way to just documenting what the DL4J equivalent of the Keras one is. @cb1542 to be clear, we already have these features. I'd like to figure out what we want to do here. Will let @turambar comment.
@raver119 It keeps the sequence. Each word is mapped to an integer representing the word's frequency rank in the corpus.
tokenizer.fit("I really really really really like it", "like it", "like")
tokenizer.tokenize("I like it") => [4, 2, 3]
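For reference, the actual Keras calls are `fit_on_texts` / `texts_to_sequences` rather than `fit` / `tokenize`; a minimal sketch, assuming Keras' default lowercasing (so "I" becomes "i"):

```python
from keras.preprocessing.text import Tokenizer

corpus = ["I really really really really like it", "like it", "like"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

# Indices are assigned by descending corpus frequency:
# {'really': 1, 'like': 2, 'it': 3, 'i': 4}
print(tokenizer.word_index)

# "I like it" -> [[4, 2, 3]]
print(tokenizer.texts_to_sequences(["I like it"]))
```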
@cb1542 @raver119 TL;DR Concur with @agibsonccc and @raver119: Between DataVec and deeplearning4j-nlp, I'm pretty sure this is supported in some form. The question is how to facilitate "import" or replication of a Python pipeline that uses Keras' tokenizer in the DL4J ecosystem.
In the short term, I think the best we can do is provide documentation on the Keras model import page, or examples illustrating how to replicate it. I'll invite @tomthetrainer to comment. Here's the link to Keras' text preprocessing utilities; they have preprocessing utilities for images and sequences as well.
We can consider other options for the long run.
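One concrete replication path would be to dump the fitted tokenizer's word index to JSON on the Python side and have the DL4J pipeline load that mapping and apply it after whitespace tokenization. A sketch of the export step (the file name and the JSON hand-off are assumptions for illustration, not an existing feature):

```python
import json

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["I really really really really like it", "like it", "like"])

# word_index maps each word to its frequency rank; serializing it is enough
# to reproduce texts_to_sequences on the JVM side.
with open("keras_word_index.json", "w") as f:
    json.dump(tokenizer.word_index, f)
```

On the DL4J side, loading that map and looking tokens up after a whitespace tokenizer (e.g. DefaultTokenizerFactory) should reproduce the same integer sequences.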
This particular tokenizer isn't supported. By "word rank" it means "word index in a list sorted by frequency", I guess, so it should be very easy to add such a tokenizer, because the w2v routines require the same sorting for the Huffman tree.
P.S. It's a vectorizer in DL4J/DataVec terms; the tokenizer is still whitespace-based, as far as I can see. At the moment we have three vectorizers: BoW, TF-IDF, and w2v. This one could be the fourth.
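If someone picks this up, here is a dependency-free reference of the mapping such a fourth vectorizer would have to reproduce (whitespace splitting and lowercasing are assumptions chosen to match the Keras defaults, not an existing DataVec implementation):

```python
from collections import Counter

def build_index(corpus):
    # 1-based indices, most frequent word first -- the same frequency
    # ordering the w2v routines use when building the Huffman tree.
    counts = Counter(word.lower() for doc in corpus for word in doc.split())
    ranked = sorted(counts, key=counts.get, reverse=True)
    return {word: i + 1 for i, word in enumerate(ranked)}

def encode(text, index):
    # Unknown words are simply skipped, matching the Keras default.
    return [index[w.lower()] for w in text.split() if w.lower() in index]

index = build_index(["I really really really really like it", "like it", "like"])
print(encode("I like it", index))  # [4, 2, 3]
```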
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Trying to import an LSTM for text classification written in Keras (in the spirit of this article). The model import from Keras to DL4J is straightforward, but it isn't clear what the best way is to transfer the tokenizer (docs) to DL4J. Docs or code would help.