apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

[Clojure] - Provide support for Fasttext embedding in CNN Text Classification example #14118

Closed · gigasquid closed this 5 years ago

gigasquid commented 5 years ago

Right now the CNN text classification example provides support for glove and word2vec embeddings. It would be great to also provide support for BERT, giving users an example of how to integrate it into their code.

CNN Text Classification Example: https://github.com/apache/incubator-mxnet/tree/master/contrib/clojure-package/examples/cnn-text-classification

Reference implementation of BERT embedding for MXNet (python) https://github.com/imgarylai/bert-embedding

mxnet-label-bot commented 5 years ago

Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: Feature

gigasquid commented 5 years ago

This was originally for BERT, but @chouffe helped me understand that integrating that model is more complicated than I originally thought. So I'm changing it to fastText: https://fasttext.cc/

AlexChalk commented 5 years ago

This sounds like 'figure out how to use the lib and document it as code', so a good ticket for someone new to machine learning? I'll have a go if that's correct.

gigasquid commented 5 years ago

That's correct :) Give a shout if you have any questions or issues. The #clojure-mxnet Slack room is also good. See how to join here: http://mxnet.incubator.apache.org/versions/master/community/contribute.html

AlexChalk commented 5 years ago

Hi @gigasquid, sorry for the delay on this.

The fastText data format looks almost identical to glove, so with a few modifications (e.g. removing line 1 of the data), I think something as simple as this will work.
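
For illustration, here is a minimal sketch of what that modification might look like, assuming a line-oriented loader along the lines of the glove one in the example (the function name and shape here are hypothetical, not the example's actual code):

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as string])

;; Hypothetical helper, not the example's actual loader: fastText .vec files
;; look like glove files except for an extra "<num-words> <dim>" header line.
(defn load-fasttext-vec
  "Parse a fastText .vec file into a map of word -> vector of floats."
  [path]
  (with-open [rdr (io/reader path)]
    (->> (line-seq rdr)
         (drop 1)                                  ;; skip the fastText header line
         (map #(string/split (string/trim %) #"\s+"))
         (map (fn [[word & nums]]
                [word (mapv #(Float/parseFloat %) nums)]))
         (into {}))))
```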

However, I'm having trouble running the glove (and word2vec) examples off master (macOS Mojave). Can you repro this?

```
lein repl
(train-convnet {:embedding-size 50 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :glove})
=> Loading all the movie reviews from  data/mr-data
=> Loading the glove pre-trained word embeddings from  data/glove/glove.6B.50d.txt
=> Shuffling the data and splitting into training and test sets
=> {:sentence-count 2000, :sentence-size 62, :vocab-size 8078, :embedding-size 50, :pretrained-embedding :glove}
=> ClassCastException [Ljava.lang.Object; cannot be cast to [Lorg.apache.mxnet.Context;  org.apache.clojure-mxnet.module/module (module.clj:65)
```

gigasquid commented 5 years ago

@adc17 Sorry for the trouble. It looks like the code was refactored and the README instructions weren't updated. It requires a :devs key to tell it whether to run on CPU or GPU and how many devices to use - see the main code in the classifier for the correct usage.

From the repl you can use (train-convnet {:devs [(context/cpu 0)] :embedding-size 50 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :glove}) and it should work.
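
As an aside (a sketch, assuming a GPU build of MXNet, not something from this thread): the same call should work with GPU contexts, and :devs can list more than one device to split each batch across them.

```clojure
(require '[org.apache.clojure-mxnet.context :as context])

;; Sketch: same call as above, but on GPU contexts instead of the CPU.
;; Listing several devices in :devs splits each batch across them.
(train-convnet {:devs [(context/gpu 0)]            ;; or [(context/gpu 0) (context/gpu 1)]
                :embedding-size 50 :batch-size 100 :test-size 100
                :num-epoch 10 :max-examples 1000 :pretrained-embedding :glove})
```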

If you could update the documentation to help others in the future that would be great 😸

AlexChalk commented 5 years ago

No problem, sorry for not spotting this myself.

I should be able to submit a PR this weekend, and I'll update the docs at the same time.

AlexChalk commented 5 years ago

@gigasquid this will take longer than expected, as I'm running into OOMs.

OutOfMemoryError GC overhead limit exceeded  java.util.Arrays.copyOfRange (Arrays.java:3664)

The same thing happens for GloVe when I use the 200d+ word vectors (only without the stack trace).

Seeing as fastText only gives us 300d word vectors, I'm a long way off from successfully running them.

This kind of memory optimization is something I've never done before, so I'm now out of my depth in terms of making things work with fastText's pretrained .vec files.

I can look at parsing their binary training format in a similar way to what's currently done with word2vec (those were 300d and my system could use them), but again, I've never really done this before 😟.

For reference, I'm on a late 2016 MacBook Pro (4-core 2 GHz i5, 8 GB RAM).
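
For rough scale (an aside, not from the thread): 1M pretrained vectors at 300 dimensions is about 1,000,000 x 300 x 4 bytes ≈ 1.2 GB even as raw floats, and parsing them into boxed Clojure collections multiplies that several times over, so the JVM's default heap is easily exhausted. One possible mitigation, if the machine has the RAM for it, is to raise the heap limit in the example's project.clj; the sketch below is illustrative only (project name, version, and heap size are placeholders).

```clojure
;; Sketch of a possible mitigation: raise the JVM heap for the example
;; project via Leiningen's :jvm-opts (exact value is machine-dependent).
(defproject cnn-text-classification "0.1.0-SNAPSHOT"
  ;; ... existing dependencies and options ...
  :jvm-opts ["-Xmx6g"])
```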

AlexChalk commented 5 years ago

One workaround is to not use all 1M embeddings; I can just take the first 100K from the file. If that sounds ok, let me know and I'll submit a PR.
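
A sketch of that truncation, reusing the hypothetical loader shape from the earlier comment (again, not the example's actual code):

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as string])

;; Hypothetical sketch: only realize the first max-words embeddings from the
;; .vec file. The file is (as far as I know) ordered by descending word
;; frequency, so the first 100K rows still cover the most common words.
(defn load-fasttext-vec-truncated
  [path max-words]
  (with-open [rdr (io/reader path)]
    (->> (line-seq rdr)
         (drop 1)                    ;; skip the fastText header line
         (take max-words)            ;; e.g. 100000, to keep memory use bounded
         (map #(string/split (string/trim %) #"\s+"))
         (map (fn [[word & nums]]
                [word (mapv #(Float/parseFloat %) nums)]))
         (into {}))))
```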

AlexChalk commented 5 years ago

Scratch that, I've just discovered the 'wiki.simple' pretrained embeddings, which are small enough to run locally 🎊: https://github.com/apache/incubator-mxnet/pull/15340