Closed: NonaryR closed this issue 6 years ago
@aria42 Just a reminder, I'm still waiting for an answer.
@NonaryR Here is my understanding: flare does not provide initial training and computation of w2v, but looks up word embeddings from pre-trained models (e.g. the Stanford GloVe dataset). This brings benefits in both training speed and model performance.
If you really want an example with which you can train your own w2v on your corpus, some tweaks to the data generator in logistic_regression.clj (i.e. feeding *-gram generated training pairs instead of random data) could get you something close. The final activations of the trained model are the embeddings you need.
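For instance, here is a minimal sketch (plain Clojure, not flare's actual API) of the kind of data generator I mean: it turns a tokenized corpus into skip-gram `[center context]` pairs, which you could feed to the logistic regression example instead of random data. The function name and the `window` parameter are illustrative choices, not anything from flare.

```clojure
;; Hypothetical sketch: generate skip-gram training pairs from tokens.
;; Every word within `window` positions of a center word yields one
;; positive [center context] pair.
(defn skip-gram-pairs
  "Returns a lazy seq of [center-word context-word] pairs."
  [tokens window]
  (let [v (vec tokens)
        n (count v)]
    (for [i (range n)
          j (range (max 0 (- i window)) (min n (+ i window 1)))
          :when (not= i j)]
      [(v i) (v j)])))

(skip-gram-pairs ["the" "cat" "sat" "on" "the" "mat"] 2)
;; => (["the" "cat"] ["the" "sat"] ["cat" "the"] ["cat" "sat"] ...)
```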
If I get some time (unlikely) I can try to build this out, but given a source of text, this reduces to just training logistic regression if you do negative sampling, as Ronald suggested. The best way to learn is to do, so I'd absolutely welcome a PR.
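To spell out the reduction: with a center vector `u`, its true context vector `v`, and sampled negative context vectors `v'`, the negative-sampling loss is `-(log σ(u·v) + Σ log σ(-u·v'))`, i.e. binary logistic regression over real vs. sampled pairs. A hedged sketch in plain Clojure (all names here are illustrative, independent of flare's API):

```clojure
;; Hypothetical sketch of the per-pair negative-sampling objective.
(defn dot [u v] (reduce + (map * u v)))

(defn sigmoid [x] (/ 1.0 (+ 1.0 (Math/exp (- x)))))

(defn pair-loss
  "Negative log-likelihood for one center vector u, its true context
  vector v-pos, and a coll of sampled negative context vectors v-negs."
  [u v-pos v-negs]
  (- (+ (Math/log (sigmoid (dot u v-pos)))
        (reduce + (map #(Math/log (sigmoid (- (dot u %)))) v-negs)))))

;; Example: one positive pair, two negative samples.
(pair-loss [0.1 0.2] [0.3 -0.1] [[0.05 0.0] [-0.2 0.4]])
```

Minimizing this over all pairs (e.g. the ones a skip-gram generator produces) and reading off the learned center vectors gives you the embeddings.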
@ronaldyang w2v is just an algorithm, right? So there's nothing special about using only GloVe vectors; I use them very often, but sometimes the texts come from a specific domain and I need to retrain the vectors. More generally, w2v is just one implementation of an anything2vec algorithm with some tricks like negative sampling. I want the basic steps of the algorithm; the other tricks will be on my side.
Hello, @aria42! This isn't an issue, just a question. Could you please provide a w2v example? I'm asking because w2v is a really common algorithm, used in many fields, and a Clojure implementation would be super useful. w2v is also easier to understand than LSTM-based architectures. I really want to start using this library, but I don't quite understand all of the concepts yet, so maybe more examples would be helpful for people like me? Thank you!