Schwittleymani / ECO

Electronic Chaos Oracle
https://schwittlick.net/eco
Apache License 2.0

Setup Doc2Vec Experiments #178

Open schwittlick opened 7 years ago

schwittlick commented 7 years ago

Here are a few links on doc2vec: #135

schwittlick commented 7 years ago

Research doc2vec: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb

Found via https://rare-technologies.com/doc2vec-tutorial/

schwittlick commented 7 years ago

This module does text similarity analysis on blocks of text (documents) quite well: https://github.com/chrisjmccormick/simsearch

Seems super valuable for training models on the txt data and finding similar sentences.

One interesting approach could be to generate hundreds of thousands of lines via an RNN with different seeds for different purposes. For example:

These pre-generated lines would form the corpus in which we look for the answer most similar to the question typed into ECO. The similarity search via simsearch should be able to quickly find something that talks about a similar topic. It could be interesting to search the database of RNN-generated answers 50% of the time and, the other 50% of the time, search the original sentences parsed from the PDFs, returning the most similar sentence from whichever corpus was picked.
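A minimal sketch of the 50/50 lookup described above, assuming both corpora are already loaded as lists of sentences. It does not use simsearch: a plain bag-of-words cosine similarity stands in for the similarity search, and every name here (`most_similar`, `answer`, the sample sentences) is hypothetical.

```python
import math
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(question, corpus):
    """Return the corpus sentence with the highest similarity to the question."""
    query = Counter(tokenize(question))
    return max(corpus, key=lambda s: cosine(query, Counter(tokenize(s))))


def answer(question, generated, original, rng=random):
    """Pick one of the two corpora with equal probability, then search it."""
    corpus = generated if rng.random() < 0.5 else original
    return most_similar(question, corpus)


# Hypothetical sample data standing in for the RNN output and the PDF sentences.
generated_lines = ["the oracle dreams of electric turtles", "chaos is a kind of order"]
original_lines = ["turtles are reptiles with a shell", "order emerges from noise"]

print(answer("do you like turtles", generated_lines, original_lines))
```

Swapping `most_similar` for a simsearch or doc2vec query would leave the 50/50 selection in `answer` unchanged.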

schwittlick commented 7 years ago

compare:

  1. doc2vec similarities (https://radimrehurek.com/gensim/models/doc2vec.html)
  2. own implemented similarity (https://github.com/mrzl/ECO/blob/master/src/python/nlp/word2vec.py#L31)
  3. word2vec word sequence similarity (model.n_similarity(['i', 'like', 'brown', 'turtles'], ['i', 'like', 'dark', 'brown', 'turtles']))
  4. simsearch sentence similarity
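
For reference while comparing: as far as I understand it, option 3 boils down to the cosine similarity between the mean word vectors of the two word lists. The sketch below reproduces that calculation in plain Python on made-up 3-dimensional vectors. The `toy_vectors` table is an assumption for illustration only, not the output of a trained model, and this `n_similarity` is a stand-in function, not the gensim method itself.

```python
import math

# Made-up 3-dimensional vectors; a real model would supply learned embeddings.
toy_vectors = {
    "i": [0.1, 0.3, 0.0],
    "like": [0.2, 0.1, 0.4],
    "dark": [0.0, 0.5, 0.5],
    "brown": [0.1, 0.6, 0.4],
    "turtles": [0.9, 0.1, 0.2],
}


def mean_vector(words, vectors):
    """Average the vectors of the given words component-wise."""
    rows = [vectors[w] for w in words]
    return [sum(col) / len(rows) for col in zip(*rows)]


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def n_similarity(ws1, ws2, vectors):
    """Cosine similarity between the mean vectors of two word lists."""
    return cosine(mean_vector(ws1, vectors), mean_vector(ws2, vectors))


score = n_similarity(
    ["i", "like", "brown", "turtles"],
    ["i", "like", "dark", "brown", "turtles"],
    toy_vectors,
)
print(round(score, 3))
```

Because the vectors are averaged, word order is ignored, which is worth keeping in mind when comparing this against doc2vec. In recent gensim releases the real method is called as `model.wv.n_similarity(...)` rather than on the model directly.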

schwittlick commented 7 years ago

re-read this: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb