Setup Doc2Vec Experiments

schwittlick commented 7 years ago

Here are a few links on doc2vec: #135

schwittlick commented 7 years ago

Research doc2vec, starting with the gensim IMDB notebook: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb

Found via https://rare-technologies.com/doc2vec-tutorial/

schwittlick commented 7 years ago

This module does text similarity analysis on blocks of text (documents) pretty well: https://github.com/chrisjmccormick/simsearch

Seems super valuable for training models on the txt data and finding similar sentences.

One interesting approach could be to generate hundreds of thousands of lines via an RNN with different seeds for different purposes. For example:

10000 sentences beginning with 'Do you'
10000 sentences beginning with 'I don't know, but'
10000 sentences beginning with 'Do you say that' ... etc ... (find many different ways of answering a sentence)

These pre-generated lines would be the corpus in which we look for the answer most similar to the question typed into ECO. The similarity search via simsearch should be able to quickly find something that talks about a similar topic. It could be interesting to search the database of RNN-generated answers 50% of the time and the original sentences parsed from the PDFs the other 50% of the time, returning the most similar sentence from whichever set was picked.
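A rough sketch of that 50/50 lookup, just to pin the idea down. It does not use simsearch or doc2vec: the similarity here is a plain bag-of-words cosine swapped in as a stand-in so the example runs on the standard library alone. The function names (`most_similar`, `answer`) and the toy sentences are made up for illustration, not taken from the ECO code.

```python
import math
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(question, corpus):
    """Return the corpus sentence that scores highest against the question."""
    query = Counter(tokenize(question))
    return max(corpus, key=lambda s: cosine(query, Counter(tokenize(s))))


def answer(question, generated, parsed, rng=random):
    """Pick one of the two corpora with equal probability, then search it."""
    corpus = generated if rng.random() < 0.5 else parsed
    return most_similar(question, corpus)


# Hypothetical toy data standing in for the two corpora.
generated_lines = [
    "Do you think turtles dream at night",
    "I don't know, but the sea is very deep",
]
parsed_lines = [
    "Brown turtles live near the river",
    "The archive contains many old letters",
]

print(answer("do you like brown turtles", generated_lines, parsed_lines))
```

Swapping `most_similar` for a simsearch or doc2vec query should leave `answer` unchanged, since the coin flip only decides which corpus gets searched.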

schwittlick commented 7 years ago

compare:

doc2vec similarities (https://radimrehurek.com/gensim/models/doc2vec.html)
our own similarity implementation (https://github.com/mrzl/ECO/blob/master/src/python/nlp/word2vec.py#L31)
word2vec word sequence similarity (model.n_similarity(['i', 'like', 'brown', 'turtles'], ['i', 'like', 'dark', 'brown', 'turtles']))
simsearch sentence similarity
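For reference when comparing, here is a small sketch of what the word2vec word-sequence similarity boils down to, as far as I understand gensim's `n_similarity`: average the word vectors of each list, then take the cosine between the two averages. The 3-dimensional vectors below are invented for illustration; real ones would come from a trained model, and `n_similarity_sketch` is a hypothetical name, not a gensim function.

```python
import math


def mean_vector(words, vectors):
    """Average the vectors of the given words, component by component."""
    rows = [vectors[w] for w in words]
    return [sum(column) / len(rows) for column in zip(*rows)]


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def n_similarity_sketch(words_a, words_b, vectors):
    """Cosine between the mean vectors of two word lists."""
    return cosine(mean_vector(words_a, vectors), mean_vector(words_b, vectors))


# Hypothetical toy embeddings, not taken from any trained model.
toy_vectors = {
    "i": [0.1, 0.3, 0.0],
    "like": [0.4, 0.1, 0.2],
    "brown": [0.0, 0.5, 0.6],
    "dark": [0.1, 0.4, 0.7],
    "turtles": [0.7, 0.2, 0.1],
}

score = n_similarity_sketch(
    ["i", "like", "brown", "turtles"],
    ["i", "like", "dark", "brown", "turtles"],
    toy_vectors,
)
print(round(score, 3))
```

Because the vectors are averaged, word order is ignored, which is one thing to keep in mind when comparing this against doc2vec and simsearch on full sentences.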

schwittlick commented 7 years ago

Re-read this: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb

Schwittleymani / ECO

Setup Doc2Vec Experiments #178