Jaccard Similarity
Different embeddings + K-means
Different embeddings + Cosine Similarity
Word2Vec + Smooth Inverse Frequency + Cosine Similarity
Different embeddings + LSI + Cosine Similarity
Different embeddings + LDA + Jensen-Shannon distance
Different embeddings + Word Mover's Distance
Different embeddings + Variational Autoencoder (VAE)
Different embeddings + Universal Sentence Encoder
Different embeddings + Siamese Manhattan LSTM
Knowledge-based Measures
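Most of these measures need embeddings or a trained model; the first one is purely set-based. A minimal Python sketch of Jaccard similarity over token sets (the sample sentences are made up):

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity = |intersection| / |union| of the two token sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# 4 shared tokens out of 6 distinct tokens -> 0.666...
print(jaccard_similarity("the cat sat on the mat", "the cat lay on the mat"))
```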
[Incredible !!! Collection of NLP notebooks] : https://github.com/nlptown/nlp-notebooks
[Tutorial: WMD in a Jupyter notebook] : https://github.com/makcedward/nlp/blob/master/sample/nlp-word_mover_distance.ipynb
[Word Mover's Distance] : https://www.kaggle.com/ankitswarnkar/word-embedding-using-glove-vector
[lstm-gru-sentiment-analysis] : https://github.com/javaidnabi31/Word-Embeddding-Sentiment-Classification
[ELMo: Contextual language embedding] : https://towardsdatascience.com/elmo-contextual-language-embedding-335de2268604
[Learning Word Embedding (Mathematics)] : https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html
[A Beginner’s Aha Moments for Word2Vec] : https://yidatao.github.io/2017-08-03/word2vec-aha/
[GloVe, Word2Vec, fastText classes] : https://github.com/makcedward/nlp/blob/master/sample/nlp-word_embedding.ipynb
[!!! Very nice tutorial about how word2vec works] : https://towardsdatascience.com/word2vec-made-easy-139a31a4b8ae
[!!! An implementation guide to Word2Vec using NumPy and Google Sheets] : https://towardsdatascience.com/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281
[WordRank embedding: “crowned” is most similar to “king”, not word2vec’s “Canute”] : https://rare-technologies.com/wordrank-embedding-crowned-is-most-similar-to-king-not-word2vecs-canute/
[Nice !!! SwiftNLC model notebooks] : https://github.com/JacopoMangiavacchi/SwiftNLC/tree/master/ModelNotebooks
[Tutorial from ENSAE] : http://www.xavierdupre.fr/app/papierstat/helpsphinx/notebooks/text_sentiment_wordvec.html#les-donnees
[CoreML with GloVe Word Embedding and Recursive Neural Network - nice tutorial] : https://medium.com/@JMangia/coreml-with-glove-word-embedding-and-recursive-neural-network-part-2-ab238ca90970
[Big Benchmark of sentence-similarity methods] : http://nlp.town/blog/sentence-similarity/
InferSent (INF) = a pre-trained sentence encoder developed by Facebook Research. It is a BiLSTM with max pooling, trained on the SNLI dataset (570k English sentence pairs, each labelled as entailment, contradiction, or neutral).
GSE (Google Sentence Encoder) = Google's answer to Facebook's InferSent. It comes in two forms: a Transformer-based encoder (more accurate, but heavier) and a Deep Averaging Network (DAN; faster, slightly less accurate).
====-> Both are typically evaluated with the Pearson correlation between predicted similarities and human judgments.
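A minimal sketch of scoring sentence similarity with the Universal Sentence Encoder loaded from TensorFlow Hub (the module URL is the publicly documented DAN variant; assumes `tensorflow` and `tensorflow_hub` are installed):

```python
import numpy as np
import tensorflow_hub as hub

# Load the DAN-based Universal Sentence Encoder from TF Hub.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = ["How old are you?", "What is your age?"]
vectors = embed(sentences).numpy()  # shape: (2, 512)

# Cosine similarity between the two sentence vectors.
cos = np.dot(vectors[0], vectors[1]) / (
    np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1])
)
print(round(float(cos), 3))
```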
[How to predict Quora Question Pairs using Siamese Manhattan LSTM] : https://medium.com/mlreview/implementing-malstm-on-kaggles-quora-question-pairs-competition-8b31b0b16a07
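A hedged Keras sketch of the MaLSTM idea from the article above: a shared LSTM encodes both questions, and similarity is exp(-L1 distance) between the final hidden states. All sizes are illustrative; in practice the embedding layer would be initialized from pre-trained Word2Vec weights:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Illustrative sizes; real values depend on the vocabulary and padding length.
vocab_size, embed_dim, max_len, hidden = 20000, 300, 40, 50

left = layers.Input(shape=(max_len,), dtype="int32")
right = layers.Input(shape=(max_len,), dtype="int32")

# Shared embedding + shared LSTM = the "Siamese" part.
embedding = layers.Embedding(vocab_size, embed_dim)  # init from Word2Vec in practice
shared_lstm = layers.LSTM(hidden)
h_left = shared_lstm(embedding(left))
h_right = shared_lstm(embedding(right))

# MaLSTM similarity: exp(-||h_left - h_right||_1), which lies in (0, 1].
similarity = layers.Lambda(
    lambda t: tf.exp(-tf.reduce_sum(tf.abs(t[0] - t[1]), axis=1, keepdims=True))
)([h_left, h_right])

model = Model(inputs=[left, right], outputs=similarity)
model.compile(loss="mean_squared_error", optimizer="adam")
```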
[Latent Semantic Indexing (LSI) - An Example with mathematics] : http://www1.se.cuhk.edu.hk/~seem5680/lecture/LSI-Eg.pdf
[Finding similar documents with Word2Vec and WMD] : https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html
[Cosine Similarity] : https://www.machinelearningplus.com/nlp/cosine-similarity/
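A short scikit-learn sketch of cosine similarity over TF-IDF vectors (the two sample documents are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "the dog sat on the log"]
tfidf = TfidfVectorizer().fit_transform(docs)  # sparse (2, vocab) matrix
print(cosine_similarity(tfidf[0], tfidf[1]))   # value in [0, 1]
```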
[Tutorial on LSI] : http://poloclub.gatech.edu/cse6242/2018spring/slides/CSE6242-820-TextAlgorithms.pdf
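A compact sketch of the LSI workflow with gensim (the toy corpus and `num_topics=2` are illustrative):

```python
from gensim import corpora, models, similarities

docs = [["human", "computer", "interaction"],
        ["graph", "trees", "minors"],
        ["graph", "minors", "survey"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Project the bag-of-words corpus into a 2-dimensional latent semantic space.
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

# Rank documents by cosine similarity to a query in LSI space.
index = similarities.MatrixSimilarity(lsi[corpus])
query = lsi[dictionary.doc2bow(["graph", "survey"])]
print(sorted(enumerate(index[query]), key=lambda x: -x[1]))
```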
[Earth Mover's Distance: flow example] : http://robotics.stanford.edu/~scohen/research/emdg/emdg.html#flow_eqw_notopt
[Earth Mover's Distance slides (Rubner)] : http://robotics.stanford.edu/~rubner/slides/sld014.htm
[Document Similarity with Word Mover's Distance] : http://jxieeducation.com/2016-06-13/Document-Similarity-With-Word-Movers-Distance/
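A minimal Word Mover's Distance sketch with gensim, essentially the Obama/president example from the WMD tutorial linked above (downloads pre-trained GloVe vectors on first run; gensim's `wmdistance` also needs an optimal-transport backend such as POT or pyemd, depending on the version):

```python
import gensim.downloader as api

# Load pre-trained word vectors (a sizeable download the first time).
vectors = api.load("glove-wiki-gigaword-50")

doc1 = "obama speaks to the media in illinois".split()
doc2 = "the president greets the press in chicago".split()

# WMD: minimum cumulative distance the words of doc1 must "travel"
# in embedding space to become the words of doc2.
print(vectors.wmdistance(doc1, doc2))
```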
[Beyond Cosine Similarity -> Jensen-Shannon + Hypothesis Test] : http://stefansavev.com/blog/beyond-cosine-similarity/
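A quick SciPy sketch of the Jensen-Shannon distance between two distributions (e.g. LDA topic distributions); note that `scipy.spatial.distance.jensenshannon` returns the square root of the JS divergence:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Two hypothetical topic distributions over 4 topics (each sums to 1).
p = np.array([0.7, 0.2, 0.05, 0.05])
q = np.array([0.1, 0.6, 0.2, 0.1])

print(jensenshannon(p, q, base=2))  # 0 = identical, 1 = maximally different (base 2)
```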
[Great resources with MANY MANY notebooks] : https://www.renom.jp/index.html?c=tutorial
[Optimal Transport, a Swiss Army Knife for Data Science] : https://weave.eu/le-transport-optimal-un-couteau-suisse-pour-la-data-science/
[BEST TUTORIAL variational-autoencoders] : https://www.jeremyjordan.me/variational-autoencoders/
[BEST TUTORIAL : Earth Mover's Distance !!!!!] : https://jeremykun.com/2018/03/05/earthmover-distance/
Problem: compute the distance between points with uncertain locations (given by samples, differing observations, or clusters).
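In one dimension SciPy ships this as `scipy.stats.wasserstein_distance`; a tiny worked example with made-up samples:

```python
from scipy.stats import wasserstein_distance

# Two sets of 1-D samples standing in for uncertain point locations.
u = [0.0, 1.0, 3.0]
v = [5.0, 6.0, 8.0]

# Minimum "work" to move the mass of u onto v; here every sample
# shifts by exactly 5.0, so the distance is 5.0.
print(wasserstein_distance(u, v))  # 5.0
```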
[Introduction to Wasserstein metric (earth mover’s distance) -> Mathematics]: https://yoo2080.wordpress.com/2015/04/09/introduction-to-wasserstein-metric-earth-movers-distance/
[Word Mover’s distance calculation between word pairs of two documents] : https://stats.stackexchange.com/questions/303050/word-movers-distance-calculation-between-word-pairs-of-two-documents
[WMD + Word2Vec] : https://github.com/stephenhky/PyWMD/blob/master/WordMoverDistanceDemo.ipynb
[Book on Optimal Transport: Computational Optimal Transport (Peyré & Cuturi)] : https://optimaltransport.github.io/pdf/ComputationalOT.pdf
[NICE !!!!!! How Autoencoders work - Understanding the math and implementation] : https://www.kaggle.com/shivamb/how-autoencoders-work-intro-and-usecases
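To make the autoencoder idea concrete, a bare-bones Keras sketch (the 784-unit input assumes flattened 28x28 images, e.g. MNIST):

```python
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(784,))                      # flattened 28x28 image
code = layers.Dense(32, activation="relu")(inputs)       # encoder: 784 -> 32
outputs = layers.Dense(784, activation="sigmoid")(code)  # decoder: 32 -> 784

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)  # learn to reconstruct
```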
[Word2Vec to convert each question into a semantic vector, then stack a Siamese network on top to detect whether the pair is a duplicate] : http://www.erogol.com/duplicate-question-detection-deep-learning/
[Amazing !!!] : https://github.com/makcedward/nlp
Text Representation:
1. Traditional Method
2. Character Level
3. Word Level
4. Sentence Level