Closed: Colin-Codes closed this issue 5 years ago
Key phrases like knowledge discovery / search seem promising
Potential approach - featurization / vectorisation of text data
https://medium.com/@paritosh_30025/natural-language-processing-text-data-vectorization-af2520529cf7
Could be used to map new question onto old questions, which are then linked to answers.
To convert string data into numerical data, one can use the following methods:
· Bag of words - simplistic but probably ideal
· TF-IDF - gives less weight to frequent words
· Word2Vec - creates a dense embedding for each word
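A minimal sketch of the first two options using scikit-learn (the library choice is an assumption; the issue only names the techniques). Word2Vec needs a trained embedding model (e.g. via gensim), so it is omitted here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy question corpus standing in for the historic questions.
questions = [
    "how do I reset my password",
    "password reset is not working",
    "how do I change my email address",
]

# Bag of words: raw token counts per question.
bow = CountVectorizer()
bow_matrix = bow.fit_transform(questions)

# TF-IDF: same matrix shape, but tokens appearing in many questions
# are down-weighted.
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(questions)

print(bow_matrix.shape, tfidf_matrix.shape)
```

Either matrix can then feed the clustering or mapping steps below.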
Factorization machines are good for creating rankings of recommendations
https://www.analyticsvidhya.com/blog/2018/01/factorization-machines/
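For reference, a factorization machine scores a feature vector as a global bias plus linear weights plus pairwise interactions via latent factors. A hedged numpy sketch with random, untrained placeholder weights (the features here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, k = 6, 3          # e.g. one-hot user / query / article features
w0 = 0.1                      # global bias
w = rng.normal(size=n_features)        # linear weights
V = rng.normal(size=(n_features, k))   # latent factors for pairwise terms

def fm_score(x):
    """FM prediction: bias + linear term + pairwise interactions,
    using the O(k*n) reformulation of the pairwise sum."""
    linear = w @ x
    pairwise = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return w0 + linear + pairwise

x = np.array([1.0, 0, 0, 1.0, 0, 1.0])  # sparse interaction example
print(fm_score(x))
```

In the proposed system, the weights would be learned from user-behaviour signals and the resulting scores stored per query ID.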
2 processes:
Displaying results:
· Find which cluster the query fits in (BOW or TF-IDF / k-means)
· Get average article scores for queries in this cluster
· Display in descending order of score
Training the results:
· Use a factorisation machine to score the articles based on user behaviour (clicks, back button, time, etc.)
· Store the scores against the query ID
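The "displaying results" flow can be sketched end to end: TF-IDF plus k-means picks a cluster for a new query, then articles are ranked by the average scores stored for that cluster. All data, article IDs, and scores below are hypothetical placeholders; scikit-learn is an assumed library choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

historic_queries = [
    "reset my password", "forgot password", "password not working",
    "update billing details", "change payment card", "billing question",
]
# Hypothetical stored output of the training step:
# cluster id -> {article id: average score}.
cluster_scores = {0: {"kb-12": 0.9, "kb-7": 0.4}, 1: {"kb-3": 0.8, "kb-7": 0.6}}

vec = TfidfVectorizer()
X = vec.fit_transform(historic_queries)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

def rank_articles(query):
    """Assign the query to a cluster, then return that cluster's
    articles in descending order of stored score."""
    cluster = int(km.predict(vec.transform([query]))[0])
    scores = cluster_scores[cluster]
    return sorted(scores, key=scores.get, reverse=True)

print(rank_articles("how to reset a password"))
```

In a real system the scores would come from the factorisation-machine training step rather than a hard-coded dict.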
Question Answering Models
Research papers:
https://www.sciencedirect.com/science/article/pii/S2212017313005409 - good overview, classification, history
https://www.researchgate.net/publication/221020490_Knowledge-Based_Question_Answering - constructing a knowledge base from documents
https://www.aclweb.org/anthology/W18-3105 - introduces answer selection
Blogs:
https://medium.com/@japneet121/word-vectorization-using-glove-76919685ee0b
https://towardsdatascience.com/building-a-question-answering-system-part-1-9388aadff507
https://towardsdatascience.com/automatic-question-answering-ac7593432842
https://towardsdatascience.com/nlp-building-a-question-answering-model-ed0529a68c54
Linguistic approach - good for answering short, domain-specific queries
Statistical approach
Template approach
Pattern matching
Uses historic question-answering data, not just analysis of the documentation
Comparison of systems:
Question-Question Mapping Systems - built up from historic questions, with data pre-processed as questions
vs
Question-Answer Mapping systems
Recurrent vs feedforward - the RNN uses internal memory to process sequences of inputs
https://en.wikipedia.org/wiki/Recurrent_neural_network - history and context
https://medium.com/paper-club/grus-vs-lstms-e9d8e2484848 - not conclusive
https://www.aclweb.org/anthology/C18-1181 - deep learning and answer selection
https://arxiv.org/pdf/1412.3555v1.pdf - Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling (also evaluates LSTMs)
https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf - R-NET, an impressive implementation that previously led the SQuAD leaderboard
replication here: https://yerevann.github.io/2017/08/25/challenges-of-reproducing-r-net-neural-network-using-keras/
GRUs seem to perform better with less data, but LSTMs can pull ahead given enough data.
https://medium.com/mlrecipies/deep-learning-basics-gated-recurrent-unit-gru-1d8e9fae7280
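To make the "internal memory" point concrete, here is a toy numpy GRU cell: the hidden state h carries information across time steps, gated by update and reset gates. Weights are random placeholders, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 4, 5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix per gate, acting on [input, previous hidden state].
Wz = rng.normal(scale=0.1, size=(d_h, d_in + d_h))  # update gate
Wr = rng.normal(scale=0.1, size=(d_h, d_in + d_h))  # reset gate
Wh = rng.normal(scale=0.1, size=(d_h, d_in + d_h))  # candidate state

def gru_step(h, x):
    xh = np.concatenate([x, h])
    z = sigmoid(Wz @ xh)                                # how much to update
    r = sigmoid(Wr @ xh)                                # how much past to keep
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h]))  # candidate memory
    return (1 - z) * h + z * h_tilde                    # blend old and new

h = np.zeros(d_h)
for x in rng.normal(size=(7, d_in)):  # a sequence of 7 input vectors
    h = gru_step(h, x)
print(h)  # final hidden state summarises the whole sequence
```

An LSTM adds a separate cell state and an output gate on top of this scheme, which is where the extra capacity (and parameter count) comes from.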
Denoising autoencoders for pre-processing: https://towardsdatascience.com/denoising-autoencoders-explained-dbb82467fc2
DAE stands for denoising autoencoder; useful for data augmentation and domain adaptation
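The core denoising-autoencoder idea in a tiny numpy sketch: corrupt the input, then train the network to reconstruct the clean version. A single linear layer each way keeps it short; real DAEs use deeper non-linear encoders, and all data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))                # clean "feature vectors"

W_enc = rng.normal(scale=0.1, size=(8, 4))   # encoder to a 4-dim code
W_dec = rng.normal(scale=0.1, size=(4, 8))   # decoder back to 8 dims
W_enc0, W_dec0 = W_enc.copy(), W_dec.copy()  # kept to compare before/after
lr = 0.01

for _ in range(500):
    noisy = X + rng.normal(scale=0.3, size=X.shape)  # corrupt the input
    code = noisy @ W_enc
    recon = code @ W_dec
    err = recon - X                  # target is the CLEAN input
    # plain gradient descent on mean squared reconstruction error
    grad_dec = code.T @ err / len(X)
    grad_enc = noisy.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

mse = np.mean((X @ W_enc @ W_dec - X) ** 2)
print(mse)  # reconstruction error on clean inputs
```

The learned encoder can then serve as a noise-robust pre-processing step for downstream models.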
https://rajpurkar.github.io/SQuAD-explorer/ - Stanford Question Answering Dataset (SQuAD)
https://arxiv.org/abs/1611.01603 - Bidirectional Attention Flow for Machine Comprehension (BiDAF)
https://arxiv.org/abs/1706.03762 - Attention Is All You Need (the Transformer paper)
Question answering with transformers https://web.stanford.edu/class/cs224n/reports/default/15782330.pdf
Related to #3
Find some research papers to draw initial inspiration from.