Closed: Colin-Codes closed this issue 5 years ago
Key phrases like knowledge discovery / search seem promising
Potential approach - featurization / vectorisation of text data
https://medium.com/@paritosh_30025/natural-language-processing-text-data-vectorization-af2520529cf7
Could be used to map new question onto old questions, which are then linked to answers.
To convert string data into numerical data, one can use the following methods:
· Bag of words - simplistic but probably ideal
· TF-IDF - gives less weight to frequent words
· Word2Vec - creates a dense embedding for each word
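A minimal sketch of the first two options using scikit-learn (the library choice is an assumption; the issue only names the techniques). Word2Vec needs a trained embedding model (e.g. via gensim), so it is omitted here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy question corpus standing in for the historic questions.
questions = [
    "how do I reset my password",
    "password reset is not working",
    "how do I change my email address",
]

# Bag of words: raw token counts per question.
bow = CountVectorizer()
bow_matrix = bow.fit_transform(questions)

# TF-IDF: same matrix shape, but tokens appearing in many questions
# are down-weighted.
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(questions)

print(bow_matrix.shape, tfidf_matrix.shape)
```

Either matrix can then feed the clustering or mapping steps below.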
Factorization machines are good for creating rankings of recommendations
https://www.analyticsvidhya.com/blog/2018/01/factorization-machines/
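For reference, a factorization machine scores a feature vector as a global bias plus linear weights plus pairwise interactions via latent factors. A hedged numpy sketch with random, untrained placeholder weights (the features here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, k = 6, 3          # e.g. one-hot user / query / article features
w0 = 0.1                      # global bias
w = rng.normal(size=n_features)        # linear weights
V = rng.normal(size=(n_features, k))   # latent factors for pairwise terms

def fm_score(x):
    """FM prediction: bias + linear term + pairwise interactions,
    using the O(k*n) reformulation of the pairwise sum."""
    linear = w @ x
    pairwise = 0.5 * np.sum((V.T @ x) ** 2 - (V.T ** 2) @ (x ** 2))
    return w0 + linear + pairwise

x = np.array([1.0, 0, 0, 1.0, 0, 1.0])  # sparse interaction example
print(fm_score(x))
```

In the proposed system, the weights would be learned from user-behaviour signals and the resulting scores stored per query ID.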
2 processes:
Displaying results:
· Find which cluster the query fits in (BOW or TF-IDF / k-means)
· Get average article scores for queries in this cluster
· Display in descending order of score
Training the results:
· Use a factorisation machine to score the articles based on user behaviour (clicks, back button, time, etc.)
· Store the scores against the query ID
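The "displaying results" flow can be sketched end to end: TF-IDF plus k-means picks a cluster for a new query, then articles are ranked by the average scores stored for that cluster. All data, article IDs, and scores below are hypothetical placeholders; scikit-learn is an assumed library choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

historic_queries = [
    "reset my password", "forgot password", "password not working",
    "update billing details", "change payment card", "billing question",
]
# Hypothetical stored output of the training step:
# cluster id -> {article id: average score}.
cluster_scores = {0: {"kb-12": 0.9, "kb-7": 0.4}, 1: {"kb-3": 0.8, "kb-7": 0.6}}

vec = TfidfVectorizer()
X = vec.fit_transform(historic_queries)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

def rank_articles(query):
    """Assign the query to a cluster, then return that cluster's
    articles in descending order of stored score."""
    cluster = int(km.predict(vec.transform([query]))[0])
    scores = cluster_scores[cluster]
    return sorted(scores, key=scores.get, reverse=True)

print(rank_articles("how to reset a password"))
```

In a real system the scores would come from the factorisation-machine training step rather than a hard-coded dict.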
Question Answering Models
Research papers:
https://www.sciencedirect.com/science/article/pii/S2212017313005409 - good overview, classification, history
https://www.researchgate.net/publication/221020490_Knowledge-Based_Question_Answering - constructing a knowledge base from documents
https://www.aclweb.org/anthology/W18-3105 - introduces answer selection
Blogs:
https://medium.com/@japneet121/word-vectorization-using-glove-76919685ee0b
https://towardsdatascience.com/building-a-question-answering-system-part-1-9388aadff507
https://towardsdatascience.com/automatic-question-answering-ac7593432842
https://towardsdatascience.com/nlp-building-a-question-answering-model-ed0529a68c54
Linguistic approach - good for answering short, domain-specific queries
Statistical approach
Template approach
Pattern matching
Uses historic question-answering data, not just analysis of the documentation
Comparison of systems:
Question-Question Mapping Systems - built up from historic questions, with data pre-processed as questions
vs
Question-Answer Mapping systems
Recurrent vs feedforward - the RNN uses internal memory to process sequences of inputs
https://en.wikipedia.org/wiki/Recurrent_neural_network - history and context
https://medium.com/paper-club/grus-vs-lstms-e9d8e2484848 - not conclusive
https://www.aclweb.org/anthology/C18-1181 - deep learning and answer selection
https://arxiv.org/pdf/1412.3555v1.pdf - Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling (also evaluates LSTMs)
https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf - R-NET, an impressive implementation that previously led the SQuAD leaderboard
replication here: https://yerevann.github.io/2017/08/25/challenges-of-reproducing-r-net-neural-network-using-keras/
GRUs seem to perform better with less data, but LSTMs can pull ahead given enough data.
https://medium.com/mlrecipies/deep-learning-basics-gated-recurrent-unit-gru-1d8e9fae7280
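To make the "internal memory" point concrete, here is a toy numpy GRU cell: the hidden state h carries information across time steps, gated by update and reset gates. Weights are random placeholders, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 4, 5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix per gate, acting on [input, previous hidden state].
Wz = rng.normal(scale=0.1, size=(d_h, d_in + d_h))  # update gate
Wr = rng.normal(scale=0.1, size=(d_h, d_in + d_h))  # reset gate
Wh = rng.normal(scale=0.1, size=(d_h, d_in + d_h))  # candidate state

def gru_step(h, x):
    xh = np.concatenate([x, h])
    z = sigmoid(Wz @ xh)                                # how much to update
    r = sigmoid(Wr @ xh)                                # how much past to keep
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h]))  # candidate memory
    return (1 - z) * h + z * h_tilde                    # blend old and new

h = np.zeros(d_h)
for x in rng.normal(size=(7, d_in)):  # a sequence of 7 input vectors
    h = gru_step(h, x)
print(h)  # final hidden state summarises the whole sequence
```

An LSTM adds a separate cell state and an output gate on top of this scheme, which is where the extra capacity (and parameter count) comes from.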
Denoising autoencoders for pre-processing: https://towardsdatascience.com/denoising-autoencoders-explained-dbb82467fc2
DAE stands for denoising autoencoder; useful for data augmentation and domain adaptation
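The core denoising-autoencoder idea in a tiny numpy sketch: corrupt the input, then train the network to reconstruct the clean version. A single linear layer each way keeps it short; real DAEs use deeper non-linear encoders, and all data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))                # clean "feature vectors"

W_enc = rng.normal(scale=0.1, size=(8, 4))   # encoder to a 4-dim code
W_dec = rng.normal(scale=0.1, size=(4, 8))   # decoder back to 8 dims
W_enc0, W_dec0 = W_enc.copy(), W_dec.copy()  # kept to compare before/after
lr = 0.01

for _ in range(500):
    noisy = X + rng.normal(scale=0.3, size=X.shape)  # corrupt the input
    code = noisy @ W_enc
    recon = code @ W_dec
    err = recon - X                  # target is the CLEAN input
    # plain gradient descent on mean squared reconstruction error
    grad_dec = code.T @ err / len(X)
    grad_enc = noisy.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

mse = np.mean((X @ W_enc @ W_dec - X) ** 2)
print(mse)  # reconstruction error on clean inputs
```

The learned encoder can then serve as a noise-robust pre-processing step for downstream models.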
https://rajpurkar.github.io/SQuAD-explorer/ - Stanford Question Answering Dataset (SQuAD)
https://arxiv.org/abs/1611.01603 - Bidirectional Attention Flow for Machine Comprehension (BiDAF)
https://arxiv.org/abs/1706.03762 - Attention Is All You Need (the Transformer paper)
Question answering with transformers https://web.stanford.edu/class/cs224n/reports/default/15782330.pdf
Related to #3
Find some research papers to draw initial inspiration from.