Open vdpappu opened 5 years ago
Siamese networks need supervised (labelled) pairs, and generating such pairs through a semi-supervised approach is not feasible here. An alternative is to compute similarity only between the noun phrases and/or informative verbs in the sentences, which reduces noise by discarding the contributions of filler words.
Initial approach: https://github.com/etherlabsio/hinton/tree/experiment/sentence_relatedness/sentence_relatedness
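The noun-phrase idea above can be sketched as follows. This is a minimal illustration, not the repo's implementation: the embedding table is a toy stand-in, and a real pipeline would use a noun-phrase extractor (e.g. spaCy's noun chunks) with pretrained vectors.

```python
import math

# Hypothetical toy word vectors (stand-ins for pretrained embeddings).
EMBEDDINGS = {
    "meeting": [0.9, 0.1, 0.0],
    "agenda":  [0.8, 0.2, 0.1],
    "weather": [0.0, 0.1, 0.9],
}

def embed_phrases(phrases):
    """Average the vectors of the extracted key phrases, skipping unknowns."""
    vecs = [EMBEDDINGS[p] for p in phrases if p in EMBEDDINGS]
    if not vecs:
        return [0.0] * 3
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def phrase_similarity(phrases_a, phrases_b):
    """Similarity over key phrases only, so filler words contribute nothing."""
    return cosine(embed_phrases(phrases_a), embed_phrases(phrases_b))

print(phrase_similarity(["meeting", "agenda"], ["meeting"]))  # topically close
print(phrase_similarity(["meeting", "agenda"], ["weather"]))  # topically distant
```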
A few issues to address:
Analyzing key-phrase scores across different layers to find the layer combination that gives stable results
Worked on candidate-KP-based similarity: https://github.com/etherlabsio/hinton/tree/key-phrase_scorer. Currently benchmarking on the open-source Quora Question Pairs dataset (https://www.kaggle.com/c/quora-question-pairs) to quantify the performance gains.
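Benchmarking a similarity scorer on QQP-style labelled pairs boils down to sweeping a decision threshold and measuring accuracy. A rough sketch, where the two-pair dataset and the word-overlap (Jaccard) scorer are hypothetical stand-ins for the real scorer and dataset:

```python
def jaccard(s1, s2):
    """Toy similarity scorer: word-set overlap between two sentences."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

# QQP-style labelled pairs: label 1 = same intent, 0 = different intent.
PAIRS = [
    ("how do I learn python", "what is the best way to learn python", 1),
    ("how do I learn python", "how old is the earth", 0),
]

def accuracy(pairs, scorer, threshold):
    """Fraction of pairs where thresholded similarity matches the label."""
    hits = sum((scorer(a, b) >= threshold) == bool(y) for a, b, y in pairs)
    return hits / len(pairs)

def sweep_threshold(pairs, scorer):
    """Pick the cut-off that maximises accuracy on held-out pairs."""
    candidates = [i / 20 for i in range(1, 20)]
    return max(candidates, key=lambda t: accuracy(pairs, scorer, t))

best = sweep_threshold(PAIRS, jaccard)
print(best, accuracy(PAIRS, jaccard, best))
```

The same harness works for any scorer (cosine over KP embeddings included) by swapping out `jaccard`.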
QQP is not the right dataset for cosine similarity. I will be analyzing the results on: http://alt.qcri.org/semeval2014/task3/index.php?id=data-and-tools
Another dataset used for benchmarking similarity tasks: http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark
Currently, we use cosine similarity as the similarity metric. With complex architectures like BERT, it may not be effective, since the objectives used for pre-training or fine-tuning do not directly reflect sentence relatedness without a labelled dataset. A Siamese network (https://www.youtube.com/watch?v=6jfw8MuKwpI) provides an effective alternative for similarity tasks: it involves training a small network whose output highlights the similarity/dissimilarity between two inputs.
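The core of the Siamese setup is a single shared encoder applied to both inputs plus a contrastive loss. A minimal sketch with toy, untrained weights (a real system would train the encoder on labelled pairs, e.g. with gradient descent):

```python
import math

# Shared encoder weights: the SAME matrix is applied to both inputs,
# which is what makes the network "Siamese". Toy, untrained values.
W = [[0.5, -0.2], [0.1, 0.7]]

def encode(x):
    """Shared encoder: a single linear layer."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def contrastive_loss(x1, x2, label, margin=1.0):
    """label=1 -> similar pair (pull embeddings together);
    label=0 -> dissimilar pair (push apart up to the margin)."""
    d = euclidean(encode(x1), encode(x2))
    if label == 1:
        return d ** 2
    return max(0.0, margin - d) ** 2

print(contrastive_loss([1.0, 0.0], [1.0, 0.1], 1))  # similar pair
print(contrastive_loss([1.0, 0.0], [0.0, 1.0], 0))  # dissimilar pair
```

Minimizing this loss over labelled pairs shapes the embedding space so that plain distance (or cosine) between encoder outputs reflects relatedness, which is exactly what raw BERT activations do not guarantee.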