Alternative for Cosine Similarity

etherlabsio / ai-engine

Core AI services and functions powering the ETHER Platform

MIT License

Alternative for Cosine Similarity #82

Open vdpappu opened 5 years ago

vdpappu commented 5 years ago

Currently, we use cosine similarity as the similarity metric. With complex architectures like BERT, it may not be effective, because the objective functions used for pre-training or fine-tuning do not directly reflect sentence relatedness without a labelled dataset. A Siamese network (https://www.youtube.com/watch?v=6jfw8MuKwpI) provides an effective alternative for similarity tasks. This involves training a small network whose outputs highlight the similarity or dissimilarity between two inputs.
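As a point of reference, the cosine-similarity baseline discussed here can be sketched in a few lines of plain Python. This is only an illustration, not the code used in ai-engine; the function name `cosine_similarity` and the convention of returning 0.0 for a zero vector are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: a zero vector is treated as unrelated to everything.
        return 0.0
    return dot / (norm_u * norm_v)
```

The score depends only on the angle between the two embedding vectors, so it is only as meaningful as the embedding space itself, which is the concern raised above.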

vdpappu commented 5 years ago

Siamese networks need supervised samples, and generating such samples through a semi-supervised approach is not feasible. An alternative approach is to calculate the similarity between noun phrases and/or useful verbs in the sentences, which reduces noise by not considering the contributions of filler words.

Initial approach: https://github.com/etherlabsio/hinton/tree/experiment/sentence_relatedness/sentence_relatedness

A few issues to address:

Currently, the maximum similarity between cross-sentence noun phrases is taken as the relatedness score, which magnifies the similarity between two irrelevant tokens. This can be improved through mean pooling and/or phrase ranking.
Exploring features from intermediate layers, where the tokens attend to themselves instead of / tokens, could give more reliable scores.
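
The difference between the pooling choices in the first issue can be illustrated with a small sketch. It assumes each sentence is already represented as a list of phrase embedding vectors. The helper names (`pairwise_scores`, `max_pool`, `mean_of_best_matches`) and the toy vectors are hypothetical and are not taken from the hinton repository. Averaging each phrase's best match is only one possible reading of "mean pooling".

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pairwise_scores(phrases_a, phrases_b):
    """Similarity of every phrase in sentence A against every phrase in sentence B."""
    return [[cosine(a, b) for b in phrases_b] for a in phrases_a]


def max_pool(scores):
    """Single best cross-sentence match (the behaviour described as current)."""
    return max(max(row) for row in scores)


def mean_of_best_matches(scores):
    """Average, over the phrases of A, of each phrase's best match in B."""
    best = [max(row) for row in scores]
    return sum(best) / len(best)


# Toy 2-d "phrase embeddings" (made up for illustration).
sentence_a = [[1.0, 0.0], [0.0, 1.0]]
sentence_b = [[1.0, 0.0], [-1.0, 0.0]]

scores = pairwise_scores(sentence_a, sentence_b)
print(max_pool(scores))              # 1.0
print(mean_of_best_matches(scores))  # 0.5
```

In this toy case a single matching phrase pair drives the max-pooled score to 1.0, while the averaged score is pulled down to 0.5 by the phrase that has no good counterpart.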

vdpappu commented 5 years ago

Analyzing key-phrase scores across different layers to find the layer combination that gives stable results.

vdpappu commented 4 years ago

Worked on candidate key-phrase (KP) based similarity: https://github.com/etherlabsio/hinton/tree/key-phrase_scorer. Currently benchmarking on the open-source dataset https://www.kaggle.com/c/quora-question-pairs to quantify the performance gains.

vdpappu commented 4 years ago

QQP is not the right dataset for cosine similarity. I will analyze the results on: http://alt.qcri.org/semeval2014/task3/index.php?id=data-and-tools

shashankpr commented 4 years ago

Another dataset that is used for benchmarking similarity tasks: http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark
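
Benchmarks of this kind pair each sentence pair with a human-assigned similarity score, and systems are commonly compared by how well their predicted scores correlate with those gold scores. A minimal sketch of the Pearson correlation in plain Python is given below. The function name `pearson` and the sample numbers are illustrative assumptions, not results from this project.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up numbers: model similarity scores vs. human scores for four sentence pairs.
predicted = [0.9, 0.2, 0.6, 0.4]
gold = [4.5, 1.0, 3.0, 2.5]
print(pearson(predicted, gold))
```

Because the correlation is invariant to linear rescaling, cosine scores do not need to be mapped onto the gold score range before they are compared.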
