StephanAkkerman / FluentAI

Automating language learning with the power of Artificial Intelligence. This repository presents FluentAI, a tool that combines Fluent Forever techniques with AI-driven automation. It streamlines the process of creating Anki flashcards, making language acquisition faster and more efficient.

https://akkerman.ai/FluentAI/

MIT License

Semantic similarity #14

Closed by StephanAkkerman 1 month ago

StephanAkkerman commented 4 months ago

For the pipeline we want to be able to compare the phonetic word with the English translation of the foreign word.

Embeddings:

StephanAkkerman commented 4 months ago

Could also use word2vec: https://www.geeksforgeeks.org/python-word-embedding-using-word2vec/

StephanAkkerman commented 4 months ago

https://www.sbert.net/examples/applications/semantic-search/README.html
https://huggingface.co/sentence-transformers/all-mpnet-base-v2
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
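
Whichever embedding model is chosen, the comparison step comes down to scoring two vectors against each other. A minimal sketch of that step, using cosine similarity as the score (an assumption; the thread does not name a metric). The function names and the 3-dimensional vectors are made up for illustration. Real vectors would come from one of the models linked above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors score 0
    return dot / (norm_a * norm_b)


def rank_candidates(target_vec, candidates):
    """Return (word, score) pairs sorted by similarity to target_vec, best first."""
    scored = [
        (word, cosine_similarity(target_vec, vec))
        for word, vec in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors, invented for illustration only (not real model output).
translation_vec = [1.0, 0.0, 1.0]
candidates = {
    "close": [0.9, 0.1, 0.8],
    "unrelated": [0.0, 1.0, 0.0],
}
ranking = rank_candidates(translation_vec, candidates)
```

Here `ranking` lists the candidate phonetic words by how close their vectors are to the vector of the English translation, which is the comparison the pipeline needs.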

StephanAkkerman commented 1 month ago

We need a way to evaluate these embedding methods. Candidate datasets:
https://en.wikiversity.org/wiki/Word_similarity_dataset
https://github.com/MohamedAliHadjTaieb/Semantic-measure-assessment-review-study?tab=readme-ov-file#datasets

StephanAkkerman commented 1 month ago

WordSim-353: https://gabrilovich.com/resources/data/wordsim353/wordsim353.html
SimLex-999: https://fh295.github.io/simlex.html
SimVerb-3500: https://aclanthology.org/D16-1235/

StephanAkkerman commented 1 month ago

See this page for results from other models: https://aclweb.org/aclwiki/WordSimilarity-353_Test_Collection_(State_of_the_art)

StephanAkkerman commented 1 month ago

Evaluation Results:

   method    pearson_corr  spearman_corr
0  glove         0.234448       0.226851
1  fasttext      0.394658       0.376459
2  minilm        0.382895       0.370939
3  spacy         0.148808       0.144997

Conclusion: The best-performing model is 'fasttext', with a Pearson correlation of 0.3947 and a Spearman correlation of 0.3765. 'minilm' is close behind (0.3829 and 0.3709), while 'glove' and 'spacy' score clearly lower.
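
For reference, a minimal sketch of how Pearson and Spearman correlations like the ones reported above can be computed from human similarity ratings and model similarity scores. The thread does not say which dataset or library produced the numbers, so this is an assumption-laden illustration in plain Python. In practice `scipy.stats.pearsonr` and `scipy.stats.spearmanr` do the same job. The ratings and scores below are invented toy values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy data, invented for illustration: human ratings vs. model cosine scores.
human_ratings = [1.0, 2.0, 3.0, 4.0]
model_scores = [0.1, 0.2, 0.4, 0.8]
pearson_corr = pearson(human_ratings, model_scores)
spearman_corr = spearman(human_ratings, model_scores)
```

Spearman only looks at the ordering of the pairs, so a model that ranks word pairs in the same order as the human raters scores 1.0 even if its raw scores are not linearly related to the ratings. Pearson penalises that non-linearity.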