Closed StephanAkkerman closed 1 month ago
Could also use word2vec: https://www.geeksforgeeks.org/python-word-embedding-using-word2vec/
Need to add something to evaluate them:
https://en.wikiversity.org/wiki/Word_similarity_dataset
https://github.com/MohamedAliHadjTaieb/Semantic-measure-assessment-review-study?tab=readme-ov-file#datasets
Candidate datasets:
WordSim-353: https://gabrilovich.com/resources/data/wordsim353/wordsim353.html
SimLex-999: https://fh295.github.io/simlex.html
SimVerb-3500: https://aclanthology.org/D16-1235/
See this for results from other models: https://aclweb.org/aclwiki/WordSimilarity-353_Test_Collection_(State_of_the_art)
Evaluation Results:

| method   | pearson_corr | spearman_corr |
|----------|--------------|---------------|
| glove    | 0.234448     | 0.226851      |
| fasttext | 0.394658     | 0.376459      |
| minilm   | 0.382895     | 0.370939      |
| spacy    | 0.148808     | 0.144997      |
Conclusion: The best-performing model is 'fasttext', with a Pearson correlation of 0.3947 and a Spearman correlation of 0.3765.
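For reference, here is a minimal sketch of how correlation scores like these can be computed: score each word pair with the cosine similarity of its embeddings, then correlate the model scores with the human ratings. This is not the evaluation script used for the results above. The `evaluate` helper, the `toy_vectors` lookup, and the `toy_pairs` ratings are hypothetical placeholders, and a real run would load a dataset such as WordSim-353 and the actual model embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else float("nan")

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

def evaluate(embed, pairs):
    """pairs: (word1, word2, human_score) tuples; embed: word -> vector (hypothetical)."""
    human = [score for _, _, score in pairs]
    model = [cosine(embed[w1], embed[w2]) for w1, w2, _ in pairs]
    return pearson(model, human), spearman(model, human)

# Made-up vectors and ratings, for illustration only.
toy_vectors = {
    "car": [1.0, 0.0],
    "automobile": [1.0, 0.1],
    "road": [1.0, 1.0],
    "banana": [0.0, 1.0],
}
toy_pairs = [
    ("car", "automobile", 9.0),
    ("car", "road", 5.0),
    ("car", "banana", 1.0),
]

pearson_corr, spearman_corr = evaluate(toy_vectors, toy_pairs)
print(f"pearson={pearson_corr:.4f} spearman={spearman_corr:.4f}")
```

Spearman only looks at the ordering of the pairs, so it is less sensitive than Pearson to how a model scales its similarity scores. That is one reason to report both.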
For the pipeline we want to be able to compare the phonetic word with the English translation of the foreign word.
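A rough sketch of that comparison, assuming both words can be looked up in the same embedding table: compute the cosine similarity between the phonetic word and the English translation, and treat out-of-vocabulary words as "no score". The `semantic_similarity` and `rank_candidates` helpers and the `toy_vectors` values are hypothetical and only illustrate the idea. The real pipeline would plug in one of the models evaluated above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_similarity(phonetic_word, translation, embed):
    """Return the cosine similarity of the two words, or None if either is unknown."""
    if phonetic_word not in embed or translation not in embed:
        return None
    return cosine(embed[phonetic_word], embed[translation])

def rank_candidates(translation, candidates, embed):
    """Sort candidate phonetic words by similarity to the translation, best first."""
    scored = []
    for word in candidates:
        score = semantic_similarity(word, translation, embed)
        if score is not None:  # skip out-of-vocabulary candidates
            scored.append((word, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "house": [0.9, 0.1, 0.0],
    "home": [0.8, 0.2, 0.1],
    "horse": [0.1, 0.9, 0.2],
}

ranking = rank_candidates("house", ["horse", "home", "hose"], toy_vectors)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

Here "hose" is dropped because it has no vector in the toy table. Subword models such as fastText can often produce a vector for unseen words, which may matter if the phonetic candidates are rare.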
Embeddings: