StephanAkkerman closed this issue 1 month ago
Currently, five methods are implemented:
- Similarity between 'train' and 'brain' using 'difflib': 80.0
- Similarity between 'train' and 'brain' using 'rapidfuzz_ratio': 80.0
- Similarity between 'train' and 'brain' using 'rapidfuzz_partial_ratio': 88.88888888888889
- Similarity between 'train' and 'brain' using 'damerau_levenshtein': 80.0
- Similarity between 'train' and 'brain' using 'levenshtein': 80.0
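For reference, here is a minimal, standard-library-only sketch of how two of these scores can be reproduced. It is not the code used in this issue: `levenshtein_distance`, `difflib_similarity` and `levenshtein_similarity` are hypothetical helper names, and normalizing the edit distance by the length of the longer word is an assumption about how the 0-100 score is derived.

```python
from difflib import SequenceMatcher


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def difflib_similarity(a: str, b: str) -> float:
    """difflib's matching-blocks ratio, scaled to 0-100."""
    return SequenceMatcher(None, a, b).ratio() * 100


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return (1 - levenshtein_distance(a, b) / longest) * 100


for name, fn in [("difflib", difflib_similarity),
                 ("levenshtein", levenshtein_similarity)]:
    score = fn("train", "brain")
    print(f"Similarity between 'train' and 'brain' using '{name}': {score}")
```

'train' and 'brain' differ by a single substitution across five characters, which is why the distance-based scores land on 80.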
Decide which method is optimal by evaluating each one against this dataset: https://www.frontiersin.org/journals/education/articles/10.3389/feduc.2023.1225169/full
=== Evaluation Results ===

| Method | Pearson Correlation | Pearson p-value | Spearman Correlation | Spearman p-value |
| --- | --- | --- | --- | --- |
| difflib | 0.824475 | 4.826376e-118 | 0.821688 | 1.362026e-116 |
| rapidfuzz_ratio | 0.827050 | 2.095209e-119 | 0.821352 | 2.028354e-116 |
| rapidfuzz_partial_ratio | 0.695986 | 1.822764e-69 | 0.671505 | 4.816936e-63 |
| damerau_levenshtein | 0.854752 | 1.153570e-135 | 0.840769 | 4.560534e-127 |
| levenshtein | 0.853586 | 6.493697e-135 | 0.840757 | 4.634892e-127 |
=== Conclusion ===

The best orthographic similarity method is damerau_levenshtein.

- Pearson Correlation: 0.8548 (p-value: 1.15e-135)
- Spearman Correlation: 0.8408 (p-value: 4.56e-127)
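The comparison above can be sketched roughly as follows, using only the standard library. This is an illustration, not the evaluation script behind these numbers: the ratings and scores are hypothetical placeholder values (not taken from the dataset), `method_a` and `method_b` are made-up names, picking the winner by Pearson correlation is an assumption, and p-values are omitted because they would need something like SciPy.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical placeholder data: human similarity ratings per word pair,
# and the score each (made-up) method assigns to the same pairs.
human_ratings = [1.0, 2.5, 4.0, 5.5, 6.5]
method_scores = {
    "method_a": [10.0, 30.0, 45.0, 70.0, 90.0],
    "method_b": [50.0, 20.0, 60.0, 40.0, 80.0],
}

results = {
    name: (pearson(scores, human_ratings), spearman(scores, human_ratings))
    for name, scores in method_scores.items()
}
# Assumption: the "best" method is the one with the highest Pearson value.
best = max(results, key=lambda name: results[name][0])

for name, (r, rho) in results.items():
    print(f"{name}: pearson={r:.4f} spearman={rho:.4f}")
print(f"The best orthographic similarity method is {best}.")
```

Reporting both coefficients is useful because Pearson rewards a linear relationship with the human ratings, while Spearman only checks that the ordering of word pairs agrees.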
See also OLD20 (Yarkoni, Balota & Yap, 2008): http://germel.dyndns.org/psyling/pdf/2008_Yarkoni_Balota_Yap_OLD20.pdf -> implementation: https://github.com/stephantul/old20
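As a rough illustration of the idea behind OLD20: it is the mean Levenshtein distance from a word to its 20 closest neighbors in a lexicon. The sketch below does not use the linked repository or its API; `old_n` and the toy lexicon are hypothetical, and excluding the word itself from its own neighbors is an assumption.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def old_n(word: str, lexicon: list[str], n: int = 20) -> float:
    """Mean Levenshtein distance from `word` to its n closest lexicon entries.

    If the lexicon holds fewer than n other words, all of them are used.
    """
    distances = sorted(
        levenshtein_distance(word, other) for other in lexicon if other != word
    )
    closest = distances[:n]
    if not closest:
        raise ValueError("lexicon contains no other words")
    return sum(closest) / len(closest)


# Hypothetical toy lexicon; a real one would hold tens of thousands of words.
toy_lexicon = ["brain", "train", "grain", "rain", "trains", "table"]
print(old_n("train", toy_lexicon, n=3))
```

A low value means the word sits in a dense orthographic neighborhood; unlike the pairwise scores above, this is a property of a single word relative to a whole lexicon.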