Orthographic similarity

StephanAkkerman / FluentAI

Automating language learning with the power of Artificial Intelligence. This repository presents FluentAI, a tool that combines Fluent Forever techniques with AI-driven automation. It streamlines the process of creating Anki flashcards, making language acquisition faster and more efficient.

https://akkerman.ai/FluentAI/

MIT License


Orthographic similarity #15

Closed by StephanAkkerman 1 month ago

StephanAkkerman commented 4 months ago

OLD20 paper (Yarkoni, Balota & Yap, 2008): http://germel.dyndns.org/psyling/pdf/2008_Yarkoni_Balota_Yap_OLD20.pdf -> implementation: https://github.com/stephantul/old20
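For context, OLD20 is usually described as the mean Levenshtein distance from a word to its 20 closest neighbours in a lexicon. Below is a minimal sketch of that idea, not the code from the linked repository. The names `levenshtein`, `old_n`, and `toy_lexicon` are hypothetical, and the toy word list is far too small for a meaningful score.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def old_n(word: str, lexicon: Iterable[str], n: int = 20) -> float:
    """Mean Levenshtein distance from `word` to its n closest lexicon entries."""
    distances = sorted(
        levenshtein(word, other) for other in lexicon if other != word
    )
    closest = distances[:n]
    if not closest:
        raise ValueError("lexicon must contain at least one other word")
    return sum(closest) / len(closest)


# Toy lexicon (made up for illustration); a real OLD20 needs a full word list.
toy_lexicon = ["train", "brain", "grain", "drain", "rain", "trail", "banana"]
print(old_n("train", toy_lexicon, n=3))
```

A lower value means the word sits in a denser orthographic neighbourhood. With a real lexicon, `n` would stay at its default of 20.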

StephanAkkerman commented 2 months ago

Currently, three methods are implemented:

Difflib
Fuzzy
Levenshtein

StephanAkkerman commented 1 month ago

Similarity between 'train' and 'brain' using 'difflib': 80.0
Similarity between 'train' and 'brain' using 'rapidfuzz_ratio': 80.0
Similarity between 'train' and 'brain' using 'rapidfuzz_partial_ratio': 88.88888888888889
Similarity between 'train' and 'brain' using 'damerau_levenshtein': 80.0
Similarity between 'train' and 'brain' using 'levenshtein': 80.0
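To show where an 80.0 for 'train' / 'brain' can come from, here is a rough standard-library sketch of two of these scores. It is not the FluentAI code. The function names are hypothetical, the Damerau-Levenshtein variant used is the restricted "optimal string alignment" form, and normalising by the longer word's length is an assumption.

```python
from difflib import SequenceMatcher


def difflib_similarity(a: str, b: str) -> float:
    """SequenceMatcher ratio scaled to 0-100: 2 * matches / total length."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment: Levenshtein plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def edit_similarity(a: str, b: str) -> float:
    """Distance turned into a 0-100 similarity (assumed normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100 * (1 - osa_distance(a, b) / longest)


for name, func in [("difflib", difflib_similarity), ("osa", edit_similarity)]:
    print(f"{name}: {func('train', 'brain'):.1f}")
```

For this pair the two approaches agree: difflib matches the four characters of "rain" out of ten in total, and the edit distance is one substitution over five characters. They diverge on transpositions such as 'form' / 'from', where the transposition-aware distance is 1 and plain Levenshtein would give 2.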

StephanAkkerman commented 1 month ago

Decide which method is optimal, using this dataset: https://www.frontiersin.org/journals/education/articles/10.3389/feduc.2023.1225169/full

StephanAkkerman commented 1 month ago

=== Evaluation Results ===
Method                   Pearson Correlation  Pearson p-value  Spearman Correlation  Spearman p-value
difflib                  0.824475             4.826376e-118    0.821688              1.362026e-116
rapidfuzz_ratio          0.827050             2.095209e-119    0.821352              2.028354e-116
rapidfuzz_partial_ratio  0.695986             1.822764e-69     0.671505              4.816936e-63
damerau_levenshtein      0.854752             1.153570e-135    0.840769              4.560534e-127
levenshtein              0.853586             6.493697e-135    0.840757              4.634892e-127

=== Conclusion ===
The best orthographic similarity method is damerau_levenshtein.
Pearson Correlation: 0.8548 (p-value: 1.15e-135)
Spearman Correlation: 0.8408 (p-value: 4.56e-127)
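A comparison like the one above can be sketched as follows: score each word pair with every method, correlate the scores with the human ratings, and keep the method with the highest correlation. This is a pure-Python illustration, not the actual evaluation script. The function names and the tiny data set are made up, p-values are omitted, and picking the winner by Pearson correlation is an assumption.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def best_method(
    scores_by_method: Dict[str, Sequence[float]],
    human_ratings: Sequence[float],
) -> Tuple[str, Dict[str, Tuple[float, float]]]:
    """Return the method with the highest Pearson r, plus all (r, rho) pairs."""
    results = {
        name: (pearson(scores, human_ratings), spearman(scores, human_ratings))
        for name, scores in scores_by_method.items()
    }
    best = max(results, key=lambda name: results[name][0])
    return best, results


# Made-up ratings and scores for five word pairs, for illustration only.
human = [1, 2, 3, 4, 5]
scores = {
    "method_a": [10, 20, 30, 40, 50],
    "method_b": [10, 30, 20, 50, 40],
}
winner, table = best_method(scores, human)
print(winner)
for name, (r, rho) in table.items():
    print(f"{name}: pearson={r:.4f} spearman={rho:.4f}")
```

In the results above damerau_levenshtein and levenshtein are nearly tied on both measures, so the choice between them is not very sensitive to which correlation is used as the criterion.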