StephanAkkerman closed this issue 1 month ago
The current eng_latn_us_broad.tsv has 77k rows, while en_US has 125k rows. Switching could improve evaluation results but decrease lookup performance.
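For reference, a minimal sketch of loading a WikiPron-style two-column TSV (`word<TAB>IPA`) into a lookup dict — the format and the function name here are assumptions, not the repo's actual loader:

```python
import csv

def load_wikipron_tsv(path):
    """Load a two-column TSV (word<TAB>IPA) into a dict.

    Keeps the first pronunciation seen for each word and
    lowercases keys for case-insensitive lookup.
    """
    ipa = {}
    with open(path, encoding="utf-8") as f:
        for word, pron in csv.reader(f, delimiter="\t"):
            ipa.setdefault(word.lower(), pron)
    return ipa
```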
This will only improve search quality for the top x words.
With the new dataset & fallback:

| method | pearson_corr | spearman_corr |
| --- | --- | --- |
| panphon | 0.418678 | 0.474388 |
| clts | 0.353249 | 0.357871 |

Without fallback:

| method | pearson_corr | spearman_corr |
| --- | --- | --- |
| panphon | 0.599606 | 0.604186 |
| clts | 0.515782 | 0.491983 |

When only using the G2P model:

| method | pearson_corr | spearman_corr |
| --- | --- | --- |
| panphon | 0.376618 | 0.444363 |
| clts | 0.277380 | 0.355129 |
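The dictionary-plus-fallback setup being compared above can be sketched as follows (names are hypothetical; `g2p` stands in for whatever G2P callable is actually in use):

```python
def lookup_ipa(word, ipa_dict, g2p):
    """Return the IPA transcription for a word.

    Tries the pronunciation dictionary first; only falls back to
    the (slower, noisier) G2P model for out-of-vocabulary words.
    """
    pron = ipa_dict.get(word.lower())
    if pron is not None:
        return pron
    return g2p(word)
```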
After scaling with a min-max scaler:

eng_latn_us_broad:

| method | pearson_corr | spearman_corr |
| --- | --- | --- |
| panphon | 0.359480 | 0.382383 |
| clts | 0.397452 | 0.417966 |

| method | pearson_corr | spearman_corr |
| --- | --- | --- |
| panphon | 0.418678 | 0.474388 |
| clts | 0.353249 | 0.357871 |
en_US has better evaluation performance, but lower speed.
A second option would be to use https://github.com/open-dict-data/ipa-dict/blob/master/data/en_US.txt, which is also used by the G2P model. We should see if this improves eval performance:
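The ipa-dict files pair each headword with slash-delimited transcriptions, comma-separated when there are variants (e.g. `word<TAB>/ipa1/, /ipa2/`); a parsing sketch under that assumption:

```python
def parse_ipa_dict_line(line):
    """Parse one ipa-dict line into (word, [transcriptions]).

    Assumes the 'word<TAB>/ipa1/, /ipa2/' layout and strips the
    surrounding slashes from each variant.
    """
    word, prons = line.rstrip("\n").split("\t")
    return word, [p.strip().strip("/") for p in prons.split(",")]
```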
With fallback:

| method | pearson_corr | spearman_corr |
| --- | --- | --- |
| panphon | 0.623008 | 0.618722 |
| clts | 0.400138 | 0.400633 |

Without fallback:

| method | pearson_corr | spearman_corr |
| --- | --- | --- |
| panphon | 0.542018 | 0.537695 |
| clts | 0.610213 | 0.605076 |
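For completeness, the two correlation metrics reported in these tables can be computed without SciPy — a plain sketch (Spearman here does no tie correction, so it matches `scipy.stats.spearmanr` only when all values are distinct):

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def spearman(x, y):
    """Spearman correlation: Pearson on the rank-transformed data."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r
    return pearson(ranks(x), ranks(y))
```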