UAlbertaALTLab / morphodict

The Language Independent Intelligent Dictionary
https://morphodict.readthedocs.io/
Apache License 2.0

Update dictionary-based morpheme log-frequency rankings (and corpus-based lemma rankings as well) #1040

Closed aarppe closed 2 years ago

aarppe commented 2 years ago

The ALTLab repo now has a revised version of the entry-specific aggregated and individual morpheme log-frequencies (along with morpheme counts), available in: crk/generated/CW_aggregate_morpheme_log_freqs.tsv

This was created with the script crk/bin/extract-morpheme-frequencies.sh, using the following command:

crk/bin/extract-morpheme-frequencies.sh ../PlainsLexUni/CreeDict-x > crk/generated/CW_aggregate_morpheme_log_freqs.tsv

Note that entries that occur only in MD (or any other dictionary) will not get ranked. For those entries we need to come up with some default strategy, perhaps based on character length, using the mean weights of CW entries of the corresponding length, or something else.
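One way the length-based fallback could look, as a minimal sketch only: it assumes the CW weights have already been loaded into a plain dict of headword to weight (the TSV column layout is not specified here, so no parsing is shown), and the function names and sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_weight_by_length(cw_weights):
    """Map each headword length (in characters) to the mean weight
    of the CW entries that have that length."""
    buckets = defaultdict(list)
    for headword, weight in cw_weights.items():
        buckets[len(headword)].append(weight)
    return {length: mean(ws) for length, ws in buckets.items()}


def default_weight(headword, length_means):
    """Fallback weight for an entry that is missing from CW: take the mean
    for the nearest observed length (ties go to the shorter length)."""
    if not length_means:
        raise ValueError("no CW weights available")
    nearest = min(length_means, key=lambda n: (abs(n - len(headword)), n))
    return length_means[nearest]


# Hypothetical weights, for illustration only (not taken from the real TSV).
cw_weights = {"atim": -2.0, "nipiy": -3.0, "maskwa": -4.0, "mistik": -5.0}
length_means = mean_weight_by_length(cw_weights)

# An entry found only in MD gets the mean weight of CW entries of its length.
fallback = default_weight("amisk", length_means)
```

Headwords longer or shorter than anything in CW fall back to the nearest observed length, so every entry gets some weight. Whether length is a good proxy at all is an open question for this issue.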

With this, we should now have all the components that the linguists can bring to the table for updating and revising the relevance ranking of the search results. Note that the corpus-based form/lemma frequencies are to be found in the ALTLab repo here:

crk/generated/ahenakew_wolfart_bloomfield.fst+cg.freq-sorted.txt

We may want to consider whether the survey results ought to be used for specifying core vocabulary. We will also need to implement POS-matching, in particular between the results of the English search phrase analysis and the dictionary entries.
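As a rough illustration of what POS-matching could mean for ranking, here is a minimal sketch. It assumes the English phrase analysis yields a coarse POS tag such as "N" or "V", and that dictionary entries carry word-class labels whose first letter reflects that coarse POS (e.g. NA, NI, VTA, VAI). The function names and the tuple layout of the results are hypothetical, not morphodict's actual data model.

```python
def pos_matches(query_pos, entry_wordclass):
    """True when the entry's word class falls under the coarse POS
    inferred for the query (assumed convention: shared leading letter)."""
    return entry_wordclass.upper().startswith(query_pos.upper())


def rerank_by_pos(query_pos, results):
    """Stable re-ranking: POS-matching entries move to the front,
    and the existing order is preserved within each group."""
    return sorted(results, key=lambda r: not pos_matches(query_pos, r[1]))


# Hypothetical search results as (headword, word class) pairs.
results = [("atim", "NA"), ("wâpamêw", "VTA"), ("nipiy", "NI")]
reranked = rerank_by_pos("V", results)
```

Because the sort is stable, this only promotes matching entries and leaves any frequency-based ordering intact within the matching and non-matching groups.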