UAlbertaALTLab / morphodict

Plains Cree Intelligent Dictionary
https://itwewina.altlab.app/
Apache License 2.0

Create a script for generating a reverse frequency list of crk word-forms and their analyses and lemmas #1041

Closed aarppe closed 2 years ago

aarppe commented 2 years ago

For future reference: to standardize how we create the reverse frequency list of crk word-forms with their analyses and lemmas, which is used to rank the relevance of search results, the ALTLab repo now contains the script crk/bin/generate-a-w-b-wordform-lemma-anl-frequency-list.sh. It can be used as follows:

crk/bin/generate-a-w-b-wordform-lemma-anl-frequency-list.sh crk/corpora ~/giellalt/lang-crk | less

The results can be stored into the following file in the ALTLab repo: crk/generated/ahenakew_wolfart_bloomfield.fst+cg.freq-sorted.txt
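As a rough sketch of the storing step, the output can be redirected into that file instead of being piped to less. This assumes the command is run from the root of the ALTLab repo and that the script is executable; the temporary-file step and the variable names are my own additions, not part of the script.

```shell
# Paths are the ones named in this issue; the variables are just for readability.
script=crk/bin/generate-a-w-b-wordform-lemma-anl-frequency-list.sh
out=crk/generated/ahenakew_wolfart_bloomfield.fst+cg.freq-sorted.txt

if [ -x "$script" ]; then
  # Write to a temporary file first, so a failed run does not
  # overwrite the previously generated list.
  "$script" crk/corpora ~/giellalt/lang-crk > "$out.tmp" && mv "$out.tmp" "$out"
else
  # Assumption: a missing script most likely means the wrong working directory.
  echo "script not found: $script (run from the ALTLab repo root)" >&2
fi
```

Writing via a temporary file is optional; a plain `> "$out"` redirect works as well if overwriting the old list on failure is not a concern.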

The script can be rerun whenever substantial changes have been made to the crk FST, or if we want to add subcorpora other than the currently included Ahenakew-Wolfart and Bloomfield texts.
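For anyone unfamiliar with the term, the general idea of a reverse frequency list can be illustrated with standard Unix tools. This is only a toy sketch and not what the script above actually does: it assumes "reverse" means most frequent first, the function name `reverse_freq_list` is hypothetical, and the input lines are placeholders rather than real crk word-forms or analyses.

```shell
# Hypothetical helper: count identical lines on stdin and print them
# most frequent first, as "count<TAB>line".
reverse_freq_list() {
  tab=$(printf '\t')
  # Count each distinct line, then sort by count (descending),
  # breaking ties by the line itself in byte order.
  awk '{ n[$0]++ } END { for (k in n) print n[k] "\t" k }' |
    LC_ALL=C sort -t "$tab" -k1,1nr -k2
}

# Placeholder input: one analysed token per line.
printf '%s\n' formB formA formB formC formB formA | reverse_freq_list
```

In the real list each entry would carry the word-form together with its lemma and analysis, so that the search ranking can look up how often a given analysis occurs in the corpora.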