UAlbertaALTLab / crk-db

Managing the Plains Cree dictionary database
https://itwewina.altlab.app/
GNU General Public License v3.0

Document steps for creating dictionary FSTs from updated LEXC source #109

Open aarppe opened 9 months ago

aarppe commented 9 months ago

The following are explicit instructions on creating a descriptive analyzer and normative generator (with morpheme boundaries) from updated LEXC source (undertaken in #108):

  1. Create basic morphological model

If the aggregate LEXC file, lexicon.lexc (formerly lexicon.tmp.lexc), has already been compiled with the regular GiellaLT compilation scheme, that file can be used as the primary source.

read lexc src/fst/morphology/lexicon.lexc
define Morphology

Otherwise, one can build the aggregate file by concatenating the individual LEXC sources before running read lexc (pointing read lexc at wherever the output file is written):

cat src/fst/root.lexc \
    src/fst/stems/noun_stems.lexc \
    src/fst/morphology/stems/verb_stems.lexc \
    src/fst/morphology/stems/particles.lexc \
    src/fst/morphology/stems/pronouns.lexc \
    src/fst/morphology/stems/numerals.lexc \
    src/fst/morphology/affixes/noun_affixes.lexc \
    src/fst/morphology/affixes/verb_affixes.lexc \
    > lexicon.lexc

  2. Create basic phonological model

source src/fst/phonology.xfscript
define Phonology

  3. Create filters for removing a) word fragments and b) orthographically non-standard forms.

regex ~[ $[ "+Err/Frag" ]];
define removeFragments

regex ~[ $[ "+Err/Orth" ]];
define removeNonStandardForms

  4. Create filter to select only forms belonging to dictionary parts-of-speech.

regex $[ "+N" | "+V" | "+Ipc" | "+Pron" ];
define selectDictPOS

  5. Compose normative generator.

set flag-is-epsilon ON
regex [ selectDictPOS .o. removeNonStandardForms .o. removeFragments .o. Morphology .o. Phonology ];
save stack generator-gt-dict-norm.hfst
define NormativeGenerator

  6. Specify transcriptor to remove special morpheme boundary characters.

regex [ [ "<" | ">" | "/" ] -> 0 ];
define removeBoundaries

  7. Load in basic model for spell relaxation.

load src/orthography/spellrelax.compose.hfst
define SpellRelax

  8. Compose descriptive analyzer

regex [ selectDictPOS .o. removeFragments .o. Morphology .o. Phonology .o. removeBoundaries .o. SpellRelax ];
# regex [ NormativeGenerator .o. removeBoundaries .o. SpellRelax ];
invert net
save stack analyser-gt-dict-desc.hfst
define DescriptiveAnalyser
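
To make the filter semantics above concrete, here is a rough Python model of what removeFragments, removeNonStandardForms and selectDictPOS accept on the analysis side. It is only a sketch, not part of the FST build: it assumes an analysis can be treated as a list of symbols in which each tag is a single multi-character symbol, and the function names and example tag sequences are made up for illustration.

```python
# Simplified model of the analysis-side filters. An analysis is a list of
# symbols; each tag such as "+N" or "+Err/Orth" is ONE symbol, so "+N" does
# not match inside a different tag such as "+Num".
DICT_POS = {"+N", "+V", "+Ipc", "+Pron"}


def select_dict_pos(symbols):
    """Mirror of $[ "+N" | "+V" | "+Ipc" | "+Pron" ]: some POS tag occurs."""
    return any(s in DICT_POS for s in symbols)


def remove_fragments(symbols):
    """Mirror of ~[ $[ "+Err/Frag" ]]: the fragment tag does not occur."""
    return "+Err/Frag" not in symbols


def remove_non_standard_forms(symbols):
    """Mirror of ~[ $[ "+Err/Orth" ]]: the non-standard tag does not occur."""
    return "+Err/Orth" not in symbols


def passes_normative_filters(symbols):
    """Filters composed in front of the normative generator."""
    return (select_dict_pos(symbols)
            and remove_non_standard_forms(symbols)
            and remove_fragments(symbols))


def passes_descriptive_filters(symbols):
    """Filters composed in front of the descriptive analyzer (no +Err/Orth filter)."""
    return select_dict_pos(symbols) and remove_fragments(symbols)


if __name__ == "__main__":
    # Hypothetical tag sequences, not taken from the actual lexicon.
    for analysis in (["stem", "+N", "+Sg"],
                     ["stem", "+N", "+Err/Orth"],
                     ["stem", "+V", "+Err/Frag"],
                     ["stem", "+Num"]):
        print(analysis,
              passes_normative_filters(analysis),
              passes_descriptive_filters(analysis))
```

The only difference between the two compositions is the +Err/Orth filter: the descriptive analyzer keeps orthographically non-standard forms, while the normative generator excludes them.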

Normally, the necessary FSTs would be created according to the standard GiellaLT compilation configuration, with the option --enable-dicts.

fbanados commented 1 month ago

Note: Special morpheme boundary characters may need to also be removed from the Normative Generator FST.

fbanados commented 1 month ago

After discussion with @aarppe, it was established that the expected behaviour for generator FSTs should be to include special morpheme boundary characters, and it is the job of the app to discard them when irrelevant. As shown in the instructions in this thread, it is ok for analyser FSTs to drop them.
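
For illustration, discarding the boundary characters on the app side could look something like the sketch below. This is not the app's actual code: the function name is hypothetical, and the character set is simply the one used in the removeBoundaries rule in this thread ("<", ">" and "/").

```python
# Minimal sketch: strip the special morpheme boundary characters from a
# wordform returned by the generator FST. The characters mirror the
# removeBoundaries rule: "<", ">" and "/".
_BOUNDARY_TABLE = str.maketrans("", "", "<>/")


def strip_morpheme_boundaries(wordform: str) -> str:
    """Return the wordform with all boundary characters deleted."""
    return wordform.translate(_BOUNDARY_TABLE)


if __name__ == "__main__":
    # "pre<stem>suf" is a made-up placeholder, not real generator output.
    print(strip_morpheme_boundaries("pre<stem>suf"))
```

Keeping the boundaries in the generator output and stripping them at display time means the app can still use them where they are relevant, for example to show morpheme breaks.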