UAlbertaALTLab / morphodict

Plains Cree Intelligent Dictionary
https://itwewina.altlab.app/
Apache License 2.0

Create a simple process for updating crk FST into itwêwina #257

Open aarppe opened 4 years ago

aarppe commented 4 years ago

This is to create a more permanent solution for #256. Since the crk FST will be updated based on CW updates and modifications to the affixation, we need a streamlined process that updates the itwêwina FSTs (descriptive analyzer and normative generator) and the content that is generated with the FSTs (paradigm content).

As part of that process, we would need some diagnostics for checking that the change won't break the functionality of itwêwina. Likely candidates are the paradigms in giella/langs/crk/test/src/gt-norm-yamls.
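One way such a diagnostic could look is sketched below in Python: compare the forms a generator produces against the expected forms recorded for each analysis, and report anything missing or unexpected. This is only an illustration under stated assumptions. The function names, the table-backed stand-in generators, and the analysis strings and word forms are all hypothetical placeholders, not verified FST output or the actual contents of the gt-norm-yamls files. A real check would read the expected forms from those YAML files and call the compiled generator instead.

```python
from typing import Callable, Dict, Set

Report = Dict[str, Dict[str, Set[str]]]


def diff_paradigm(expected: Dict[str, Set[str]],
                  generate: Callable[[str], Set[str]]) -> Report:
    """Return, per analysis, the expected forms that were not generated
    ("missing") and the generated forms that were not expected ("extra").
    Analyses whose output matches exactly are left out of the report."""
    problems: Report = {}
    for analysis, want in expected.items():
        got = generate(analysis)
        missing = want - got
        extra = got - want
        if missing or extra:
            problems[analysis] = {"missing": missing, "extra": extra}
    return problems


def table_generator(table: Dict[str, Set[str]]) -> Callable[[str], Set[str]]:
    """Hypothetical stand-in for an FST lookup: answers from a dict."""
    return lambda analysis: set(table.get(analysis, set()))


# Placeholder data (assumption): analysis strings and forms are illustrative.
EXPECTED = {
    "lemma+V+AI+Ind+3Sg": {"form-a"},
    "lemma+V+AI+Ind+1Sg": {"form-b"},
}
OLD_FST = {
    "lemma+V+AI+Ind+3Sg": {"form-a"},
    "lemma+V+AI+Ind+1Sg": {"form-b"},
}
NEW_FST = {
    "lemma+V+AI+Ind+3Sg": {"form-a"},
    "lemma+V+AI+Ind+1Sg": {"form-x"},  # simulated regression
}

if __name__ == "__main__":
    for name, table in (("old", OLD_FST), ("new", NEW_FST)):
        report = diff_paradigm(EXPECTED, table_generator(table))
        print(name, "OK" if not report else report)
```

Running the same comparison against the FSTs before and after an update would show whether the update changed any paradigm cell that the test files cover.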

aarppe commented 2 years ago

@andrewdotn @nienna73 I believe this has now been implemented in practice, with or in conjunction with incremental import? Whatever remains could be noted here; otherwise, this and the associated issues can be considered closed?

aarppe commented 1 year ago

The following XFSCRIPT code should generate the normative generator (with morpheme boundaries) and the descriptive analyzer for crk from the elements existing in giellalt/lang-crk/. What is needed is the latest full lexicon (in lexicon.tmp.lexc, or lexicon.hfst / lexicon.fomabin), the phonological rules (in phonology.xfscript), and the composable version of the spell-relax rules (in spellrelax.compose.hfst).

# Compile the full lexicon (alternatively, load the precompiled lexicon.hfst).
read lexc src/fst/lexicon.tmp.lexc
# load src/fst/lexicon.hfst
define Morphology

source src/fst/phonology.xfscript
define Phonology

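# Filters on the analysis side.
# Exclude analyses tagged as fragments.
regex ~[ $[ "+Err/Frag" ]];
define removeFragments

# Exclude analyses tagged as non-standard orthography.
regex ~[ $[ "+Err/Orth" ]];
define removeNonStandardForms

# Drop the +Err/Orth tag from analyses while keeping the forms themselves.
regex [ 0 <- "+Err/Orth" ];
define deleteErrOrthTag

# Exclude non-standard forms of nouns and verbs only (defined but not used below).
regex ~[ $[ [ "+N" | "+V" ] ?* "+Err/Orth" ]];
define removeNonStandardNounVerbForms

# Keep only analyses that contain one of the dictionary parts of speech.
regex $[ "+N" | "+V" | "+Ipc" | "+Pron" ];
define selectDictPOS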

# Normative generator: standard forms only, morpheme boundaries retained.
set flag-is-epsilon ON
regex [ selectDictPOS .o. removeNonStandardForms .o. removeFragments .o. Morphology .o. Phonology ];
save stack generator-gt-dict-norm.hfst
define NormativeGenerator

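# Remove the morpheme boundary symbols from surface forms.
regex [ [ "<" | ">" | "/" ] -> 0 ];
define removeBoundaries

load src/orthography/spellrelax.compose.hfst
define SpellRelax

# Descriptive analyser: non-standard forms are kept (with the tag dropped), boundaries
# are removed, spell-relax is applied, and the net is inverted to map forms to analyses.
regex [ deleteErrOrthTag .o. selectDictPOS .o. removeFragments .o. Morphology .o. Phonology .o. removeBoundaries .o. SpellRelax ];
# regex [ NormativeGenerator .o. removeBoundaries .o. SpellRelax ];
invert net
save stack analyser-gt-dict-desc.hfst
define DescriptiveAnalyser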

@nienna73 This should fix at least some obvious accumulated glitches for itwêwina, adding -im- to the most obvious possessed nouns, but there is still some substantial revision I need to complete.

The above could be used to generate the dictionary versions of the two FSTs from the shared source, while keeping everything else the same.
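Since the script reads three input files, a small pre-flight check before each rebuild may save a confusing failure partway through. The Python sketch below lists which of those inputs are absent from a checkout. The relative paths are the ones named in the script above; the function name and the idea of running this as a separate step are assumptions for illustration, not part of the existing build.

```python
from pathlib import Path
from typing import List

# Inputs read by the xfscript above, relative to the lang-crk checkout.
REQUIRED_INPUTS = (
    "src/fst/lexicon.tmp.lexc",
    "src/fst/phonology.xfscript",
    "src/orthography/spellrelax.compose.hfst",
)


def missing_inputs(repo_root: str) -> List[str]:
    """Return the required input paths that are not regular files
    under repo_root, in the order they are listed above."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_INPUTS if not (root / rel).is_file()]


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass the checkout directory as the first argument.
    checkout = sys.argv[1] if len(sys.argv) > 1 else "."
    absent = missing_inputs(checkout)
    if absent:
        print("Missing inputs:", ", ".join(absent))
        sys.exit(1)
    print("All inputs present.")
```

If the precompiled lexicon.hfst is used in place of lexicon.tmp.lexc, the first entry in the tuple would need to be changed accordingly.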

aarppe commented 1 year ago

This requires a description of the steps needed to create updated LEXC source from the various dictionary sources. This is documented here: https://github.com/UAlbertaALTLab/crk-db/issues/108