UAlbertaALTLab / crk-db

Managing the Plains Cree dictionary database
https://itwewina.altlab.app/
GNU General Public License v3.0

Document steps for updating LEXC source for FSTs #108

Open aarppe opened 11 months ago

aarppe commented 11 months ago

The following are the individual steps needed to update the LEXC source used for the itwêwina (and other) FSTs (the compilation of which is outlined in #109).

  1. Update Cree Words (CW) source file CreeDict-x in Carleton repo

    • svn up
  2. Remove Windows-style CR characters from CW source, and copy this over to ALTLab repo

    • cat PlainsLexUni/CreeDict-x | tr -d '\r' > altlab/crk/dicts/Wolvengrey_altlab.toolbox
  3. Convert this Toolbox file into TSV format:

    • cat altlab/crk/dicts/Wolvengrey_altlab.toolbox | altlab/crk/bin/toolbox2tsv.sh > altlab/crk/generated/Wolvengrey_altlab.tsv
  4. Compare against Maskwacîs Dictionary content, and add unique entries (and associated stem and inflectional class information) after the CW entries:

    • altlab/crk/bin/add-md-entries-2-after-cw-tsv.sh altlab/crk/generated/Wolvengrey_altlab.tsv altlab/crk/dicts/Maskwacis_altlab.tsv > altlab/crk/generated/altlab.tsv
  5. Generate LEXC source for individual parts-of-speech from this ALTLab aggregated TSV file:

    • cat altlab/crk/generated/altlab.tsv | altlab/crk/bin/altlab2lexc.sh 'N' > altlab/crk/generated/noun_stems.lexc
    • cat altlab/crk/generated/altlab.tsv | altlab/crk/bin/altlab2lexc.sh 'V' > altlab/crk/generated/verb_stems.lexc
  6. Add copyright headers to the LEXC sources, and copy the results over to giellalt/lang-crk/src/fst/morphology/stems/

    • cat giellalt/lang-crk/src/fst/morphology/stems/noun_header.lexc altlab/crk/generated/noun_stems.lexc > giellalt/lang-crk/src/fst/morphology/stems/noun_stems.lexc
    • cat giellalt/lang-crk/src/fst/morphology/stems/verb_header.lexc altlab/crk/generated/verb_stems.lexc > giellalt/lang-crk/src/fst/morphology/stems/verb_stems.lexc
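
As a rough illustration of steps 2 and 6 above, here is a minimal sketch that can be run on throwaway files. The function names `strip_cr` and `prepend_header`, the temporary file names, and the file contents are all hypothetical and are not part of the ALTLab scripts. The non-empty check in `prepend_header` is an added assumption: because each step redirects output with `>`, a failed upstream step can leave an empty file that would otherwise be copied over silently.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Step 2 (sketch): strip Windows-style carriage returns; reads stdin, writes stdout.
strip_cr() {
  tr -d '\r'
}

# Step 6 (sketch): prepend a header file to a generated LEXC file.
# Assumption: refuse to continue if the generated file is empty or missing,
# since that usually means an earlier step failed.
prepend_header() {
  local header="$1" body="$2" target="$3"
  if [ ! -s "$body" ]; then
    echo "error: $body is empty or missing" >&2
    return 1
  fi
  cat "$header" "$body" > "$target"
}

# Demonstration on throwaway files (hypothetical contents, not real dictionary data).
workdir="$(mktemp -d)"
printf '\\lx hypothetical-entry\r\n\\ps N\r\n' > "$workdir/source.toolbox"
strip_cr < "$workdir/source.toolbox" > "$workdir/clean.toolbox"

printf '! hypothetical header\n' > "$workdir/noun_header.lexc"
printf 'LEXICON HypotheticalStems\n' > "$workdir/generated.lexc"
prepend_header "$workdir/noun_header.lexc" "$workdir/generated.lexc" "$workdir/noun_stems.lexc"

# Count any carriage returns left after step 2, and show the first line of the result.
cr_count="$(tr -cd '\r' < "$workdir/clean.toolbox" | wc -c | tr -d ' ')"
echo "CRs remaining: $cr_count"
head -n 1 "$workdir/noun_stems.lexc"
```

With the real paths from the steps above, the same two functions would be called on PlainsLexUni/CreeDict-x and on the generated noun_stems.lexc and verb_stems.lexc files; `set -euo pipefail` makes the run stop at the first failing step rather than carrying on with partial output.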
aarppe commented 11 months ago

There's now a shell script that does all of the above steps in one go: altlab/crk/bin/update-crk-dictionary-sources-2-lexc.sh.

aarppe commented 10 months ago

@M1Al3x When incorporating updated dictionary content into itwêwina, the process outlined above to update the LEXC source, and thus the FSTs, needs to be done first, before these same dictionary sources are processed into *.importjson for uploading into itwêwina. Thus, the steps are:

  1. Update dictionary sources
  2. Update LEXC sources based on updated dictionary sources
  3. Compile new FSTs in giellalt/lang-crk
  4. Aggregate and process dictionary sources into *.importjson for uploading to the intelligent dictionary
  5. Update the internal database for the intelligent dictionary, including any generation of paradigm forms or English translation equivalents.

@M1Al3x This issue describes steps 1-2 above. The front page Readme.Md has the description for steps 4-5 above.