UAlbertaALTLab / korp-frontend

Frontend for Korp, a frontend for the IMS Open Corpus Workbench (CWB).
https://spraakbanken.gu.se/en/tools/korp
MIT License

incorporate Bloomfield's texts #24

Open dwhieb opened 3 years ago

dwhieb commented 3 years ago

Add Bloomfield's texts as an additional corpus.

dwhieb commented 3 years ago

@katieschmirler informs me these are ready for import into Korp!

aarppe commented 3 months ago

@fbanados A Korp version of the Bloomfield texts can be found here: altlab/crk/generated/bloomfield_fst+cg+gloss.vrt

This is created with the following invocation:

cat corpora/bloomfield.korp-vrt | gawk '{ if(match($0, ".+~$")!=0) sub("~$",""); print; }' | bin/fst-cg-analyze-vrt.sh analyser-gt-strict.hfstol /Users/arppe/gt/lang-crk/src/cg3/disambiguator.cg3 analyser-gt-relaxed.hfstol /Users/arppe/gt/lang-crk/src/cg3/functions.cg3 generator-gt-strict.hfstol | bin/vrt2korp.sh > generated/bloomfield_fst+cg+gloss.vrt
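For anyone not fluent in gawk: the first filter in that pipeline only strips a single trailing `~` from lines that have at least one other character before it, and passes everything else through unchanged. A minimal Python sketch of just that step follows. The function names are hypothetical and not part of the repository, and the later stages (`fst-cg-analyze-vrt.sh`, `vrt2korp.sh`) are not reproduced here.

```python
def strip_trailing_tilde(line: str) -> str:
    """Mirror the gawk test match($0, ".+~$"): drop one trailing '~',
    but only when at least one character precedes it."""
    if len(line) > 1 and line.endswith("~"):
        return line[:-1]
    return line


def clean_lines(lines):
    """Apply the filter to an iterable of lines, dropping line terminators
    first (awk's $0 never includes the newline)."""
    return [strip_trailing_tilde(line.rstrip("\n")) for line in lines]


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python strip_tilde.py < corpora/bloomfield.korp-vrt
    for cleaned in clean_lines(sys.stdin):
        print(cleaned)
```

Note that a line consisting of `~` alone is left as-is, which matches the `.+` in the original pattern.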

This is largely the same as the Ahenakew-Wolfart (A-W) corpus, except that it has only three structural levels: <corpus>, <subcorpus> (2 values), and <text> (the tens of individual texts). The lang field is defined at the corpus level, whereas in the A-W corpus it is defined for each text, so this may need changing.
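To see at a glance where the lang field sits in a given VRT file, something like the following sketch could help. It assumes the structural tags are written one per line as `<name attr="value" ...>` and that the attribute is literally called `lang`; the element names come from the description above, while the function name and the sample data are hypothetical.

```python
import re

# An opening structural tag on its own line, e.g. <text title="..." lang="crk">
TAG_RE = re.compile(r'<(\w+)((?:\s+[\w.-]+="[^"]*")*)\s*>')
ATTR_RE = re.compile(r'([\w.-]+)="([^"]*)"')


def lang_levels(lines):
    """Count, per element name, how many opening tags carry a lang attribute."""
    counts = {}
    for line in lines:
        m = TAG_RE.fullmatch(line.strip())
        if m is None:
            continue  # token line, closing tag, or anything else
        name = m.group(1)
        attrs = dict(ATTR_RE.findall(m.group(2)))
        if "lang" in attrs:
            counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    # Made-up sample in the three-level layout described above.
    sample = [
        '<corpus lang="crk">',
        '<subcorpus name="A">',
        '<text title="t1">',
        "word\tlemma",
        "</text>",
        "</subcorpus>",
        "</corpus>",
    ]
    print(lang_levels(sample))
```

Running it on both the Bloomfield and A-W files would show whether lang appears under `corpus` or under `text`, which is the difference that may need reconciling.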

Eventually I would probably want to make more use of the underlying XML sources (e.g. the word-specific as well as sentence-specific translations, which would add fields to the linguistic analyses), but incorporating this version could be a good start.