Open dwhieb opened 3 years ago
@katieschmirler informs me these are ready for import into Korp!
@fbanados A Korp version of the Bloomfield texts can be found here: altlab/crk/generated/bloomfield_fst+cg+gloss.vrt
This is created with the following invocation:
cat corpora/bloomfield.korp-vrt | gawk '{ if(match($0, ".+~$")!=0) sub("~$",""); print; }' | bin/fst-cg-analyze-vrt.sh analyser-gt-strict.hfstol /Users/arppe/gt/lang-crk/src/cg3/disambiguator.cg3 analyser-gt-relaxed.hfstol /Users/arppe/gt/lang-crk/src/cg3/functions.cg3 generator-gt-strict.hfstol | bin/vrt2korp.sh > generated/bloomfield_fst+cg+gloss.vrt
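For anyone reading the pipeline above: as far as I can tell, the first `gawk` stage only strips a single trailing `~` from lines that have at least one character before it, and passes everything else through unchanged. Here is a minimal sketch of that one stage in isolation. The sample lines are made up (not taken from the Bloomfield corpus), `strip_trailing_tilde` is a hypothetical helper name, and it uses plain `awk` with regex literals instead of `gawk` with string patterns, which should behave the same for this case.

```shell
#!/bin/sh
# Hypothetical helper mirroring the first stage of the invocation above:
# remove one trailing "~" from a line, but only if something precedes it.
strip_trailing_tilde() {
  awk '{ if (match($0, /.+~$/) != 0) sub(/~$/, ""); print }'
}

# Made-up sample input: a line ending in "~", a line with "~" in the
# middle, and a line consisting of "~" alone.
printf '%s\n' 'abc~' 'a~b' '~' | strip_trailing_tilde
# prints:
# abc
# a~b
# ~
```

Note that a line consisting of `~` alone is left untouched, because the `.+` in the pattern requires at least one preceding character.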
This is largely the same as the Ahenakew-Wolfart corpus, except it has only three levels: <corpus>, <subcorpus> (2 values), and <text> (the tens of individual texts). The lang field is defined at the corpus level, whereas in the A-W corpus that is defined for each text, which may need changing.
I probably would eventually want to make more use of the underlying XML sources (e.g. the word-specific as well as sentence-specific translations, which would add fields to the linguistic analyses), but incorporating this could be a good start.
Add Bloomfield's texts as an additional corpus.