incorporate Bloomfield's texts

dwhieb commented 3 years ago

Add Bloomfield's texts as an additional corpus.

dwhieb commented 3 years ago

@katieschmirler informs me these are ready for import into Korp!

aarppe commented 3 months ago

@fbanados A Korp version of the Bloomfield texts can be found here: altlab/crk/generated/bloomfield_fst+cg+gloss.vrt

This is created with the following invocation:

cat corpora/bloomfield.korp-vrt |
  gawk '{ if(match($0, ".+~$")!=0) sub("~$",""); print; }' |
  bin/fst-cg-analyze-vrt.sh analyser-gt-strict.hfstol /Users/arppe/gt/lang-crk/src/cg3/disambiguator.cg3 analyser-gt-relaxed.hfstol /Users/arppe/gt/lang-crk/src/cg3/functions.cg3 generator-gt-strict.hfstol |
  bin/vrt2korp.sh > generated/bloomfield_fst+cg+gloss.vrt
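For anyone not fluent in awk: as far as can be read from the invocation above, the gawk step only removes a single trailing "~" from any line that has at least one other character before it, and passes every other line through unchanged. A minimal Python sketch of that one step follows. The function name strip_trailing_tilde and the sample lines are hypothetical, and the sketch does not cover the analysis scripts that come after it in the pipeline.

```python
def strip_trailing_tilde(line: str) -> str:
    """Mirror the gawk step: drop one trailing "~" if something precedes it.

    `line` is assumed to have no trailing newline, as with $0 in awk.
    """
    # The awk pattern ".+~$" needs at least one character before the final "~",
    # so a line consisting only of "~" is left untouched.
    if len(line) > 1 and line.endswith("~"):
        return line[:-1]
    return line


# Hypothetical input lines, only to exercise the function.
sample_lines = ["word~", "word", "~", "word~~"]
cleaned = [strip_trailing_tilde(line) for line in sample_lines]
print(cleaned)
```

Only one "~" is removed per line, matching the single sub() call in the original command.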

The result is structured largely like the Ahenakew-Wolfart (A-W) corpus, except that it has only three levels: <corpus>, <subcorpus> (2 values), and <text> (the tens of individual texts). The lang field is defined at the corpus level here, whereas in the A-W corpus it is defined for each text; this may need changing.
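To see at which level an attribute such as lang is actually defined, one could scan the structural tags of a VRT file. The following is a rough sketch, assuming each structural tag sits alone on its own line with double-quoted attribute values. The sample lines, the attribute names name and title, and the value "crk" are made up for illustration and are not taken from the real file.

```python
import re
from collections import defaultdict

# An opening structural tag on its own line, e.g. <text title="...">.
OPEN_TAG = re.compile(r'^<([A-Za-z_][\w-]*)((?:\s+[\w-]+="[^"]*")*)\s*>$')
ATTR = re.compile(r'([\w-]+)="([^"]*)"')


def attributes_by_element(lines):
    """Map each structural element name to the set of attribute names seen on it."""
    found = defaultdict(set)
    for line in lines:
        match = OPEN_TAG.match(line.strip())
        if match:  # closing tags and token lines do not match
            name, attrs = match.group(1), match.group(2)
            found[name].update(key for key, _ in ATTR.findall(attrs))
    return dict(found)


# Hypothetical three-level sample: corpus > subcorpus > text.
sample = [
    '<corpus lang="crk">',
    '<subcorpus name="example-a">',
    '<text title="Example text 1">',
    'token\tlemma',
    '</text>',
    '</subcorpus>',
    '</corpus>',
]

result = attributes_by_element(sample)
levels_with_lang = sorted(name for name, attrs in result.items() if "lang" in attrs)
print(levels_with_lang)
```

Running the same function over both corpora would show whether lang sits on <corpus> in one and on <text> in the other, which is the difference described above.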

Eventually I would probably want to make more use of the underlying XML sources (e.g. the word-specific as well as sentence-specific translations, which would add fields to the linguistic analyses), but incorporating this version could be a good start.

UAlbertaALTLab / korp-frontend

incorporate Bloomfield's texts #24