Open loafofpiecrust opened 3 years ago
From a sociolinguistic standpoint, this is a heavy topic. Because we are working with Standard language forms and intend this to be an educational resource, I think we are on thin ice. That said, the finest distinction I believe we could make reliably is a broad one between North Carolina (NC) and Oklahoma (OK), and that is as far as we ought to go given our data and positioning.
After discussion, we agreed to add, at minimum, an optional textual description of the location where the document's forms were recorded. This will help us distinguish between documents from different speech communities without necessarily prescribing a dialect or dialect group.
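To make the proposal concrete, here is a minimal sketch of how an optional location description might sit in document metadata. Everything in it is an assumption for illustration: the `DocumentMetadata` class, the `recorded_location` field, and the `group_by_recorded_location` helper are hypothetical names, not our actual schema, and the sample values are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentMetadata:
    """Hypothetical metadata record; field names are illustrative only."""
    title: str
    # Optional free-text description of where the document's forms were
    # recorded. It describes a place and makes no claim about dialect.
    recorded_location: Optional[str] = None


def group_by_recorded_location(docs):
    """Group document titles by their location text, without labeling dialect."""
    groups = defaultdict(list)
    for doc in docs:
        # Missing or blank descriptions fall into one explicit bucket.
        key = (doc.recorded_location or "").strip() or "(unspecified)"
        groups[key].append(doc.title)
    return dict(groups)


# Placeholder data, not real documents.
docs = [
    DocumentMetadata("Document A", "Placeholder community 1"),
    DocumentMetadata("Document B", "Placeholder community 2"),
    DocumentMetadata("Document C"),
    DocumentMetadata("Document D", "Placeholder community 1"),
]
groups = group_by_recorded_location(docs)
```

Keeping the field as optional free text means documents with an unknown origin stay valid, and nothing here forces a dialect label onto a document.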
For broad categorization purposes, @jgbourns has suggested indicating the apparent dialect of a particular document based on who the author(s) are and where they come from. This is a more basic treatment than we ultimately hope to provide, but I want to gather thoughts on it now. What can we use such information for? Does including it in document metadata bring us any closer to a more local understanding of the authors and the spaces they occupied, one more specific than simply "Oklahoma" or "North Carolina"?