jamespstrachan opened this issue 5 years ago:
Text-Fabric (TF) can be run as a service that handles various queries; see the documentation here. Queries return structured data along with appropriate HTML formatting (see the attached image of an example server query).
It's worth discussing whether we should continue to store two separate copies of the texts (i.e. in the server AND in Github). In my opinion, we should keep the primary text in a single place, where corrections and versioning can be done. We have already established a repository that can fill this purpose; see here. If all corpus data were stored in the repository, it would then be exported to Text-Fabric, and the website would query the Text-Fabric kernel when a text is requested. Using the query language of TF, such a query might look like:
```
dialect dialect=Barwar
  text title~A\ Hundred\ Gold\ Coins
```
The value of this approach over SQL is that other, more sophisticated queries are made possible, assuming we have the information indexed. For instance, the following query is already possible with the current TF resource:
```
dialect dialect=Urmi_C
  sentence
    =: word text~^be-
```
This finds a word that starts with "be-" at the beginning of a sentence (the `=:` relation), in the Urmi_C dialect. Indentation indicates embedding. The ability to query complex linguistic embedding is the main strength of using the TF engine. This opens the door to linking linguistic patterns in the paradigms with instances in the text. The TF server can then serve up the pattern within any requested context ("within a line", "within a paragraph", "within a sentence") and display only that context. Queries are fed to the server as simple multi-line strings in Python.
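A minimal sketch of what that could look like from Python, assuming the corpus is published as a TF app (the app name "nena" and the loading details are assumptions; `A.search` and `A.show` are the standard TF calls):

```python
# Minimal sketch of running a TF query from Python.
# The app/dataset name "nena" is an assumption; use whatever identifier
# the published NENA Text-Fabric resource actually has.
from tf.app import use

A = use('nena', hoist=globals())  # load the corpus and its search API

# A search template is an ordinary multi-line string;
# indentation expresses embedding, as described above.
query = '''
dialect dialect=Urmi_C
  sentence
    =: word text~^be-
'''

results = A.search(query)  # list of tuples of matching node numbers
A.show(results, end=5)     # render the first five hits with HTML formatting
```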
One potential problem is that by storing all texts in the Github repo, one can no longer make instant corrections to the texts via the website edit interface. A few possible solutions come to mind:
One solution: the website serves the .nena formatted files straight from the repo (with some stylistic changes). Then, when a researcher edits a text, that change gets pushed to the Github text corpus repo, from which the TF dataset is built. In this way, the TF dataset would be released in stable versions rather than constantly updated like its source. Elsewhere in the website, for querying or for linking grammar patterns, we would rely on Text-Fabric.

This option introduces a bit more complexity, but it is more powerful. The primary difference from the status quo is that changes to the text corpus would be recorded and centrally available. Whether a researcher is accessing NENA from the website or from Text-Fabric, they would be using the same data.
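To make the "from which the TF dataset is built" step concrete, here is a rough sketch using Text-Fabric's walker converter (`tf.convert.walker`). The `parse_nena()` helper, the feature names, and the output location are hypothetical placeholders; only the `CV`/`cv.walk` machinery is the actual TF API:

```python
# Rough sketch: build a Text-Fabric dataset from the .nena files in the repo.
# parse_nena(), the feature names, and the paths are placeholders; a real
# converter would follow the actual .nena format.
from tf.fabric import Fabric
from tf.convert.walker import CV

def parse_nena(corpus_dir):
    """Hypothetical parser: yield (dialect, title, words) for each text."""
    yield 'Barwar', 'A Hundred Gold Coins', ['xa', 'yoma', '...']

def director(cv):
    for dialect, title, words in parse_nena('texts/'):
        d = cv.node('dialect')
        cv.feature(d, dialect=dialect)
        t = cv.node('text')
        cv.feature(t, title=title)
        for w in words:
            s = cv.slot()             # words are the atomic (slot) nodes
            cv.feature(s, text=w)
        cv.terminate(t)
        cv.terminate(d)

TF = Fabric(locations='tf/0.1')       # output location for this release
cv = CV(TF)
good = cv.walk(
    director,
    slotType='word',
    otext={
        'fmt:text-orig-full': '{text} ',
        'sectionTypes': 'dialect,text',
        'sectionFeatures': 'dialect,title',
    },
    generic={'corpus': 'NENA'},
    intFeatures=set(),
    featureMeta={
        'dialect': {'description': 'dialect a text belongs to'},
        'title': {'description': 'title of a text'},
        'text': {'description': 'surface form of a word'},
    },
)
```

Each run of such a build would produce a new, versioned TF release from whatever is currently in the repo.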
Lots of things to think through here. I'm open to any alternative ideas!
From @GeoffreyKhan:

You can judge better than me, Cody, as to what procedures would be the most suitable. From my point of view as a producer of the transcribed texts, I still prefer to start with Word with its various hotkeys, which could feed into publications.
@GeoffreyKhan I think the best solution for that side of things is indeed a Word template. The compromise would be to use a special template file with some restrictions on formatting. You would still be able to format the text as you're used to, but the restrictions would prevent unpredictable, exotic formatting. I will look into how to do this. In general, this is probably a separate (but related) issue from Text-Fabric integration.
But if we did use such a template, the upload pipeline could be like so:
```
MS Word template -> .nena text -> Text-Fabric resource
```
and the text-editing in the website could look like this:
```
[my edits in website] -> .nena text -> Text-Fabric resource
```
In short, the .nena format would be the central place where changes are made. MS Word templates would only be a medium for easy uploading of texts by you and other researchers.
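As a rough illustration of the first arrow (MS Word template -> .nena text), a converter could read the styled paragraphs out of the template with python-docx. The style names and the output markup below are placeholders, not the actual .nena specification:

```python
# Rough sketch: pull text out of a Word template for conversion to .nena.
# The style names ('NENA Title', 'NENA Line') and the output layout are
# hypothetical placeholders, not the actual .nena specification.
from docx import Document

def word_to_plain(path):
    doc = Document(path)
    lines = []
    for para in doc.paragraphs:
        style = para.style.name          # the template would constrain these
        text = para.text.strip()
        if not text:
            continue
        if style == 'NENA Title':
            lines.append(f'# {text}')    # placeholder markup
        else:
            lines.append(text)
    return '\n'.join(lines)

print(word_to_plain('upload.docx'))
```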
From @GeoffreyKhan: