CambridgeSemiticsLab / nena

The North Eastern Neo-Aramaic Database Site
https://nena.ames.cam.ac.uk

Consider how to include Textfabric searching in Nena corpus #41

Open jamespstrachan opened 5 years ago

jamespstrachan commented 5 years ago

From @GeoffreyKhan:

I had a helpful chat yesterday with Cody, who showed me some of the search facilities that are already possible with Text-Fabric. I am keen for a start to be made on the next step of the development of the database. As far as I can see, what we need is some kind of platform whereby the search and analytical facilities of Text-Fabric can be run across all the texts of the database, either as a whole or selectively. Text-Fabric would also have to be integrated in some way, as would a facility for converting the texts into the clean text format (i.e. text files with symbols but not MS Word coding) that Cody has produced.

codykingham commented 5 years ago

TF can be run as a service that handles various queries (see the Text-Fabric documentation). The queries return structured data along with appropriate HTML formatting (see the attached image of an example server query).

It's worth discussing whether we should continue to store two separate copies of the texts (i.e. on the server AND in GitHub). In my opinion, we should keep the primary text in a single place, where corrections and versioning can be done. We have already established a repository that can fill this purpose, see here. If all corpus data were stored in the repository, it would then be exported to Text-Fabric, and the website would query the Text-Fabric kernel when a text is requested. Using the query language of TF, such a query might look like:

dialect dialect=Barwar
    text title~A\ Hundred\ Gold\ Coins
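A template like the one above is fed to Text-Fabric from Python as a plain multi-line string. The sketch below is hedged: the corpus name in the comment is an assumption, so the actual `use`/`search` calls are shown only as comments, while the runnable part demonstrates the `~` regex operator, whose backslash-escaped spaces match literal spaces in the title:

```python
import re

# A TF search template is just a multi-line string:
template = r"""
dialect dialect=Barwar
    text title~A\ Hundred\ Gold\ Coins
"""

# With the tf package installed and a NENA corpus loaded, the template
# would be run roughly like this (corpus location is an assumption):
#
#   from tf.app import use
#   A = use("CambridgeSemiticsLab/nena_tf")
#   results = A.search(template)
#
# The ~ operator applies a regular expression; "\ " matches a literal space:
title_re = re.compile(r"A\ Hundred\ Gold\ Coins")
found = bool(title_re.search("A Hundred Gold Coins"))
```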

The value of this approach over SQL is that other, more sophisticated queries are made possible, assuming we have the information indexed. For instance, the following query is already possible with the current TF resource:

dialect dialect=Urmi_C
    sentence
        =: word text~^be-

This finds a word that starts with "be-", at the beginning of a sentence (=:), in the Urmi_C dialect. Indentation indicates embedding. The ability to query complex linguistic embedding is the main strength of using the TF engine. This opens the door for us to be able to link linguistic patterns in the paradigms with instances in the text. The TF server can then serve up the pattern within any requested context ("within a line", "within a paragraph", "within a sentence") and display only that context. Queries are fed to the server via simple multi-line strings in Python.
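To make the semantics of `=:` concrete, here is a toy illustration (plain Python, not Text-Fabric itself, and with invented example words) of what the Urmi_C query asks for: sentences whose first word matches the regex `^be-`:

```python
import re

# Invented toy data: each sentence is a list of words
sentences = [
    ["be-dána", "xa", "málka"],   # starts with a be- word -> hit
    ["xa", "be-dána"],            # has a be- word, but not first -> no hit
    ["be-ṱla", "yumá"],           # starts with a be- word -> hit
]

# The =: relation restricts the match to the FIRST word of the sentence,
# so only the first word of each sentence is tested against the regex:
first_word = re.compile(r"^be-")
hits = [s for s in sentences if first_word.match(s[0])]
```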

One potential problem is that by storing all texts in the GitHub repo, one can no longer make instant corrections to the texts via the website's edit interface. A few possible solutions come to mind:

The second option introduces a bit more complexity, but is more powerful. The primary difference from the status quo is that changes to the text corpus would be recorded and centrally available. Whether a researcher is accessing NENA from the website or from Text-Fabric, they are using the same data.

Lots of things to think through here. I'm open to any alternative ideas!

[Screenshot (2019-11-18): example Text-Fabric server query with HTML-formatted results]
GeoffreyKhan commented 5 years ago

You can judge better than me, Cody, as to what procedures would be most suitable. From my point of view as a producer of the transcribed texts, I still prefer to start with Word, with its various hotkeys, which could feed into publications.

codykingham commented 5 years ago

@GeoffreyKhan I think the best solution for that side of things is indeed a Word template. The compromise would be using a special template file that has some restrictions on formatting. You would still be able to format the text as you're used to, but the restrictions would prevent unpredictable, exotic formatting. I will look into how to do this. In general, this is probably a separate (but related) issue from Text-Fabric integration.

But if we did use such a template, the upload pipeline could be like so:

MS Word template -> .nena text -> Text-Fabric resource

and the text-editing in the website could look like this:

[my edits in website] -> .nena text -> Text-Fabric resource

In short, the .nena format would be the central place that changes are made. MS Word templates would only be a medium for easy uploading of texts by you and other researchers.
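The upload pipeline above could be sketched as follows. Everything here is hypothetical: the .nena format is not yet specified, `strip_word_markup` stands in for a real Word-template converter, and the header fields and example text are invented for illustration:

```python
def strip_word_markup(runs):
    """Hypothetical first stage: flatten Word-style formatted runs into
    clean text, keeping the characters but dropping the formatting.
    (A real converter would map italics etc. to .nena symbols.)"""
    # each run is a (text, formatting) pair; formatting is discarded here
    return "".join(text for text, _fmt in runs)

def to_nena(title, dialect, paragraphs):
    """Hypothetical second stage: emit a plain-text .nena file body,
    with a simple metadata header followed by the clean text."""
    header = f"title: {title}\ndialect: {dialect}\n\n"
    body = "\n\n".join(strip_word_markup(p) for p in paragraphs)
    return header + body

# Invented example: one paragraph of formatted runs from a Word template
doc = [
    [("xa-ga ", {}), ("ʔíθwa ", {"italic": True}), ("málka.", {})],
]
nena_text = to_nena("A Hundred Gold Coins", "Barwar", doc)
```

The .nena output would then be loaded into a Text-Fabric resource as the final stage, so both the website and TF users read from the same converted data.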