Integrating text corpus with upload tool

codykingham commented 3 years ago

@jamespstrachan has done a lot of work in https://github.com/CambridgeSemiticsLab/nena to build a text upload tool that uses the standards outlined in this repo. That tool makes possible the following pipeline:

upload tool -> NENA format parser (here) -> nena.json files -> nena.tf files -> text searcher tool

However this pipeline has not yet been fully integrated. Namely, the code used for parsing the NENA markdown stored here under /parsing needs to be properly modularized and delivered to James for inclusion into the website backend so that we get a good stream of processed JSON files. Secondly, we need to work out a reliable way to push new texts to this repo from the uploading process. Should that be automatic? Should it go through an editorial process? Who decides when a version gets pushed to Github and how?

jamespstrachan commented 3 years ago

Currently, while the corpus transcriptions and translations are stored in nena format (ex-headers) in db there's actually not yet a way to view a compiled .nena file! This is easy enough to do but I suppose we should wait to more fully spec the pipeline before we decide on this presentation.

From my perspective, it seems sensible to keep any notion of triggering/submission/approval within the corpus itself. We already have a user system with permissions and as it's where the texts are first entered it seems a natural point to kick off the TF work.

It seems likely that the triggering (let's leave approval or versioning for a future time) will be manual once the transcriber is happy with a change they have made to the text. This would give us an opportunity to trigger some script on the server to run. My first guess is that this should be a push of the raw file content to some api of the parser. I'm not sure quite what the best way of doing this is as I'm working under the assumption that your parser won't be web facing and I've no experience at making co-resident apps ping each other on the same machine.

Off the top of my head, the APIs I'd need would include:

"Take this .nena file" (probably doesn't do the work synchronously but returns some token we can hold to know is has been received and scheduled for parsing) (perhaps that's just text_id and we tell you)
"Take this updated .nena file", same as above but we also pass the old token in so you know it's a new version of an existing text
"Search for this term: [TF format query string], returns some big exciting json packet for website to parse and render, and maybe a "still processing" message if there are unprocessed submissions to the index.

What others do you reckon we need?

dirkroorda commented 3 years ago

@jamespstrachan @codykingham

I am curious for the picture in which I can see the pipeline that Cody sketched above, and the pipeline that must exist between an incoming new text and its representation on the website.

Particularly interesting is whether the flow from new text to web inserts some new information (such as numbers) after the point where the tf pipeline by Cody has picked up the new input. Because that info will be nowhere in TF. Conversely, the TF pipeline may generate stuff that the website knows nothing about.

We must make sure that the numbering/identifiaction of texts and lines occurs early enough, so that both the TF pipeline and the website can pick it up.

dirkroorda commented 3 years ago

For the search interface to be updated, this has to be done:

Generate updated TF data, put it in a new version, in nena_tf/tf/x.y.z. Then commit nena_tf. The new version must be passed to a script in annotation/app-nena, and it will make json data out of that corpus and configure the search interface. Then the script commits app-nena, and publishes its site directory to GitHub Pages, and then the search interface is available at https://annotation.github.io/app-nena/.

dirkroorda commented 3 years ago

@jamespstrachan We do not have a plan where the website passes a TF-Query to a TF-kernel which then delivers the results which then must be shown on the website. Instead we have a loose connection, between the search tool and the website. The connection is via hyperlinks and the fact that both website and search interface are based on the same data.

codykingham commented 3 years ago

A few notes to clear some things up for both:

@dirkroorda, I believe last time @jamespstrachan and I had a full discussion about this, you and I were looking at integrating TF Query engine as an online app. So that is the cause of the misunderstanding here with the querying bit.

Another thing James and I discussed was what the ideal pipeline should look like. We agreed that a public version of the data would be integral; but that repo would be read-only in order to preserve data integrity as it comes downstream.

On the technical side of linking updates to the text corpus with the TF resources:

We have 3 file formats to keep track of. We can look at these formats as serving to connect our various tasks here:

file.nena - (James' end) a plain text markup format that could be used for uploading new material, sometimes aligned with audio. The markup is checked by James' upload tool as researchers submit new text / updates to texts.
file.json - (my end), structured data from the file.nena that breaks the markup texts into lists of recognized letters, markup tags, and "spans" (for marking speakers, timestamps, and publication line numbers). This serves 2 purposes: 1. archiving the data in a standard file format, 2. providing a frictionless source for Dirk's Text-Fabric API.
file.tf - (my / Dirk's end) all of the data from file.json + some linguistic enhancements (e.g. sentence recognition). The .tf files are needed to run the search tool that Dirk has built. They also provide interested NENA researchers with a way to interact with the text corpus in a Python environment

In the end, the purpose of all of this is 1. ensure data is being input in a consistent way, 2. make data accessible to world and researchers, and 3. make data searchable for @GeoffreyKhan and team.

codykingham commented 3 years ago

On James' API needs.

I agree with letting the finalized texts be approved by the person doing the editing / uploading. A kind of "publish" button makes sense here.

It's true that my code won't be web-facing; it will just be a set of .py files that expect a string of text in .nena format. The output is JSON files. The JSON files feed the TF parser. I expect that somehow that needs to happen in the back-end, and that the results should then be pushed immediately to Github. @jamespstrachan I'm curious how this fits in from a technical standpoint, or if we need to discuss another way.

codykingham commented 3 years ago

I'm thinking a bit more about how to simplify this and prevent breakage by eliminating unnecessary dependencies.

One potentially unnecessary part is the .json files. If the .nena files are being archived online, alongside the standards to interpret them with, then there may be no need for these files.

@jamespstrachan Instead of running a bunch of back-end Python on the webserver, what if the server simply pushed the published/complete .nena files to this repository. Then the person managing the TF side of things can independently update the search tool / TF data files periodically or on a regular schedule,

GeoffreyKhan commented 3 years ago

Please note that I've already uploaded some Christian Urmi texts from my Word files, and these have been sound-aligned. Would it make sense to upload the remaining Christian Urmi texts and all the C. Barwar texts directly from Cody's clean files, correct? I would then, however, have to paste in the translations from my Word files. Or should I continue to paste the Christian Urmi and Barwar texts in from my Word files?

jamespstrachan commented 3 years ago

@GeoffreyKhan this specific use case is probably not relevant to this ticket. Cody and I are discussing over here in the website repo: CambridgeSemiticsLab/nena#56 if you'd like to contribute.

CambridgeSemiticsLab / nena_corpus

Integrating text corpus with upload tool #9