CambridgeSemiticsLab / nena_tf

NENA Corpus with Linguistic Annotations in Text-Fabric Format
https://nena.ames.cam.ac.uk
MIT License

Linking text corpus with nena website #23

Open codykingham opened 3 years ago

codykingham commented 3 years ago

From @dirkroorda:

We need:

- a feature on texts that contains the number used to identify it on the website
- a feature on lines that contains the number used to identify it on the website

Some items that should be addressed alongside this:

Integration is needed between the texts in the text corpus and the texts on the website. There is currently no active connection between them, so we cannot assume that the texts on the website are the same as ours. In fact, a lot of normalizing has taken place on this side of things.

All of this depends on resolving the issue I've just filed over at the NENA Corpus repo: https://github.com/CambridgeSemiticsLab/nena_corpus/issues/9

dirkroorda commented 3 years ago

I need a diagram. Where is the ur-data? How does that go into the website? What data is used to generate the TF data? How do the updates fit in?

Is there a non-web-facing program somewhere that processes updates and offers data to the website?

codykingham commented 3 years ago

The Ur-data for the current TF corpus can be found under https://github.com/CambridgeSemiticsLab/nena_corpus/tree/master/sources/msdoc2html/dialects

It is a set of MS Word doc files converted to HTML, and then to the .nena markup format. The complication is that some of these texts were inserted into the website separately, long ago, and those insertions were made without any strict formatting: some of the texts were dirtied with HTML tags, and others had formatting issues such as defunct character codes. I believe James has cleaned up much of this.

Are all of the texts in TF in the database? I don't know yet. If not, they should be added, probably via James's upload tool, and probably by the researchers rather than us.

That leaves the next step: at some point the TF data should become a fully mirrored version of everything in the database. For that to happen, we need to connect the parts of the pipeline for the first time, run it, fix any problems, and publish the first version that reflects the server-side data.

dirkroorda commented 3 years ago

@jamespstrachan Apart from the fundamental things: the best concrete step now would be a feature on text nodes, say `text_id`, that provides the number that appears in the URL of the audio page of that text. I mean the 93 in https://nena-staging.ames.cam.ac.uk/audio/93/. Then I can generate links from search results back to audio pages. Even better: I can generate https://nena-staging.ames.cam.ac.uk/audio/93/#8 if the result is on line 8, assuming that what I know in TF as the line number (`line_number`) agrees with what the audio page shows as the line number.
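
For illustration, a minimal sketch of how such links could be generated on the TF side. It assumes the corpus app is loadable as "nena", that text nodes carry the proposed `text_id` feature next to the existing `line_number` feature on line nodes, and that the node type names and URL template are as suggested by the URLs above; none of this is the final setup.

```python
# A minimal sketch, not the final pipeline.
from tf.app import use

A = use("nena", hoist=globals())  # exposes F (features) and L (locality); app name assumed

# Hypothetical template mirroring the staging URLs mentioned above.
WEB_URL = "https://nena-staging.ames.cam.ac.uk/audio/{text_id}/#{line_number}"

def audio_link(line_node):
    """Return the audio-page URL for a line node (node type 'line' assumed)."""
    text_node = L.u(line_node, otype="text")[0]   # enclosing text node (node type assumed)
    return WEB_URL.format(
        text_id=F.text_id.v(text_node),           # proposed feature
        line_number=F.line_number.v(line_node),   # existing feature
    )
```

With something like this in place, search results can be turned into clickable audio-page links without TF needing any further knowledge of the website instance.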

jamespstrachan commented 3 years ago

It's my understanding that the `text_id` and the `line_number` are what you need to compile a return URL. I imagine we'd want to provide you with a return URL template string, e.g. "https://nena-whatever.ames.cam.ac.uk/audio/{{text_id}}/#{{line_number}}", rather than Text-Fabric having any knowledge of our instance.

Perhaps you can suggest a spec for passing this into TF, and I'll see how easily I can thread it from my site code through Cody's pipeline script?

dirkroorda commented 3 years ago

Yes. The `text_id` should be present as a TF feature; `line_number` is already present. The URL template ends up in a config file and can easily be updated. Here you see the current config

dirkroorda commented 3 years ago

That will then become something like

```yaml
webBase: https://nena-whatever.ames.cam.ac.uk/audio
webHint: Show this line on its audio page
webUrl: '{webBase}/<2>/#<3>'
```

Here `<2>` is the section heading for a level 2 section (a text) and `<3>` for a level 3 section (a line).

We also need to make sure that the configuration of sections in `otext.tf` is updated. Instead of the line

```
@sectionFeatures=dialect,title,line_number
```

we should specify

```
@sectionFeatures=dialect,text_id,line_number
```

jamespstrachan commented 3 years ago

I have added the new config and modified Cody's pipeline to inject an instance-specific `webBase`. However, when I apply the change you suggested to the config file where it appears, it triggers a KeyError in TF while building documentation, just after this point in the output:

```
Loading TF data and building documentation...
This is Text-Fabric 8.5.13
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

26 features found and 0 ignored
  0.00s loading features ...
   |     0.02s T otype                from /usr/src/app/media/nenapipelinefiles/tf
   |     0.03s T oslots               from /usr/src/app/media/nenapipelinefiles/tf
   |     0.01s Dataset without structure sections in otext:no structure functions in the T-API
   |     0.02s T full                 from /usr/src/app/media/nenapipelinefiles/tf
   |     0.02s T lite                 from /usr/src/app/media/nenapipelinefiles/tf
   |     0.02s T text_id              from /usr/src/app/media/nenapipelinefiles/tf
   |     0.02s T lite_end             from /usr/src/app/media/nenapipelinefiles/tf
   |     0.02s T dialect              from /usr/src/app/media/nenapipelinefiles/tf
   |     0.02s T line_number          from /usr/src/app/media/nenapipelinefiles/tf
   |     0.02s T full_end             from /usr/src/app/media/nenapipelinefiles/tf
   |     0.03s T text                 from /usr/src/app/media/nenapipelinefiles/tf
   |     0.02s T text_end             from /usr/src/app/media/nenapipelinefiles/tf
   |     0.02s T fuzzy_end            from /usr/src/app/media/nenapipelinefiles/tf
   |     0.03s T fuzzy                from /usr/src/app/media/nenapipelinefiles/tf
   |      |     0.01s C __levels__           from otype, oslots, otext
   |      |     0.05s C __order__            from otype, oslots, __levels__
   |      |     0.01s C __rank__             from otype, __order__
   |      |     0.09s C __levUp__            from otype, oslots, __rank__
   |      |     0.01s C __levDown__          from otype, __levUp__, __rank__
   |      |     0.02s C __boundary__         from otype, oslots, __rank__
```

The error:

```
KeyError
2521
/usr/local/lib/python3.8/site-packages/tf/core/prepare.py, line 531,
```

Does this make any sense to you? (Apologies in advance if I'm asking questions more related to Cody's code than yours; I'm not exactly sure where the boundary is.)

dirkroorda commented 3 years ago

Yes, there is a problem in building the section data. From what I see, Text-Fabric is loading the newly generated dataset, and precomputing data, based on the information in otext.tf.

If you can point out to me what code is running, I could dive in further.

It could be something very simple, though. The feature `text_id` will have been given some metadata that says whether its values are integers or strings. I advise strings, because identifiers are likely to contain non-digits. I see in the config file that it is indeed declared as strings. But the point is: the code that generates the `text_id` values should produce values of the declared type.

From the error, I see that an integer value is encountered. I suspect `text_id` is not written so as to deliver string values.

It seems that the pipeline code that generates the `text_id` feature is not yet on GitHub. I suspect that the only thing you need to do is wrap a `str()` around the expression that delivers the `text_id` value.
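
For illustration, a minimal sketch of the kind of change meant here, assuming the pipeline collects feature values in node-to-value dictionaries and saves them with `tf.fabric.Fabric`; the node numbers, variable names, and output location below are made up.

```python
# A minimal sketch of the suggested fix: make the written values match the
# declared valueType of the feature. The mapping is illustrative data only;
# the real pipeline would derive the website identifiers from the database.
from tf.fabric import Fabric

text_ids_by_node = {100001: 93, 100002: 94}        # hypothetical text nodes -> website ids

text_id_feature = {
    node: str(text_id)                             # wrap in str() so the saved values
    for node, text_id in text_ids_by_node.items()  # match the declared valueType
}

metaData = {
    "text_id": {
        "valueType": "str",                        # must agree with the values written above
        "description": "website identifier of the text, as used in the audio URLs",
    },
}

TF = Fabric(locations="output/tf")                 # hypothetical output location
TF.save(nodeFeatures={"text_id": text_id_feature}, metaData=metaData)
```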