codykingham opened this issue 3 years ago
I need a diagram. Where is the ur-data? How does that go into the website? What data is used to generate the TF data? How do the updates fit in?
Is there a non-web-facing program somewhere that processes updates and offers data to the website?
The ur-data for the current TF corpus can be found under https://github.com/CambridgeSemiticsLab/nena_corpus/tree/master/sources/msdoc2html/dialects

It is a set of MS Word documents converted to HTML, and then to the `.nena` markup format. The issue here is that some of these texts were inserted into the website separately, long ago, and without any kind of strict formatting: some of the texts were dirtied with HTML tags, and some had other formatting issues such as defunct character codes. I believe James has cleaned up much of this.
Are all of the texts in TF also in the database? I don't know yet. If not, they should be added, probably via James's upload tool, and probably by the researchers, not by us.
That leaves the next step: at some point the TF data should become a fully mirrored version of everything in the database. For that to happen, we need to connect the parts of the pipeline for the first time, run it, fix any issues, and publish the first version that reflects the server-side data.
@jamespstrachan
Apart from the fundamental things: the best concrete step now would be a feature on text nodes, say `text_id`, that provides the number that appears in the url of the audio page of that text. I mean the `93` in https://nena-staging.ames.cam.ac.uk/audio/93/.

Then I can generate links from search results back to audio pages. Even better: I can generate https://nena-staging.ames.cam.ac.uk/audio/93/#8 if the result is on line 8, assuming that what I know in TF as the line number (`line_number`) agrees with what the audio page shows as the line number.
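For concreteness, a minimal sketch of the link generation I have in mind, assuming the feature ends up being called `text_id` (the data location and the helper name here are hypothetical):

```python
from tf.fabric import Fabric

# hypothetical local location of the generated TF files
TF = Fabric(locations="~/nena/tf")
api = TF.load("text_id line_number")
F, L = api.F, api.L

BASE = "https://nena-staging.ames.cam.ac.uk/audio"

def audio_url(line_node):
    """Return the audio-page url for a line node in a search result."""
    # walk up from the line node to its enclosing text node
    text_node = L.u(line_node, otype="text")[0]
    return f"{BASE}/{F.text_id.v(text_node)}/#{F.line_number.v(line_node)}"
```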
It's my understanding that the `text_id` and the `line_number` are what you need to compile a return url. I imagine we'd want to provide you with a return url template string, e.g. "https://nena-whatever.ames.cam.ac.uk/audio/{{text_id}}/#{{line_number}}", rather than Text-Fabric having any knowledge of our instance. Perhaps you can suggest a spec for passing this in to TF, and I'll see how easily I can thread it from my site code through Cody's pipeline script?
Yes. The `text_id` should be present as a TF feature; `line_number` is already present.
The url template ends up in a config file, and can easily be updated.
Here you see the current config. That will then become something like:

```
webBase: https://nena-whatever.ames.cam.ac.uk/audio
webHint: Show this line on its audio page
webUrl: '{webBase}/<2>/#<3>'
```

Here `<2>` is the section heading for a level 2 section (a text) and `<3>` is the heading for a level 3 section (a line).
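To illustrate how the placeholders would be filled (a sketch only; the actual substitution happens inside Text-Fabric's app machinery, and the config dict here is a hypothetical in-memory view of the file above):

```python
import re

# hypothetical in-memory view of the config shown above
config = {
    "webBase": "https://nena-whatever.ames.cam.ac.uk/audio",
    "webUrl": "{webBase}/<2>/#<3>",
}

def expand(api, node):
    # T.sectionFromNode yields the section headings of a node,
    # e.g. ('Barwar', '93', 8): dialect, text_id, line_number
    headings = api.T.sectionFromNode(node)
    url = config["webUrl"].format(webBase=config["webBase"])
    # fill <1>, <2>, <3> with the heading of the corresponding level
    return re.sub(r"<(\d)>", lambda m: str(headings[int(m.group(1)) - 1]), url)
```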
We also need to make sure that the configuration of sections in `otext.tf` is updated.
Instead of the line
```
@sectionFeatures=dialect,title,line_number
```
we should specify
```
@sectionFeatures=dialect,text_id,line_number
```
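For orientation, the relevant part of the `otext.tf` header would then read something like this, assuming the section types are dialect, text, and line:

```
@config
@sectionTypes=dialect,text,line
@sectionFeatures=dialect,text_id,line_number
```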
I have added the new config and modified Cody's pipeline to inject an instance-specific `webBase`. However, when I apply the change you suggested to the config file in which that line appears, it triggers a KeyError in TF while building the documentation, just after this point in the output:
```
Loading TF data and building documentation...
This is Text-Fabric 8.5.13
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html
26 features found and 0 ignored
0.00s loading features ...
| 0.02s T otype from /usr/src/app/media/nenapipelinefiles/tf
| 0.03s T oslots from /usr/src/app/media/nenapipelinefiles/tf
| 0.01s Dataset without structure sections in otext:no structure functions in the T-API
| 0.02s T full from /usr/src/app/media/nenapipelinefiles/tf
| 0.02s T lite from /usr/src/app/media/nenapipelinefiles/tf
| 0.02s T text_id from /usr/src/app/media/nenapipelinefiles/tf
| 0.02s T lite_end from /usr/src/app/media/nenapipelinefiles/tf
| 0.02s T dialect from /usr/src/app/media/nenapipelinefiles/tf
| 0.02s T line_number from /usr/src/app/media/nenapipelinefiles/tf
| 0.02s T full_end from /usr/src/app/media/nenapipelinefiles/tf
| 0.03s T text from /usr/src/app/media/nenapipelinefiles/tf
| 0.02s T text_end from /usr/src/app/media/nenapipelinefiles/tf
| 0.02s T fuzzy_end from /usr/src/app/media/nenapipelinefiles/tf
| 0.03s T fuzzy from /usr/src/app/media/nenapipelinefiles/tf
| | 0.01s C __levels__ from otype, oslots, otext
| | 0.05s C __order__ from otype, oslots, __levels__
| | 0.01s C __rank__ from otype, __order__
| | 0.09s C __levUp__ from otype, oslots, __rank__
| | 0.01s C __levDown__ from otype, __levUp__, __rank__
| | 0.02s C __boundary__ from otype, oslots, __rank__
```
The error:

```
KeyError: 2521
/usr/local/lib/python3.8/site-packages/tf/core/prepare.py, line 531
```
Does this make any sense to you? (Apologies in advance if I'm asking questions more related to Cody's code than yours; I'm not exactly sure where the boundary is.)
Yes, there is a problem in building the section data. From what I see, Text-Fabric is loading the newly generated dataset and precomputing data based on the information in `otext.tf`.
If you can point out to me what code is running, I could dive in further.
It could be something very simple, though.
The feature `text_id` will have been given some metadata that says whether its values are integers or strings. I advise strings, because identifiers are likely to contain non-digits. I see in the config file that it is indeed declared as strings.
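For reference, the header of the generated `text_id.tf` would then look something like this (the description line is illustrative):

```
@node
@valueType=str
@description=database id of a text, as used in audio-page urls
```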
But the point is: the code that generates the `text_id` values should generate values of the declared type. From the error, I see that an integer value is encountered. I suspect `text_id` is not written so as to deliver string values.
It seems that the pipeline code that generates the `text_id` feature is not yet in GitHub. I suspect that the only thing you need to do is to wrap a `str()` around the expression that delivers the `text_id` value.
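Concretely, something like the following (a sketch: the node numbers, the mapping, and the location are all hypothetical; only the `str()` wrapping and the matching `valueType` declaration are the point):

```python
from tf.fabric import Fabric

TF = Fabric(locations="~/nena/tf")  # hypothetical output location

# hypothetical mapping of TF text nodes to their database ids
text_db_ids = {1000001: 93}

nodeFeatures = {
    # wrap str() around each id so the values match valueType 'str'
    "text_id": {node: str(db_id) for node, db_id in text_db_ids.items()},
}

metaData = {
    "text_id": {
        "valueType": "str",  # must agree with the values written above
        "description": "database id of the text, used in audio-page urls",
    },
}

TF.save(nodeFeatures=nodeFeatures, metaData=metaData)
```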
From @dirkroorda, some items that should be addressed alongside this:

Integration is needed between the texts in the text corpus and the texts in the website. There is currently no active connection, so we cannot assume that the texts on the website are the same as those in the corpus. In fact, a lot of normalizing has taken place on this side of things.
All of this depends on resolving the issue I've just filed over at the NENA Corpus repo: https://github.com/CambridgeSemiticsLab/nena_corpus/issues/9