Closed balmas closed 4 years ago
@michaelgursky I'm wondering what you think about this
In the original approach, we had a workflow and data model that was something like the following:
TEI XML Document(s) -> Word Tokens -> Alignment Document -> Alignment activity (using Alignment Document as the base model)
I'm wondering instead about a workflow and data model that looks like:
TEI XML Document(s) -> Alignment Activity (using the TEI XML document as the base model) -> Word Tokens -> Alignment Document
We would still have to extract the word tokens from the TEI XML for the alignment data, and we would have to decide whether to extract them all upfront or if we wanted to just extract them at the point of selection.
I have no experience with TEI and my question might be a silly one. From what I've read so far, TEI seems to be a very powerful markup language that can be used to structure textual information in many different ways. I'm wondering if we can use TEI to also designate token borders and mark them up within an existing TEI document, as an addition to the existing TEI markup. That would allow us to keep token info along with the original texts and not worry about having either one or the other.
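(For illustration: TEI does have a `<w>` element for word-level tokens, so inline token markup of an existing line might look something like the following hypothetical fragment.)

```xml
<!-- hypothetical fragment: token boundaries recorded inline with TEI <w> elements -->
<l n="1">
  <w n="1">arma</w>
  <w n="2">virumque</w>
  <w n="3">cano</w>
</l>
```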
Tokenization can be a resource- and time-consuming process. I'm wondering if it makes sense to pre-tokenize the texts and store the token info along with the original texts in the same file. That would let us avoid tokenizing the same text fragment multiple times when it is accessed by different users over time: once tokenized, the fragment can be stored for future reuse, and we will not waste time tokenizing it again. And if token info can be stored along with the TEI markup (or maybe in parallel to it, in a different data structure within the same document), we can rewrite an original document with its tokenized version and lose no data in the process.
We can do it lazily. If someone accesses a text fragment that is not tokenized yet, we tokenize it and then update the original document by adding the tokenization info for that fragment. As we repeat this for other fragments accessed by users, we will have a document with a higher and higher degree of tokenization over time. This way we'll also guarantee that the most accessed parts are tokenized first. That seems like the most rational approach to me.
I'm not sure if all of this is feasible or even makes sense; I might be missing something. But at first glance, this seems like the most practical approach to tokenization, at least to me. What do you think?
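A minimal sketch of this lazy, tokenize-on-first-access idea (all names here are hypothetical, and the tokenizer is a trivial whitespace split just for illustration):

```javascript
// Hypothetical lazy tokenizer cache: a fragment is tokenized only on first
// access, then the result is stored and reused on later accesses.
class LazyTokenizer {
  constructor(tokenizeFn) {
    this.tokenizeFn = tokenizeFn; // e.g. text => text.split(/\s+/)
    this.cache = new Map();       // fragmentId -> token array
  }

  getTokens(fragmentId, fragmentText) {
    if (!this.cache.has(fragmentId)) {
      // First access: tokenize and remember the result.
      this.cache.set(fragmentId, this.tokenizeFn(fragmentText));
    }
    return this.cache.get(fragmentId);
  }
}

const tokenizer = new LazyTokenizer(text => text.trim().split(/\s+/));
const tokens = tokenizer.getTokens('1.1', 'arma virumque cano');
// A second call with the same fragment id reuses the cached array.
```

In a real implementation the cache would of course be persisted back into (or alongside) the source document rather than held in memory.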
E.g. I am thinking about a data model like the following:
Where the AlignedGroups are created dynamically based upon user selection at the time of alignment, rather than created from a document at the time the alignment is started.
An AlignedGroup would be first created when a user selects a Word to align in a text. The set of Words from a single text in an AlignedGroup would belong to a WordSet. A Word would have one or more Selectors which have the text as a Source.
When a user adds a Word to an AlignedGroup from a 2nd (or 3rd, etc.) text, the Words from the other WordSets in the AlignedGroup would get an isAlignedTo relationship to it.
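A rough sketch of that model in plain JavaScript (the names and shapes are illustrative only, not a final schema):

```javascript
// Illustrative sketch of the proposed data model: AlignedGroups are built
// dynamically as the user selects words at alignment time.
class Word {
  constructor(text, selectors) {
    this.text = text;
    this.selectors = selectors; // one or more Selectors with the text as a Source
    this.isAlignedTo = [];      // Words from other WordSets in the same AlignedGroup
  }
}

class WordSet {
  constructor(sourceTextId) {
    this.sourceTextId = sourceTextId; // all Words in a set come from a single text
    this.words = [];
  }
}

class AlignedGroup {
  constructor() {
    this.wordSets = new Map(); // sourceTextId -> WordSet
  }

  addWord(sourceTextId, word) {
    if (!this.wordSets.has(sourceTextId)) {
      this.wordSets.set(sourceTextId, new WordSet(sourceTextId));
    }
    // Cross-link the new word with words from the other texts in this group.
    for (const [id, set] of this.wordSets) {
      if (id !== sourceTextId) {
        for (const other of set.words) {
          other.isAlignedTo.push(word);
          word.isAlignedTo.push(other);
        }
      }
    }
    this.wordSets.get(sourceTextId).words.push(word);
  }
}
```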
In general, I think it will be simpler and more efficient to keep everything in TEI XML and convert it to a web-compatible format for display. In that case the TEI XML document would be a single source of truth. I think that makes management of the data much easier.
I see only two drawbacks of a solution like that so far: (1) Larger files. But probably gzip compression, HTTP/2, and progress in connection speeds (not sure how much the latter holds in real life) will mitigate this. (2) Greater resources required to convert from TEI XML to a display-friendly format on the fly (I'm referring to what CETEIcean is doing). But hopefully, with more powerful modern CPUs and cheaper memory (on both desktop and mobile), this will be a much smaller issue than (1).
> Tokenization can be a resource- and time-consuming process. I'm wondering if it makes sense to pre-tokenize the texts and store the token info along with the original texts in the same file. That would let us avoid tokenizing the same text fragment multiple times when it is accessed by different users over time: once tokenized, the fragment can be stored for future reuse, and we will not waste time tokenizing it again. And if token info can be stored along with the TEI markup (or maybe in parallel to it, in a different data structure within the same document), we can rewrite an original document with its tokenized version and lose no data in the process.
What I'm questioning is the premise that we even really want to pretokenize the texts before a user has made a selection from them for alignment. We know these things about tokenization:
- Whatever tokenization algorithm we use, the user will likely need to make edits to it after the fact, due in part to personal preference and in part to inconsistencies and errors in the digitization of the text (particularly if it was digitized via OCR).
- It can be a resource-intensive effort for large documents.
- While in theory we could pretokenize before alignment and store for reuse, in practice the likelihood that user A will use the exact same version of a text as user B is small.
So what I'm wondering is what the drawbacks are to doing a just-in-time tokenization as the user selects words for alignment.
The original version of the alignment editor was initially designed to operate on the results of automatically aligned texts. So sections of text and word tokens to be aligned were already pre-calculated before the alignment process started. This actually created a number of difficulties, because often the automatically identified "sentence" chunks were not quite right, or the words needed to be split or merged after they had already been aligned to other words.
However, the new requirements are expanded, and although we do still need to be able to ingest pre-aligned text, we also expressly need to support alignments where the boundaries of sentences and words are not predefined.
A couple of small questions:
> ... the AlignedGroups are created dynamically based upon user selection at the time of alignment ...
What is the minimal set of words that could be aligned? I was under the (probably wrong) impression that it is a sentence, or a clause. Could it be anything smaller than that?
The second one: I've read that TEI can always be converted into a tree. Maybe we can just convert it to an in-memory AST and then add our specific data to its leaves and branches as necessary?
> What is the minimal set of words that could be aligned? I was under the (probably wrong) impression that it is a sentence, or a clause. Could it be anything smaller than that?
Technically, the minimal set of words that can be aligned is 1 word in a source text and 1 word in a translation.
In practice, this is likely to be a logical grouping such as a stanza (e.g. of a poem), a chapter (e.g. of a book), and so on. It could be a sentence, but the use of "sentence" as the parent for a group of aligned words in the prior data model was misleading, because it is only in very literal translations that you can count on sentences being a 1-for-1 match (and some texts, of course, might not be expressed in complete sentences).
> So what I'm wondering is what the drawbacks are to doing a just-in-time tokenization as the user selects words for alignment.
The only drawback I see right now is that if it requires a significant amount of CPU time and memory, that will place too heavy a burden on the user's device and lead to bad UX. Other than that, I see only advantages.
> I've read that TEI can always be converted into a tree. Maybe we can just convert it to an in-memory AST and then add our specific data to its leaves and branches as necessary?
The main advantage, to me, of operating on the original TEI structure, such as with the CETEIcean model, would be that the alignment display of the text could closely mirror a reading display, because all of the structural information would be retained.
This is, at the same time, a potential drawback, however, because TEI markup varies greatly from text to text, corpus to corpus, etc. So we might have a very difficult time coming up with a general CSS stylesheet that works well in all cases, and it would mean that the Alpheios alignment display could differ, potentially quite substantially, from text to text.
> So what I'm wondering is what the drawbacks are to doing a just-in-time tokenization as the user selects words for alignment.

> The only drawback I see right now is that if it requires a significant amount of CPU time and memory, that will place too heavy a burden on the user's device and lead to bad UX. Other than that, I see only advantages.
It could be a burden if we had to tokenize the entire text in order to make a single selection. But that isn't what we do with the Alpheios Reading Tools right now. I.e. we identify each word only when it is selected.
On the other hand, trying to use the untransformed TEI XML document as the base of the alignment interface could make it difficult to properly highlight the aligned words. We'd probably have to insert nodes around each selected word and it could be problematic to recreate from a saved version -- the W3C selectors can be used for that but in my experience having selectors on many nodes in a document does not perform well.
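To make the just-in-time idea concrete, here is a minimal sketch of expanding a selection offset to the surrounding word (the word-character rules below are placeholders; real rules would have to be language-specific):

```javascript
// Sketch of just-in-time tokenization: given the text of a node and the
// character offset where the user clicked/selected, expand to the surrounding
// word instead of pre-tokenizing the whole document.
function wordAtOffset(text, offset) {
  // Placeholder rule: a word character is any non-space, non-punctuation char.
  const isWordChar = ch => /\S/.test(ch) && !/[.,;:!?·]/.test(ch);
  if (offset >= text.length || !isWordChar(text[offset])) return null;
  let start = offset;
  let end = offset;
  while (start > 0 && isWordChar(text[start - 1])) start--;
  while (end < text.length - 1 && isWordChar(text[end + 1])) end++;
  return { text: text.slice(start, end + 1), start, end: end + 1 };
}
```

In the browser, the inputs would come from `window.getSelection()` (the anchor node's text content and the anchor offset), and the returned `start`/`end` pair could feed a W3C text position selector.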
So, some information here about how I have handled TEI XML and tokenization previously. The V1 Alignment Editor did not itself have any capacity to tokenize or transform text, it expected the input to already adhere to the Alpheios XML Alignment schema, in which "sentences" and "words" were already identified.
For Perseids, we built a client application to sit in front of it, which offered the user various options for inputting or retrieving text. That was a Flask app, which used a set of Javascript libraries to manage the text retrieval workflow. The code for the Flask app is here https://github.com/perseids-project/perseids-client-apps and the Javascript packages are capitains-sparrow, capitains-sparrow.xslt, capitains-sparrow.service. This in turn relied on a Ruby service which handled the tokenization.
The workflow for TEI XML was essentially as follows:
Pulling the text content from a TEI XML document is tricky, because TEI is a loose standard and different editors make different choices about which tags to use. Oftentimes there are tags that contain editorial or structural information that isn't part of the actual text a user would want to align. In the Perseids implementation we used this code:
```javascript
/**
 * Get the text, removing nodes if necessary. If the instance has the text property set, returns it.
 *
 * @function
 * @memberOf CTS.text.Passage
 * @name getText
 *
 * @param {?Array.<string>} removedNodes List of nodes' tagname to remove
 * @param {?boolean} strip If true, strip the spaces in the text
 *
 * @returns {string} Instance text
 */
var _getText = function(removedNodes, strip) {
  var xml = this.document,
      text;
  if (this.text !== null) {
    text = (strip === true) ? trim(this.text) : this.text;
    return text;
  }
  if (typeof removedNodes === "undefined" || removedNodes === null) {
    removedNodes = ["note", "bibl", "head"];
  }
  removedNodes.forEach(function(nodeName) {
    var elements = xml.getElementsByTagNameNS("*", nodeName);
    while (elements[0]) elements[0].parentNode.removeChild(elements[0]);
  });
  text = (xml.getElementsByTagNameNS("*", "text")[0] || xml.getElementsByTagNameNS("*", "body")[0]).textContent;
  return (strip === true) ? trim(text) : text;
};
```
And we allowed the user to specify a list of TEI XML element names to exclude from the text content that was retrieved. The default was `teiHeader,head,speaker,note,ref`. This worked reasonably well, but sometimes extraneous content made its way in.
For tokenization, we used the LLT Service, and the main tokenization library is at https://github.com/perseids-project/llt-tokenizer/tree/master/lib/llt . It has regex-based rules that are specific to Latin and Greek, and various options for how enclitics and punctuation are handled.
I don't want to reuse any of the Perseids code as-is because it is out of date, unnecessarily complex and at least the Ruby code can't be built from scratch anymore.
However, it does work very well, so we might want to consider writing something new based upon it.
There are also other tokenizing libraries we could use (e.g. such as those in the CLTK library at https://github.com/cltk/cltk) but none of them are perfect.
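For a sense of scale, the core of a rule-based tokenizer can be quite small. Below is a greatly simplified sketch in the spirit of (but in no way reproducing) the LLT rules, with a deliberately naive `-que` enclitic split:

```javascript
// Greatly simplified sketch of a rule-based Latin tokenizer. Splits on
// whitespace, detaches leading/trailing punctuation as separate tokens, and
// optionally splits off the enclitic -que. Real rules (as in llt-tokenizer)
// need word lists and many more language-specific cases.
function tokenizeLatin(text, { splitEnclitics = true } = {}) {
  const tokens = [];
  for (const raw of text.split(/\s+/).filter(Boolean)) {
    const m = raw.match(/^([([«"']*)(.*?)([)\].,;:!?»"']*)$/);
    const [, lead, core, trail] = m;
    if (lead) tokens.push(...lead.split(''));
    if (splitEnclitics && /que$/i.test(core) && core.length > 3) {
      tokens.push(core.slice(0, -3), '-que'); // naive: "neque" etc. would be wrong
    } else if (core) {
      tokens.push(core);
    }
    if (trail) tokens.push(...trail.split(''));
  }
  return tokens;
}
```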
According to the discussion, I see the following workflow:
- it could be fully tokenized, as in V1.0 (described in a previous comment)
- we could split the text into words, assign each a number, and when a user selects a word, find its order and create a span with a token around it
But I think that creating tags as described in the first point is better because:
I think that we could create a small server-side application that would do the following:
And this result would be stored as a starting point of alignment steps
The question here is: do we want to somehow merge the TEI XML format and the Alpheios Alignment format? If we do, at what stage?
Would TEI XML with alignments be used anywhere?
@balmas , what do you think?
I think where we have ended up is this:
- The tokenization service is responsible for tokenizing TEI and supporting input parameters that allow the user to identify elements for line breaks, segments, and other metadata.
- The alignment editor is responsible for presenting the user with a choice to upload TEI and to select from the various parameters supported by the tokenizer.
- We will not try to retain a link between the tokens and the original TEI. We may revisit that in a future release.
For the initial release we can drop the requirement to store the original TEI but I will add it to a future wish list.
I'll close this design issue in favor of issues that implement the above plan.
This issue is to discuss the design for handling of TEI input for the source and translation text.
At the most basic level, we need to be able to extract the word tokens from the TEI XML in order to identify the targets for alignment.
In the 1.0 version of the tool, we didn't make any attempt to retain links between the original source text and the text being aligned. We extracted the tokens and left the source text behind. But we also didn't have a requirement to retain any display instructions such as line breaks as we do now.
One approach that might be worth considering with the new version is whether we want to retain TEI XML of the input text and use it to drive the display of the text in the alignment interface. The CETEIcean library (https://github.com/TEIC/CETEIcean) can be used to create a display directly from TEI XML using Web Components.
We would still have to extract the word tokens from the TEI XML for the alignment data, and we would have to decide whether to extract them all upfront or if we wanted to just extract them at the point of selection. If we extracted them at the point of selection, we could delegate to the browser selection code to identify the underlying word and not worry about tokenization ahead of time. This could have some advantages for performance on large documents. It would also allow us to more easily export an alignment as a W3C annotation on the original text. (See for example an annotation on one of the demo CETEIcean documents at https://gist.github.com/mdlincoln/cb8f92199200dac5cf502bcac5b1e828)
To keep our code consistent for plain text and TEI XML input, we could also consider transforming plain text on input into a very simple TEI XML document prior to starting alignment.
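As a sketch of that last idea (the structure below is just my guess at "very simple" and is not schema-valid TEI; a real document would need at least a fuller `fileDesc`):

```javascript
// Hypothetical sketch: wrap plain text in a minimal TEI-like XML document,
// one <p> per blank-line-separated paragraph, so plain text and TEI input
// can share a single downstream code path.
function plainTextToTei(text, title = 'Untitled') {
  const escape = s => s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
  const paragraphs = text.split(/\n\s*\n/)
    .filter(p => p.trim())
    .map(p => `      <p>${escape(p.trim())}</p>`)
    .join('\n');
  return [
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">',
    '  <teiHeader><fileDesc><titleStmt>',
    `    <title>${escape(title)}</title>`,
    '  </titleStmt></fileDesc></teiHeader>',
    '  <text>',
    '    <body>',
    paragraphs,
    '    </body>',
    '  </text>',
    '</TEI>'
  ].join('\n');
}
```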
Before we get too far with the prototype alignment interface, I'd like to discuss the pros and cons of this approach.