Not really an issue, but a realistic case from our practice that illustrates what we would like the text referencing API to cover.
In the Republic project (Resolutions of the Dutch States General from the 17th and 18th centuries) we process texts using the following workflow:
Retrieve the text line by line from the PageXML output of our Handwritten Text Recognition (HTR) software and store it per volume (roughly a 'yearbook') as a list of strings.
Represent all physical structure from the PageXML as web annotations that refer to the lines, using the list indexes of the line strings as coordinates, optionally combined with character offsets within those line strings. In addition to these text targets, Republic web annotations also have standard IIIF image targets.
Analyse the text automatically and add collection-specific derived structure and entities as semantic web annotations that refer to the same line-based coordinates. Again, these semantic web annotations also contain image targets. Example of a web annotation of type Resolution: https://annorepo.republic-caf.diginfra.org/w3c/republic-2024.01.19/3bf40e38-b8eb-40fa-9d59-5c4b58aa8063.
Concatenate text lines and 'fix' hyphenated words, thereby altering the text. The result is a list of paragraph strings.
This introduces an alternative way to refer to the text: by paragraph, using a paragraph list index plus optional character offsets.
For entity recognition (and other NLP) we need a tokenised version of the (paragraph) text. This introduces yet another way to refer to the text: token index (plus local character offset).
We explicitly keep track of the mappings between these three alternative coordinate systems (line, paragraph and token) for what are variations of the same text.
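To make the three coordinate systems concrete, here is a minimal Python sketch of the idea, not the actual Republic code. The names `LineSpan`, `join_lines`, `para_to_line` and `tokenise` are hypothetical, the sample lines are made up, every end-of-line hyphen is naively treated as a hyphenation to be removed, and tokenisation is plain whitespace splitting. It shows how a record kept per line is enough to map a paragraph or token offset back to a line index plus character offset.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LineSpan:
    """Characters of one line string that ended up in the paragraph."""
    line_index: int  # index into the list of line strings
    start: int       # character offset within the line (inclusive)
    end: int         # character offset within the line (exclusive)
    para_start: int  # offset in the paragraph where these characters begin


def join_lines(lines, first_line_index=0):
    """Concatenate lines into one paragraph, dropping end-of-line hyphens.

    Assumption: any trailing '-' on a non-final line is a hyphenation.
    Returns the paragraph string and the LineSpan records for mapping back.
    """
    parts, spans, pos = [], [], 0
    for i, line in enumerate(lines):
        is_last = i == len(lines) - 1
        if line.endswith("-") and not is_last:
            kept, sep = line[:-1], ""  # drop the hyphen, glue to next line
        else:
            kept, sep = line, ("" if is_last else " ")
        spans.append(LineSpan(first_line_index + i, 0, len(kept), pos))
        parts.append(kept + sep)
        pos += len(kept) + len(sep)
    return "".join(parts), spans


def para_to_line(spans, offset):
    """Map a paragraph character offset to (line_index, offset_in_line)."""
    for span in spans:
        length = span.end - span.start
        if span.para_start <= offset < span.para_start + length:
            return span.line_index, span.start + (offset - span.para_start)
    return None  # offset is an inserted space or out of range


def tokenise(paragraph):
    """Whitespace tokens as (text, start, end) paragraph offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\S+", paragraph)]


# Made-up sample: three lines starting at line index 100 of a volume.
lines = ["De Staten Gene-", "raal hebben be-", "sloten"]
paragraph, spans = join_lines(lines, first_line_index=100)
tokens = tokenise(paragraph)

# Token 2 ("Generaal") starts on line 100 and ends on line 101.
text, start, end = tokens[2]
first_char = para_to_line(spans, start)
last_char = para_to_line(spans, end - 1)
```

In this sketch the token index is the position in `tokens`, the paragraph coordinate is the `(start, end)` pair stored with each token, and `para_to_line` recovers the line-based coordinate. A token that was split by a hyphen maps to two different line indexes, which is why the start and the end have to be mapped separately.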
Text reference examples
URL for the text target of a semantic web annotation, returning the text of a specific Republic resolution. The text of this resolution runs across two consecutive scans.
https://textrepo.republic-caf.diginfra.org/api/view/versions/51d8c3b2-100f-4bbe-8c43-9b0fc2ef83d3/segments/index/45880/45907 (line-based coordinates)
Text of the same resolution after concatenation of the text lines, addressed by paragraph index plus character offsets. In this case the result is a list of just one paragraph.
URL of the text of a 'named entity annotation', in paragraph text.
The same named entity text, but now in the original line-based text with line-based coordinates.
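As an illustration of how such a line-based text reference could be built and taken apart, here is a small Python sketch. The path pattern is inferred only from the single line-based example URL above, not from the TextRepo documentation, and the helper names `segment_url` and `parse_segment_url` are hypothetical. Whether the end index is inclusive is not stated in this issue, so the sketch passes the numbers through unchanged.

```python
from urllib.parse import urlsplit

# Base taken from the example URL above; the pattern is an assumption.
BASE = "https://textrepo.republic-caf.diginfra.org/api/view/versions"


def segment_url(version_id, start_index, end_index):
    """Build a URL addressing a range of line indexes in a text version."""
    return f"{BASE}/{version_id}/segments/index/{start_index}/{end_index}"


def parse_segment_url(url):
    """Extract (version_id, start_index, end_index) from such a URL."""
    parts = urlsplit(url).path.strip("/").split("/")
    # Expected: api, view, versions, <id>, segments, index, <start>, <end>
    if len(parts) != 8 or parts[4:6] != ["segments", "index"]:
        raise ValueError(f"unexpected URL layout: {url}")
    return parts[3], int(parts[6]), int(parts[7])


url = segment_url("51d8c3b2-100f-4bbe-8c43-9b0fc2ef83d3", 45880, 45907)
version_id, start_index, end_index = parse_segment_url(url)
```

A paragraph-based or token-based reference would need the same three ingredients (which version of the text, which coordinate system, and a start and end position), which is the kind of variation we would like the text referencing API to cover.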