Closed benjamingeer closed 7 years ago
Work in progress is on the `wip/standoff` branch.
I've been working on this question: if you make a diplomatic transcription of a manuscript, and then a critical text based on it, what happens when there's a new version of the transcription?
Our original idea was to avoid all redundancy in text storage, by having the transcription and the critical text contain standoff referring to the same Unicode string. However, I now think this isn't a good idea, for two reasons:
Therefore I think it's better if the editorial text has its own Unicode string. To keep the transcription's Unicode string out of full-text searches, we can make it the object of a new property, `valueHasNonIndexedString`, which would be used instead of `valueHasString` when the string should not be indexed for full-text search.
The next question is how to maintain the connection between the editorial text and the transcription. The user should be able to click on parts of the editorial text, and be taken to the corresponding parts of the transcription.
I think this can be done with diffs. If you diff the editorial text against the transcription, you get a list of substrings, each of which is marked to indicate whether it's the same in both texts, was deleted from the editorial text, or was added to the editorial text. Each of these substrings can easily be converted into a pair of standoff ranges, one referring to the transcription and one referring to the editorial text. With this, given any position in the editorial text, you can find the corresponding position in the transcription, because there are only two possibilities:
Now, what happens when someone makes a new version of the transcription? We have three versions to consider: transcription version 1, transcription version 2, and editorial version 1. Here's an example, using XML tags to represent standoff ranges.
Transcription version 1:
<para>I took the bus <strike>today </strike>because I was was in a hurry.</para>
Editorial version 1 keeps the `<para>`, deletes `<strike>today</strike>`, and corrects the repeated word `was`:
<para>I took the bus because I was in a hurry.</para>
Now the transcriber corrects the transcription by changing `bus` to `train`, correcting the repeated word (which, it turns out, isn't repeated in the manuscript), and noting a change in ink colour, to produce transcription version 2:
<para>I took the train <strike>today </strike>because I was in a <blue>hurry</blue>.</para>
Now the editor wants to merge the transcriber's changes into the editorial version, to make editorial version 2. This is similar to the three-way merge performed by git and other software version control systems. But git automatically merges changes that it sees as non-conflicting. We can't do that, because any change in a transcription could potentially change the interpretation of everything else. After seeing some particular change introduced in the new version of the transcription, the editor may need to reconsider some of his previous edits. So here's what should happen:
`was` is no longer a difference, since the editor's and transcriber's versions now agree on that.)
`train` and `<blue>` as new changes, and give the editor the option of keeping all the other previous edits from editorial version 1 (namely the deletion of `today`).
The editor chooses to accept `train` and `<blue>` and keep his previous edits. The result is editorial version 2:
<para>I took the train because I was in a <blue>hurry</blue>.</para>
There are two obstacles to implementing this:
`diff3`, because it's line-based.) It would be annoying for the user to have to check each difference to figure out which ones are new.
`<blue>` in the example below), we can't write an algorithm to determine the position in the editorial text where that new standoff should be merged. Sometimes we could make a reasonable suggestion, but often (especially if text has been moved) we couldn't. So the user has to find the correct position manually.
Instead of solving these two problems algorithmically, I suggest we deal with them by providing a user interface that makes it easier for the user to do the work manually, using two windows (or panes): a read-only window showing the transcription, and an editable window showing the editorial text. In the transcription window, we can show what changed in the transcription, both in the Unicode string and in the standoff. To show the changes in the Unicode string, we just need to diff it with the previous version. To show the changes in the standoff, we give each standoff range a unique ID (e.g. a UUID) when it's created. Then we can easily identify which ones were added and which ones were removed in the transcription. For example, we can tell the user that `<blue>` was added to the transcription. Then if the user wants to merge `<blue>` into the editorial text, all he has to do is figure out where to put it.
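The UUID comparison described above can be sketched in a few lines. This is only an illustration, not the code on the branch: the `StandoffRange` class, its field names, and the `standoff_changes` function are hypothetical, and the sample IDs are short placeholders rather than real UUIDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffRange:
    """Hypothetical in-memory form of one standoff range."""
    uuid: str   # assigned once, when the range is first created
    tag: str
    start: int
    end: int


def standoff_changes(old_ranges, new_ranges):
    """Return (added, removed) ranges, comparing two versions by UUID only."""
    old_by_id = {r.uuid: r for r in old_ranges}
    new_by_id = {r.uuid: r for r in new_ranges}
    added = [r for u, r in new_by_id.items() if u not in old_by_id]
    removed = [r for u, r in old_by_id.items() if u not in new_by_id]
    return added, removed


# Transcription versions 1 and 2 from the example above.
v1_text = "I took the bus today because I was was in a hurry."
v2_text = "I took the train today because I was in a hurry."

v1 = [
    StandoffRange("id-para", "para", 0, len(v1_text)),
    StandoffRange("id-strike", "strike", v1_text.index("today"), v1_text.index("today") + 6),
]
v2 = [
    StandoffRange("id-para", "para", 0, len(v2_text)),
    StandoffRange("id-strike", "strike", v2_text.index("today"), v2_text.index("today") + 6),
    StandoffRange("id-blue", "blue", v2_text.index("hurry"), v2_text.index("hurry") + 5),
]

added, removed = standoff_changes(v1, v2)
print([r.tag for r in added], [r.tag for r in removed])
```

Ranges whose offsets moved but whose UUIDs are unchanged (here `para` and `strike`) are deliberately not reported, since only additions and removals need to be shown to the editor.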
Here's how this could work:
The editor looks in the transcription window to see what was changed in the transcription. He clicks on Go to next change and sees that `bus` was changed to `train`. He looks for the corresponding change in the editorial text window (he has to do this visually, because we can't automate it), clicks on it to open a context menu, and selects Accept current transcription variant. The GUI replaces `bus` with `train`.
In the transcription window, he clicks on Go to next change again and sees that the tag `<blue>` was added. He drags and drops the tag from the transcription window into the editorial window, onto the word `hurry` (he has to do this manually because we can't automate it). This adds the standoff tag `<blue>` to the editorial text.
Now he's finished. He has one diff left in the editorial text, the deletion of `today`, but he wants to keep that. So he saves the editorial text, creating editorial version 3, which is stored with its standoff and a link to transcription 2. The diffs between editorial version 3 and transcription 2 can be generated again any time they're needed.
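As a rough illustration of how those diffs could be regenerated on demand as paired ranges, here is a sketch using Python's standard `difflib`. The function names `paired_ranges` and `transcription_position` are made up for this example, and a character-level `SequenceMatcher` is just one possible diff algorithm, not necessarily the one Knora would use.

```python
from difflib import SequenceMatcher


def paired_ranges(transcription, editorial):
    """Diff two strings and return one pair of ranges per diff chunk.

    Each item is (tag, (t_start, t_end), (e_start, e_end)), where tag is
    'equal', 'delete', 'insert' or 'replace' (difflib's own vocabulary).
    """
    matcher = SequenceMatcher(None, transcription, editorial, autojunk=False)
    return [(tag, (i1, i2), (j1, j2))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]


def transcription_position(pairs, editorial_pos):
    """Map a position in the editorial text to a position in the transcription."""
    for tag, (t_start, _t_end), (e_start, e_end) in pairs:
        if e_start <= editorial_pos < e_end:
            if tag == "equal":
                # Same text on both sides: keep the offset within the chunk.
                return t_start + (editorial_pos - e_start)
            # Text added or changed by the editor: point at the start of
            # the corresponding transcription range.
            return t_start
    raise ValueError("position is outside the editorial text")


transcription = "I took the bus today because I was was in a hurry."
editorial = "I took the bus because I was in a hurry."
pairs = paired_ranges(transcription, editorial)
for tag, t_range, e_range in pairs:
    print(tag, t_range, e_range)
```

Because the pairs are derived purely from the two strings, nothing about the diff needs to be stored; only the link from the editorial version to the transcription version it was based on.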
Suppose a user makes a text with standoff, then adds `Annotation` resources that annotate the standoff resources. Then the user creates a new version of the text, with a new set of standoff resources. Some of the standoff is the same as in the old version. Can the existing annotations be made to point to the standoff resources that haven't changed?
Here's my suggestion: If every standoff resource gets a UUID when it's created, as described above, then we can make a new kind of annotation property, which points not to an IRI, but to a UUID. When a user edits standoff, we can keep the same UUIDs for any standoff tags that the user doesn't delete, and make new standoff resources with the same UUIDs. So any annotations pointing to those UUIDs will now point to standoff resources referring to the new version of the text (as well as to standoff resources referring to the old version).
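A minimal sketch of that lookup, assuming annotations carry a target UUID instead of an IRI. The dictionary shapes, the `target_uuid` key, and the function names are all hypothetical; in Knora this would be a query over RDF rather than an in-memory index.

```python
from collections import defaultdict


def index_by_uuid(versions):
    """Index standoff tags from every text version by their UUID.

    `versions` maps a version label to a list of tag dicts, each with at
    least a "uuid" key (a made-up shape for this example).
    """
    index = defaultdict(list)
    for label, tags in versions.items():
        for tag in tags:
            index[tag["uuid"]].append((label, tag))
    return index


def resolve_annotation(annotation, index):
    """Return every (version, tag) pair the annotation's UUID refers to."""
    return index.get(annotation["target_uuid"], [])


versions = {
    "v1": [{"uuid": "id-strike", "tag": "strike", "start": 15, "end": 21}],
    "v2": [
        {"uuid": "id-strike", "tag": "strike", "start": 17, "end": 23},
        {"uuid": "id-blue", "tag": "blue", "start": 42, "end": 47},
    ],
}
index = index_by_uuid(versions)
annotation = {"comment": "Why was this struck out?", "target_uuid": "id-strike"}
print([label for label, _ in resolve_annotation(annotation, index)])
```

The annotation itself is never rewritten when a new version of the text is saved; it reaches the new standoff resource only because the UUID was carried over.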
There's a prototype of this on the `wip/standoff` branch. Each XML tag or text node becomes a standoff range. If the XML tag has attributes, we store those with the standoff range. Then it's easy to convert back to XML and get an equivalent document.
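To make the idea concrete, here is a simplified sketch of the XML-to-standoff direction in Python. It is not the prototype's code: the function name `xml_to_standoff` and the dictionary layout are assumptions, and unlike the prototype it creates ranges only for elements, not for text nodes.

```python
import uuid
import xml.etree.ElementTree as ET


def xml_to_standoff(xml_source):
    """Split an XML document into a plain string plus standoff ranges.

    Returns (text, ranges); each range records the element name, its
    attributes, and its start/end offsets in the extracted text.
    """
    root = ET.fromstring(xml_source)
    chunks = []
    ranges = []
    position = 0

    def append_text(text):
        nonlocal position
        if text:
            chunks.append(text)
            position += len(text)

    def visit(element):
        standoff = {
            "uuid": str(uuid.uuid4()),          # new ID for each created range
            "tag": element.tag,
            "attributes": dict(element.attrib),
            "start": position,
            "end": None,
        }
        ranges.append(standoff)                 # appended in document order
        append_text(element.text)
        for child in element:
            visit(child)
            append_text(child.tail)             # text between sibling elements
        standoff["end"] = position

    visit(root)
    return "".join(chunks), ranges


text, ranges = xml_to_standoff(
    "<para>I took the train <strike>today </strike>"
    "because I was in a <blue>hurry</blue>.</para>"
)
print(text)
for r in ranges:
    print(r["tag"], r["start"], r["end"])
```

Since every range keeps its tag name, attributes, and offsets, the reverse conversion only has to re-insert opening and closing tags at those offsets.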
To support custom standoff resource classes, the user will need to supply a mapping between XML names and RDF IRIs, specifically:
One option would be to upload this mapping with the XML document to be converted. Another option would be to construct the mapping as a Knora resource, so it could be reused.
An uploaded XML document could provide its own UUIDs in tags, which would become the UUIDs of the standoff nodes created for those tags. Then when the user uploaded a new version of the XML document, any annotations pointing to those UUIDs would also refer to the new version of the text.
For example, in a document in which persons are tagged with unique IDs, each `<person>` tag could refer to a `Person` resource in Knora. The XML could say something like:
<person person-id="http://data.knora.org/12345">Albert Einstein</person>
The user-supplied XML-to-RDF mapping would specify that the `person-id` attribute of the `person` tag should be converted into a Knora link to the specified resource.
An XML document could refer to existing hierarchical list nodes in Knora. For example:
<poem type="free-verse">Blah blah blah</poem>
The user-supplied XML-to-RDF mapping would specify that the `type` attribute of the `poem` tag should be converted into a `ListValue`, and that `free-verse` should be converted to the IRI of a particular `ListNode`.
The implementation on the `wip/standoff` branch now supports CLIX milestones, which are pairs of empty elements that look like this:
<q who="paris" sID="foo"/>...<q eID="foo"/>
Compared to other XML representations of non-hierarchical markup, CLIX has the advantages of being lightweight, unambiguous, and easy to process (any pair of empty tags with `sID` and `eID` attributes can be treated as CLIX tags, and no other information is needed to convert them to standoff). CLIX is recognised in the TEI/XML guidelines as an extension of the TEI, as long as it uses a distinct, non-TEI namespace.
The implementation on `wip/standoff` can convert an XML document containing a mixture of hierarchical elements and CLIX markup to standoff, and convert the standoff back to the original XML document.
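The pairing rule for CLIX tags is simple enough to sketch. The following Python is an illustration only, not the branch's implementation: `clix_to_standoff` and the dictionary layout are invented for the example, namespaces are ignored, and only the XML-to-standoff direction is shown.

```python
import xml.etree.ElementTree as ET


def clix_to_standoff(xml_source):
    """Convert XML with hierarchical elements and CLIX milestones to standoff.

    An empty element with an sID attribute opens a range; the empty element
    with the matching eID attribute closes it. Other elements are treated
    as ordinary hierarchical markup.
    """
    root = ET.fromstring(xml_source)
    chunks = []
    ranges = []
    open_milestones = {}
    position = 0

    def append_text(text):
        nonlocal position
        if text:
            chunks.append(text)
            position += len(text)

    def visit(element):
        if "sID" in element.attrib:
            # Start milestone: remember where the range begins.
            attributes = {k: v for k, v in element.attrib.items() if k != "sID"}
            open_milestones[element.attrib["sID"]] = {
                "tag": element.tag, "attributes": attributes,
                "start": position, "end": None,
            }
            return
        if "eID" in element.attrib:
            # End milestone: close the range opened under the same ID.
            standoff = open_milestones.pop(element.attrib["eID"])
            standoff["end"] = position
            ranges.append(standoff)
            return
        standoff = {"tag": element.tag, "attributes": dict(element.attrib),
                    "start": position, "end": None}
        ranges.append(standoff)
        append_text(element.text)
        for child in element:
            visit(child)
            append_text(child.tail)
        standoff["end"] = position

    visit(root)
    return "".join(chunks), ranges


# A quotation that starts in one line and ends in the next.
text, ranges = clix_to_standoff(
    '<div><l>one <q who="paris" sID="foo"/>two</l>'
    '<l>three<q eID="foo"/> four</l></div>'
)
for r in ranges:
    print(r["tag"], repr(text[r["start"]:r["end"]]), r["attributes"])
```

The `q` range overlaps both `l` ranges, which could not be expressed with ordinary nested elements.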
We will also have to support other XML representations of non-hierarchical markup. In one current use case, poetry is marked up in TEI/XML using empty `<milestone>` elements to indicate syllable boundaries. Each syllable is preceded by one of three milestone tags:
<milestone type="/"/>
<milestone type="\"/>
<milestone type="="/>
Each milestone is implicitly terminated by the next milestone or by another closing tag (e.g. `</div>`, `</l>`, or `</p>`), whichever comes first.
TODO: Figure out what information we need from the user so we can process milestones like these.
After some discussion, @tobiasschweizer and I came to the conclusion that it would probably be too difficult (given the resources we have available) to implement a WYSIWYG text editor in JavaScript that supported overlapping markup. As an alternative, we could offer a non-WYSIWYG markup editor and a preview pane, which is an approach that has worked well for LaTeX. Probably the only well-supported markup syntax that's powerful enough to be used in such an editor is XML. Since we have to support XML import and export anyway, using XML in the editor would also be simpler than introducing some other kind of markup.
However, XML is cumbersome to write by hand. There are several JavaScript-based XML editor components that make things easier for the user by doing one or both of the following:
Here are two such components:
One serious problem with editing XML by hand is the difficulty of mixing it with text in right-to-left (RTL) languages such as Arabic, Hebrew, Farsi, and Urdu. Unless the editor has been specifically designed to support RTL text, the standard Unicode bidirectional text (bidi) algorithm displays punctuation marks in XML (such as `<` and `>`) in the wrong places, making it difficult to see where the markup actually is. Here are some descriptions of this nightmare:
The commercial XML editor Oxygen seems to deal with this problem quite well, both in text mode, where you can see the XML syntax, and in author mode, which is a tree view that hides the XML syntax (see their video demo). For Knora users working with RTL text, buying an Oxygen licence would be one option. However, this would be rather cumbersome, because text would have to be exported from Knora into Oxygen and back again, if only by copying and pasting.
In theory, it should be possible for a JavaScript-based XML editor to deal with RTL text as well as Oxygen does, using the improved bidi support introduced in HTML5 and in Unicode 6.3. In practice, there seems to be no existing JavaScript XML editor that does this.
`<span>` element, thus rendering the text illegible.
This could be done by copying the transcription of each region into the critical text, and marking the copied text with a standoff tag containing a link to the original region. However, the question is how to handle a merge when the original transcription is updated. Probably the simplest approach would be to open the relevant part of the critical text in a separate editor window, proceed as described above, then copy the merged region text back into the critical text.
Something to cite when we write a paper about this:
Marilyn Deegan and Kathryn Sutherland, Transferred Illusions: Digital Technology and the Forms of Print. Routledge, 2016.
p. 87:
They go on to cite two projects, ARCHway and JITM, that could be useful points of comparison:
They also cite an article with the great title Embedded Markup Considered Harmful:
SGML advocates I have talked to appear to have the belief that everything is either sequential and hierarchical, or can be represented that way. What is not expressible sequentially and hierarchically is deemed to be nonexistent, inconceivable, evil, or mistaken....
You can always force structures into other structures and claim that they're undamaged; another way to say this is that if you smash things up it is easier to make them fit. Enforcing sequence and hierarchy simply restricts the possibilities.
Like a TV dinner, embedded markup nominally contains everything you could want. "What else could you possibly want?" means "It's not on the menu."
What we're doing here is a bit like Just-in-Time Markup. There's a demo of JITM online but it hasn't been updated since 2004 and I can't really tell if it's working.
I did a mockup of how to use HTML to render XML markup with RTL text, and submitted an issue to CodeMirror explaining the problem and the suggested solution.
Anne-Sophie Bories is working with XML files in which every syllable is marked up. The text is thus completely unreadable. I can understand that people do this because there's no better alternative, but it's still awful. It would be much better to have a WYSIWYG editor, even if it was a standalone application.
I'm going to make some mockups and see what's feasible in different programming languages.
Notes on discussion with @tobiasschweizer:
`TextValue` to have more than one standoff perspective. It's simpler to copy the text into a new `TextValue` and mark up the copy, which can be linked back to its source and compared with it.
`R1` contains a `TextValue` with a standoff link to `R2`, it would be useful to show an incoming link between `R1` and `R2`, which is what API v1 currently does. We can't actually create such a link, for the reasons explained above, but we can probably simulate it when we query incoming links in the resources responder.
`addition`.
After discussion with @lrosenth and @tobiasschweizer:
To limit the number of standoff resources created while a user is editing a text, we need to offer an editor that stores text with standoff in an intermediate format, e.g. as XML. The user would then be able to 'publish' a text as RDF when ready.
To deal with this issue:
If someone has annotated a standoff resource by linking to its IRI, we would have to worry about the cardinality constraints on the annotation. We could eliminate this problem if we allowed annotations on standoff only using UUIDs, which are not involved in constraint checking.
We can make two subclasses of `Resource`, called `LinkableResource` and `NonLinkableResource`. The property `hasLinkTo` will only be allowed to point to a `LinkableResource`.
What to do next:
`knora-base` as described above.
`StandoffValueV1` in `ValueUtilV1`.
`StandoffValueV1` in the SPARQL templates.
`TextValueV1` or in the form used in `StandoffUtil`.
Much of our discussion today was about how to reduce the storage and performance overhead of RDF/standoff. Here are some ideas on how to make standoff resources more lightweight:
`TextValue`. That would eliminate a lot of redundant triples in typical cases. There is a use case in which you need to share standoff but not the text that it refers to (because it's copyrighted), but this will be exceptional.
`StandoffValue` and include its contents directly in the resource as datatype properties.
Then I think you could make a minimal standoff resource with just five predicates:
rdf:type
isTagOnText
startHasOffset
endHasOffset
isDeleted
This would require more standoff-specific Scala code (to copy permissions and to read and write the special standoff properties), but probably not much more. What do you think, @lrosenth?
Standoff resources as described above are a special kind of resource:
`Value`, from which they can inherit owner, project, and permissions.
Let's make this explicit by creating a subclass of `Resource` called `DependentResource` that has these characteristics. Knora can refuse to violate these rules when it's dealing with a `DependentResource`. Then `StandoffTag` can be a subclass of `DependentResource`.
Someone uses Knora to write an academic paper. A lot of people want to highlight parts of it and comment on them. There's no reason for them to copy the text of the article into a new `TextValue`, because they're not going to change it. How can they mark it up with their own standoff, while still keeping the principle that standoff resources are dependent on a `TextValue`?
To solve this, we could allow a derived `TextValue` to use the string of the base `TextValue` by reference rather than by copying it. If you create a new `TextValue` that's derived from an existing `TextValue` (even in another resource), and its `valueHasString` is identical, we can avoid storing the string redundantly:
<http://data.knora.org/1/values/1> rdf:type knora-base:TextValue ;
knora-base:valueHasString "This is the original text" .
<http://data.knora.org/2/values/1> rdf:type knora-base:TextValue ;
knora-base:isDerivedFrom <http://data.knora.org/1/values/1> ;
knora-base:valueSharesString true .
When the user creates or edits `<http://data.knora.org/2/values/1>`, we can check whether they modified the original string. If so, we store the modified copy. If not, we use `valueSharesString true`.
This allows the derived `TextValue` to have its own dependent standoff.
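The check described above could look roughly like this. It is a sketch under assumptions: the function name and the representation of triples as Python tuples are made up for illustration, and the real check would live in Knora's value-handling code and operate on the triplestore.

```python
def derived_text_value_triples(derived_iri, base_iri, base_string, submitted_string):
    """Build the triples for a derived TextValue (illustrative only).

    If the submitted string is identical to the base value's string, the
    derived value only records that it shares the string; otherwise it
    stores its own copy.
    """
    triples = [
        (derived_iri, "rdf:type", "knora-base:TextValue"),
        (derived_iri, "knora-base:isDerivedFrom", base_iri),
    ]
    if submitted_string == base_string:
        triples.append((derived_iri, "knora-base:valueSharesString", True))
    else:
        triples.append((derived_iri, "knora-base:valueHasString", submitted_string))
    return triples


base = "http://data.knora.org/1/values/1"
derived = "http://data.knora.org/2/values/1"
for triple in derived_text_value_triples(
        derived, base, "This is the original text", "This is the original text"):
    print(triple)
```

A consequence worth noting: if the base string is later edited in place rather than versioned, a sharing value would silently change too, so this presumably relies on each version of a `TextValue` being immutable.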
`unbrokenline` tags where `poemType` is `mixedprose` and the number of syllables within the unbroken line is between 10 and 12.
`person` tag with a reference to his IRI).
`date` tag).
For one simple reason: if a text has 1000 standoff tags, querying the entire contents of 1000 resources will be too slow.
Result of conversation with @lrosenth:
A standoff tag will be something more like a `Value` than a `Resource`. We will make a set of more abstract data types, like `DateData`, `ColorData`, etc. Then `DateValue` will be a subclass of `Value` and of `DateData`, and `StandoffDateTag` will be a subclass of `StandoffTag` and of `DateData`. Then it will be possible to search for a date regardless of whether it's a standoff tag or a value.
Instead of having one standoff tag class for rich-text markup, we'll have different classes for bold, italic, strikethrough, etc., to simplify queries.
Each project will be able to define its own subclasses of `StandoffTag`.
XML attributes will be represented as predicates and objects on standoff tags. The user who uploads an XML file will have to provide a mapping between (XML namespace, attribute name) and the corresponding RDF property.
So we will keep the basic structure we have now: a standoff tag will be an object attached to a `TextValue`. For now, it will not have permissions, owner, project, or versioning, but we could add these later if necessary.
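The shared abstract data types described above can be pictured with ordinary multiple inheritance. The Python below is only an analogy for the planned RDF class hierarchy: the class names come from the discussion, but the fields and the `find_dates` helper are hypothetical.

```python
class DateData:
    """Abstract date payload shared by values and standoff tags."""
    def __init__(self, date):
        self.date = date


class Value:
    """Stand-in for an ordinary Knora value."""


class StandoffTag:
    """Stand-in for a standoff tag attached to a TextValue."""
    def __init__(self, start, end):
        self.start = start
        self.end = end


class DateValue(Value, DateData):
    def __init__(self, date):
        DateData.__init__(self, date)


class StandoffDateTag(StandoffTag, DateData):
    def __init__(self, start, end, date):
        StandoffTag.__init__(self, start, end)
        DateData.__init__(self, date)


def find_dates(items, date):
    """Find every item carrying the given date, whatever kind of item it is."""
    return [item for item in items if isinstance(item, DateData) and item.date == date]


items = [
    DateValue("1905-06-30"),
    StandoffDateTag(10, 20, "1905-06-30"),
    StandoffDateTag(0, 4, "1915-11-25"),
    StandoffTag(5, 9),
]
print([type(item).__name__ for item in find_dates(items, "1905-06-30")])
```

The search only asks whether something is a `DateData`, which is the point of the design: one query pattern covers both dates stored as values and dates marked up inside a text.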
We should decide how to represent line breaks. I thought it would be a good idea to represent them inside the text string itself, so that words on different lines do not get attached to one another.
To avoid the carriage return vs. newline problem, I suggest using the Unicode character INFORMATION SEPARATOR TWO, which is available as the constant `org.knora.webapi.util.FormatConstants.INFORMATION_SEPARATOR_TWO`. Historically it's a record separator, for separating things like rows in a table.
Yes, you are right! I remember the problems we had because these characters got converted.
I will have to check if this is easy to do in the GUI, so the text would already be correct.
@benjamingeer
This is a first attempt:
### http://www.knora.org/ontology/knora-base#StandoffTag
:StandoffTag rdf:type owl:Class ;
rdfs:subClassOf [ rdf:type owl:Restriction ;
owl:onProperty :standoffHasStart ;
owl:cardinality "1"^^xsd:nonNegativeInteger
] ,
[ rdf:type owl:Restriction ;
owl:onProperty :standoffHasEnd ;
owl:cardinality "1"^^xsd:nonNegativeInteger
] ;
rdfs:comment "Represents a standoff markup tag in a TextValue"@en .
### http://www.knora.org/ontology/knora-base#StandoffHrefTag
:StandoffHrefTag rdf:type owl:Class ;
rdfs:subClassOf :StandoffTag ,
[ rdf:type owl:Restriction ;
owl:onProperty :standoffHasHref ;
owl:cardinality "1"^^xsd:nonNegativeInteger
] ;
rdfs:comment "Represents a hyperlink in a TextValue"@en .
### http://www.knora.org/ontology/knora-base#StandoffLinkTag
:StandoffLinkTag rdf:type owl:Class ;
rdfs:subClassOf :StandoffTag ,
[ rdf:type owl:Restriction ;
owl:onProperty :standoffHasLink ;
owl:cardinality "1"^^xsd:nonNegativeInteger
] ;
rdfs:comment "Represents a reference to a Knora resource in a TextValue"@en .
### http://www.knora.org/ontology/knora-base#StandoffVisualTag
:StandoffVisualTag rdf:type owl:Class ;
rdfs:subClassOf :StandoffTag ;
rdfs:comment "Represents markup information needed to render a TextValue"@en .
### http://www.knora.org/ontology/knora-base#StandoffParagraphTag
:StandoffParagraphTag rdf:type owl:Class ;
rdfs:subClassOf :StandoffVisualTag ;
rdfs:comment "Represents a paragraph in a TextValue"@en .
### http://www.knora.org/ontology/knora-base#StandoffDataTypeTag
:StandoffDataTypeTag rdf:type owl:Class ;
rdfs:subClassOf :StandoffTag ;
rdfs:comment "Represents a value type in a TextValue"@en .
### http://www.knora.org/ontology/knora-base#StandoffDateValueTag
:StandoffDateValueTag rdf:type owl:Class ;
rdfs:subClassOf :StandoffDataTypeTag,
:DateBase ;
rdfs:comment "Represents a DateValue in a TextValue"@en .
So now when we query the class `StandoffDateValueTag` we get both the properties of `StandoffDataTypeTag` (which inherits from `StandoffTag`) and those of `DateBase`, which represents a date value type.
Basically, we have two categories: VisualTag and DataTypeTag. Should there be a third for linking tags, so they would not directly be subclasses of `StandoffTag`?
And would it make sense to call `StandoffVisualTag` something else, since a paragraph is also logical information? What about `TextualMarkupTag`? That would be more neutral, although it would imply that a DataTypeTag is not markup ...
Future steps:
`StandoffUtil` so they store RDF class and property names rather than namespaced XML names.
`StandoffUtil`, implement a way of specifying a mapping between namespaced XML tags and attribute names and standoff RDF class and property names.
`TextValue` to share its string with another `TextValue`.
`TextValue` with standoff.
`TextValue` IRI, and exports the `TextValue` as XML.
Closing this because the core features have been implemented. Let's use them for a little while, then revisit this for the next iteration of design and implementation.
We've been talking about redesigning how we store standoff in RDF, to make it flexible enough to meet the needs that different projects have described to us.
Requirements
Design
This could be done by representing standoff nodes as Knora resources of class `StandoffTag` rather than as blank RDF nodes.
The `TextValue` class would not point directly to standoff resources, because this would not make sense if there were multiple standoff perspectives on the same text. However, the `TextValue` could optionally point to a default `StandoffPerspective`. When the API received a text value with standoff, it would generate a `StandoffPerspective` and standoff resources, and create a `TextValue` with the property `hasDefaultStandoffPerspective` pointing to the new standoff perspective. In API version 1, when the text value was returned in response to an API GET request, it could be returned along with its default standoff perspective, to match the current API behaviour.
Projects would be able to define their own subclasses of `StandoffTag` with specific semantics. We would provide a new API route for converting an XML document into text with standoff markup. The text would be uploaded along with parameters specifying how to map XML tags (or combinations of tags and attributes) to standoff resource classes.
Currently, we maintain direct links between a resource containing text and resources that the text's standoff markup refers to. This is useful because these links are easily shown as incoming links to the target resource, and are easily found in searches. This is possible because the resource contains a reference count for each of these links. Maintaining reference counts is possible because we lock each resource when changing its contents. Our resource locks are based on the rule that each API operation locks only one resource at a time, to prevent deadlocks. If standoff is represented by a graph of resources, it would no longer be feasible to maintain these direct standoff links. Instead, we could have direct links from the standoff resources. This could be accomplished by making `StandoffLinkTag` a subclass of `LinkObj`. Each `StandoffLinkTag` would then have a link property pointing to the target resource, as well as a property pointing to the text value itself. As a result, we could eliminate reference counts from `LinkValue` and simplify some code.
If we decide to go this route, the question is whether we implement it now or wait until after releasing Knora 1.0. I don't think it would be good for a single repository to have both the old kind of standoff and the new kind. We should pick one or the other. If we wait to make the change, we'll have projects containing standoff RDF that will need to be converted to the new design. My feeling is that it would be better to do it now.
@lrosenth