dasch-swiss / dsp-api

DaSCH Service Platform API
http://dasch-swiss.github.io/dsp-api/
Apache License 2.0

Redesign Standoff #101

Closed benjamingeer closed 7 years ago

benjamingeer commented 8 years ago

We've been talking about redesigning how we store standoff in RDF, to make it flexible enough to meet the needs that different projects have described to us.

Requirements

This could be done by representing standoff nodes as Knora resources of class StandoffTag rather than as blank RDF nodes.

The TextValue class would not point directly to standoff resources, because this would not make sense if there were multiple standoff perspectives on the same text. However, the TextValue could optionally point to a default StandoffPerspective. When the API received a text value with standoff, it would generate a StandoffPerspective and standoff resources, and create a TextValue with the property hasDefaultStandoffPerspective pointing to the new standoff perspective. In API version 1, when the text value was returned in response to an API GET request, it could be returned along with its default standoff perspective, to match the current API behaviour.

Projects would be able to define their own subclasses of StandoffTag with specific semantics. We would provide a new API route for converting an XML document into text with standoff markup. The text would be uploaded along with parameters specifying how to map XML tags (or combinations of tags and attributes) to standoff resource classes.

Currently, we maintain direct links between a resource containing text and resources that the text's standoff markup refers to. This is useful because these links are easily shown as incoming links to the target resource, and are easily found in searches. This is possible because the resource contains a reference count for each of these links. Maintaining reference counts is possible because we lock each resource when changing its contents. Our resource locks are based on the rule that each API operation locks only one resource at a time, to prevent deadlocks. If standoff is represented by a graph of resources, it would no longer be feasible to maintain these direct standoff links. Instead, we could have direct links from the standoff resources. This could be accomplished by making StandoffLinkTag a subclass of LinkObj. Each StandoffLinkTag would then have a link property pointing to the target resource, as well as a property pointing to the text value itself. As a result, we could eliminate reference counts from LinkValue and simplify some code.

If we decide to go this route, the question is whether we implement it now or wait until after releasing Knora 1.0. I don't think it would be good for a single repository to have both the old kind of standoff and the new kind. We should pick one or the other. If we wait to make the change, we'll have projects containing standoff RDF that will need to be converted to the new design. My feeling is that it would be better to do it now.

@lrosenth

benjamingeer commented 8 years ago

Work in progress is on the wip/standoff branch.

benjamingeer commented 8 years ago

Merging changes from an updated transcription

I've been working on this question: if you make a diplomatic transcription of a manuscript, and then a critical text based on it, what happens when there's a new version of the transcription?

Our original idea was to avoid all redundancy in text storage, by having the transcription and the critical text contain standoff referring to the same Unicode string. However, I now think this isn't a good idea, for two reasons:

  1. If text has been crossed out (e.g. "I like spaghettipizza", with "spaghetti" struck through), the Unicode string of the transcription is likely to contain nonsense ("spaghettipizza") that you don't want in search results, and it won't contain things you actually want to search for ("I like pizza").
  2. To get a readable text out of the Unicode string of the transcription, you would always have to process it first because of deletions and insertions specified in the editorial text, and this would make doing anything else with it more complicated.

Therefore I think it's better if the editorial text has its own Unicode string. To keep the transcription's Unicode string out of full-text searches, we can make it the object of a new property, valueHasNonIndexedString, which would be used instead of valueHasString when the string should not be indexed for full-text search.

The next question is how to maintain the connection between the editorial text and the transcription. The user should be able to click on parts of the editorial text, and be taken to the corresponding parts of the transcription.

I think this can be done with diffs. If you diff the editorial text against the transcription, you get a list of substrings, each of which is marked to indicate whether it's the same in both texts, was deleted from the editorial text, or was added to the editorial text. Each of these substrings can easily be converted into a pair of standoff ranges, one referring to the transcription and one referring to the editorial text. With this, given any position in the editorial text, you can find the corresponding position in the transcription, because there are only two possibilities:

  1. The position is in a substring that's the same in both strings. Take the relative offset of the position from the beginning of that string, and add it to the absolute offset of the beginning of the string in the transcription.
  2. The position is in a substring that was inserted. Use the position in the transcription where it was inserted.
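The two cases above can be sketched with Python's standard difflib module. This is only an illustration under some assumptions: the diff is character-based, `map_position` is a hypothetical helper name, and a replaced substring is treated like an insertion.

```python
from difflib import SequenceMatcher


def map_position(editorial: str, transcription: str, pos: int) -> int:
    """Map an offset in the editorial text to an offset in the transcription."""
    matcher = SequenceMatcher(None, transcription, editorial, autojunk=False)
    # Each opcode is (tag, t_start, t_end, e_start, e_end), where the t_*
    # offsets refer to the transcription and the e_* offsets to the editorial text.
    for tag, t_start, t_end, e_start, e_end in matcher.get_opcodes():
        if e_start <= pos < e_end:
            if tag == "equal":
                # Case 1: keep the relative offset within the shared substring.
                return t_start + (pos - e_start)
            # Case 2: inserted (or replaced) text maps to the place in the
            # transcription where it was inserted.
            return t_start
    # The position is at the very end of the editorial text.
    return len(transcription)


transcription = "I took the bus today because I was was in a hurry."
editorial = "I took the bus because I was in a hurry."
# Prints the offset of "hurry" in the transcription.
print(map_position(editorial, transcription, editorial.index("hurry")))
```

Deleted substrings never contain a position in the editorial text, so they need no case of their own.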

Now, what happens when someone makes a new version of the transcription? We have three versions to consider: transcription version 1, transcription version 2, and editorial version 1. Here's an example, using XML tags to represent standoff ranges.

Transcription version 1:

<para>I took the bus <strike>today </strike>because I was was in a hurry.</para>

Editorial version 1 keeps the <para>, deletes <strike>today</strike>, and corrects the repeated word was:

<para>I took the bus because I was in a hurry.</para>

Now the transcriber corrects the transcription by changing bus to train, correcting the repeated word (which, it turns out, isn't repeated in the manuscript), and noting a change in ink colour, to produce transcription version 2:

<para>I took the train <strike>today </strike>because I was in a <blue>hurry</blue>.</para>

Now the editor wants to merge the transcriber's changes into the editorial version, to make editorial version 2. This is similar to the three-way merge performed by git and other software version control systems. But git automatically merges changes that it sees as non-conflicting. We can't do that, because any change in a transcription could potentially change the interpretation of everything else. After seeing some particular change introduced in the new version of the transcription, the editor may need to reconsider some of his previous edits. So here's what should happen:

The editor chooses to accept train and <blue> and keep his previous edits. The result is editorial version 2:

<para>I took the train because I was in a <blue>hurry</blue>.</para>

There are two obstacles to implementing this:

  1. When we display the editorial text, we can diff its Unicode string with the Unicode string of the new transcription, and show those differences to the user. But it's much harder to determine which of those differences are the result of new changes in the transcription, and which ones were already there before. (We can't use the command-line program diff3, because it's line-based.) It would be annoying for the user to have to check each difference to figure out which ones are new.
  2. It doesn't seem feasible to diff standoff, so you can't diff the standoff of the editorial text with the standoff of the transcription. So if new standoff was added to the transcription (<blue> in the example above), we can't write an algorithm to determine the position in the editorial text where that new standoff should be merged. Sometimes we could make a reasonable suggestion, but often (especially if text has been moved) we couldn't. So the user has to find the correct position manually.

Instead of solving these two problems algorithmically, I suggest we deal with them by providing a user interface that makes it easier for the user to do the work manually, using two windows (or panes): a read-only window showing the transcription, and an editable window showing the editorial text. In the transcription window, we can show what changed in the transcription, both in the Unicode string and in the standoff. To show the changes in the Unicode string, we just need to diff it with the previous version. To show the changes in the standoff, we give each standoff range a unique ID (e.g. a UUID) when it's created. Then we can easily identify which ones were added and which ones were removed in the transcription. For example, we can tell the user that <blue> was added to the transcription. Then if the user wants to merge <blue> into the editorial text, all he has to do is figure out where to put it.
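Detecting added and removed standoff by UUID could look roughly like the sketch below. The `StandoffRange` class and the `standoff_changes` function are hypothetical names, not existing Knora code; the offsets are those of the bus/train example above.

```python
from dataclasses import dataclass
from uuid import uuid4


@dataclass(frozen=True)
class StandoffRange:
    uuid: str   # assigned once, when the range is first created
    tag: str
    start: int
    end: int    # exclusive


def standoff_changes(old, new):
    """Return (added, removed) ranges, compared by UUID only."""
    old_ids = {r.uuid for r in old}
    new_ids = {r.uuid for r in new}
    added = [r for r in new if r.uuid not in old_ids]
    removed = [r for r in old if r.uuid not in new_ids]
    return added, removed


para_id, strike_id = str(uuid4()), str(uuid4())
# Transcription version 1.
v1 = [
    StandoffRange(para_id, "para", 0, 50),
    StandoffRange(strike_id, "strike", 15, 21),
]
# Transcription version 2: same UUIDs for the surviving ranges, a new one for <blue>.
v2 = [
    StandoffRange(para_id, "para", 0, 48),
    StandoffRange(strike_id, "strike", 17, 23),
    StandoffRange(str(uuid4()), "blue", 42, 47),
]
added, removed = standoff_changes(v1, v2)
print([r.tag for r in added], [r.tag for r in removed])  # ['blue'] []
```

Because only UUIDs are compared, a range whose offsets shifted (such as <strike>) is not reported as a change.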

Here's how this could work:

The editor looks in the transcription window to see what was changed in the transcription. He clicks on Go to next change and sees that bus was changed to train. He looks for the corresponding change in the editorial text window (he has to do this visually, because we can't automate it), clicks on it to open a context menu, and selects Accept current transcription variant. The GUI replaces bus with train.

In the transcription window, he clicks on Go to next change again and sees that the tag <blue> was added. He drags and drops the tag from the transcription window into the editorial window, onto the word hurry (he has to do this manually because we can't automate it). This adds the standoff tag <blue> to the editorial text.

Now he's finished. He has one diff left in the editorial text, the deletion of today, but he wants to keep that. So he saves the editorial text, creating editorial version 2, which is stored with its standoff and a link to transcription version 2. The diffs between editorial version 2 and transcription version 2 can be generated again any time they're needed.

benjamingeer commented 8 years ago

Keeping annotations when a text is updated

Suppose a user makes a text with standoff, then adds Annotation resources that annotate the standoff resources. Then the user creates a new version of the text, with a new set of standoff resources. Some of the standoff is the same as in the old version. Can the existing annotations be made to point to the standoff resources that haven't changed?

Here's my suggestion: If every standoff resource gets a UUID when it's created, as described above, then we can make a new kind of annotation property, which points not to an IRI, but to a UUID. When a user edits standoff, we can keep the same UUIDs for any standoff tags that the user doesn't delete, and make new standoff resources with the same UUIDs. So any annotations pointing to those UUIDs will now point to standoff resources referring to the new version of the text (as well as to standoff resources referring to the old version).
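As a rough illustration of the idea, here is a sketch in which an annotation stores a UUID and is resolved against the standoff tags of all text versions. The class names, the `targets` function, and the example IRIs are all made up for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffTag:
    iri: str           # a new IRI for every version of the text
    uuid: str          # kept when the tag survives an edit
    text_version: int


@dataclass(frozen=True)
class Annotation:
    comment: str
    target_uuid: str   # points to a UUID rather than to an IRI


def targets(annotation, tags):
    """All standoff tags, in any text version, that the annotation refers to."""
    return [t for t in tags if t.uuid == annotation.target_uuid]


tags = [
    StandoffTag("http://example.org/standoff/1", "uuid-a", 1),
    StandoffTag("http://example.org/standoff/2", "uuid-b", 1),  # deleted in version 2
    StandoffTag("http://example.org/standoff/3", "uuid-a", 2),  # same tag, new version
]
note = Annotation("Check this reading", "uuid-a")
print([t.text_version for t in targets(note, tags)])  # [1, 2]
```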

benjamingeer commented 8 years ago

Converting XML to standoff and back

There's a prototype of this on the wip/standoff branch. Each XML tag or text node becomes a standoff range. If the XML tag has attributes, we store those with the standoff range. Then it's easy to convert back to XML and get an equivalent document.
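To make the idea concrete, here is a minimal sketch of the XML-to-standoff direction using Python's standard library. It is not the code on the wip/standoff branch: the names `Range` and `xml_to_standoff` are invented, and only elements (not text nodes) become ranges.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Range:
    tag: str
    attrs: dict
    start: int
    end: int  # exclusive


def xml_to_standoff(xml: str):
    """Return (plain text, ranges in document order) for a hierarchical XML string."""
    chunks = []
    ranges = []
    pos = 0

    def add_text(s):
        nonlocal pos
        if s:
            chunks.append(s)
            pos += len(s)

    def visit(elem):
        r = Range(elem.tag, dict(elem.attrib), pos, pos)
        ranges.append(r)               # parents are recorded before children
        add_text(elem.text)
        for child in elem:
            visit(child)
            add_text(child.tail)       # text between this child and the next
        r.end = pos

    visit(ET.fromstring(xml))
    return "".join(chunks), ranges


text, ranges = xml_to_standoff(
    "<para>I took the bus <strike>today </strike>because I was was in a hurry.</para>"
)
for r in ranges:
    print(r.tag, r.start, r.end, repr(text[r.start:r.end]))
```

Converting back requires knowing how ranges nest when they start or end at the same offset (for example empty elements), so a real implementation would probably need to store the document order or depth of each range as well.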

To support custom standoff resource classes, the user will need to supply a mapping between XML names and RDF IRIs, specifically:

One option would be to upload this mapping with the XML document to be converted. Another option would be to construct the mapping as a Knora resource, so it could be reused.

Standoff UUIDs in XML

An uploaded XML document could provide its own UUIDs in tags, which would become the UUIDs of the standoff nodes created for those tags. Then when the user uploaded a new version of the XML document, any annotations pointing to those UUIDs would also refer to the new version of the text.

Links to existing Knora resources

For example, in a document in which persons are tagged with unique IDs, each <person> tag could refer to a Person resource in Knora. The XML could say something like:

<person person-id="http://data.knora.org/12345">Albert Einstein</person>

The user-supplied XML-to-RDF mapping would specify that the person-id attribute of the person tag should be converted into a Knora link to the specified resource.

Hierarchical lists

An XML document could refer to existing hierarchical list nodes in Knora. For example:

<poem type="free-verse">Blah blah blah</poem>

The user-supplied XML-to-RDF mapping would specify that the type attribute of the poem tag should be converted into a ListValue, and that free-verse should be converted to the IRI of a particular ListNode.
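A user-supplied mapping of this kind might be structured roughly as follows. The format is only an assumption for discussion, and every IRI in it (other than rdf:type) is invented.

```python
# All class, property and list node IRIs below are invented for illustration.
MAPPING = {
    "person": {
        "class": "http://example.org/ontology#StandoffPersonTag",
        "attributes": {
            "person-id": {
                "property": "http://example.org/ontology#standoffHasLink",
                "kind": "link",
            },
        },
    },
    "poem": {
        "class": "http://example.org/ontology#StandoffPoemTag",
        "attributes": {
            "type": {
                "property": "http://example.org/ontology#hasPoemType",
                "kind": "list",
                "nodes": {"free-verse": "http://example.org/lists/poem-types/free-verse"},
            },
        },
    },
}

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def tag_to_statements(tag, attrs, mapping=MAPPING):
    """Convert one XML tag and its attributes to (predicate, object) pairs."""
    if tag not in mapping:
        raise KeyError(f"no mapping for tag <{tag}>")
    entry = mapping[tag]
    statements = [(RDF_TYPE, entry["class"])]
    for name, value in attrs.items():
        rule = entry["attributes"][name]
        if rule["kind"] == "list":
            # Replace the attribute value with the IRI of the list node.
            value = rule["nodes"][value]
        statements.append((rule["property"], value))
    return statements


print(tag_to_statements("person", {"person-id": "http://data.knora.org/12345"}))
print(tag_to_statements("poem", {"type": "free-verse"}))
```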

benjamingeer commented 8 years ago

Converting non-hierarchical standoff to and from XML

CLIX markup

The implementation on the wip/standoff branch now supports CLIX milestones, which are pairs of empty elements that look like this:

<q who="paris" sID="foo"/>...<q eID="foo"/>

Compared to other XML representations of non-hierarchical markup, CLIX has the advantages of being lightweight, unambiguous, and easy to process (any pair of empty tags with sID and eID attributes can be treated as CLIX tags, and no other information is needed to convert them to standoff). CLIX is recognised in the TEI/XML guidelines as an extension of the TEI, as long as it uses a distinct, non-TEI namespace.

The implementation on wip/standoff can convert an XML document containing a mixture of hierarchical elements and CLIX markup to standoff, and convert the standoff back to the original XML document.
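The following sketch shows how a mixture of hierarchical elements and CLIX milestones could be turned into standoff ranges. It is an independent illustration, not the branch's implementation; `clix_to_standoff` and the dictionary layout of a range are assumptions, and namespaces are ignored.

```python
import xml.etree.ElementTree as ET


def clix_to_standoff(xml: str):
    """Return (text, ranges); ranges may overlap because of CLIX milestones."""
    chunks, ranges, open_clix = [], [], {}
    pos = 0

    def add_text(s):
        nonlocal pos
        if s:
            chunks.append(s)
            pos += len(s)

    def visit(elem):
        if "sID" in elem.attrib:
            # Start milestone: remember where the range begins.
            attrs = {k: v for k, v in elem.attrib.items() if k != "sID"}
            open_clix[elem.attrib["sID"]] = {"tag": elem.tag, "attrs": attrs, "start": pos}
            return
        if "eID" in elem.attrib:
            # End milestone: close the range opened by the matching sID.
            r = open_clix.pop(elem.attrib["eID"])
            r["end"] = pos
            ranges.append(r)
            return
        # Ordinary hierarchical element.
        r = {"tag": elem.tag, "attrs": dict(elem.attrib), "start": pos}
        add_text(elem.text)
        for child in elem:
            visit(child)
            add_text(child.tail)
        r["end"] = pos
        ranges.append(r)

    visit(ET.fromstring(xml))
    if open_clix:
        raise ValueError(f"unterminated CLIX milestones: {sorted(open_clix)}")
    return "".join(chunks), sorted(ranges, key=lambda r: (r["start"], -r["end"]))


doc = '<text><p>Then <q who="paris" sID="foo"/>Come</p><p>away<q eID="foo"/> he said.</p></text>'
text, ranges = clix_to_standoff(doc)
for r in ranges:
    print(r["tag"], r["attrs"], repr(text[r["start"]:r["end"]]))
```

In this example the <q> range spans the boundary between the two <p> elements, which plain hierarchical XML could not express.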

Other approaches

We will also have to support other XML representations of non-hierarchical markup. In one current use case, poetry is marked up in TEI/XML using empty <milestone> elements to indicate syllable boundaries. Each syllable is preceded by one of three milestone tags:

Each milestone is implicitly terminated by the next milestone or by another closing tag (e.g. </div>, </l>, or </p>), whichever comes first.

TODO: Figure out what information we need from the user so we can process milestones like these.

Editing Standoff

After some discussion, @tobiasschweizer and I came to the conclusion that it would probably be too difficult (given the resources we have available) to implement a WYSIWYG text editor in JavaScript that supported overlapping markup. As an alternative, we could offer a non-WYSIWYG markup editor and a preview pane, which is an approach that has worked well for LaTeX. Probably the only well-supported markup syntax that's powerful enough to be used in such an editor is XML. Since we have to support XML import and export anyway, using XML in the editor would also be simpler than introducing some other kind of markup.

However, XML is cumbersome to write by hand. There are several JavaScript-based XML editor components that make things easier for the user by doing one or both of the following:

Here are two such components:

One serious problem with editing XML by hand is the difficulty of mixing it with text in right-to-left (RTL) languages such as Arabic, Hebrew, Farsi, and Urdu. Unless the editor has been specifically designed to support RTL text, the standard Unicode bidirectional text (bidi) algorithm displays punctuation marks in XML (such as < and >) in the wrong places, making it difficult to see where the markup actually is. Here are some descriptions of this nightmare:

The commercial XML editor Oxygen seems to deal with this problem quite well, both in text mode, where you can see the XML syntax, and in author mode, which is a tree view that hides the XML syntax (see their video demo). For Knora users working with RTL text, buying an Oxygen licence would be one option. However, this would be rather cumbersome, because text would have to be exported from Knora into Oxygen and back again, if only by copying and pasting.

In theory, it should be possible for a JavaScript-based XML editor to deal with RTL text as well as Oxygen does, using the improved bidi support introduced in HTML5 and in Unicode 6.3. In practice, there seems to be no existing JavaScript XML editor that does this.

Combining transcriptions of different page regions into a critical text

This could be done by copying the transcription of each region into the critical text, and marking the copied text with a standoff tag containing a link to the original region. However, the question is how to handle a merge when the original transcription is updated. Probably the simplest approach would be to open the relevant part of the critical text in a separate editor window, proceed as described above, then copy the merged region text back into the critical text.

benjamingeer commented 8 years ago

Something to cite when we write a paper about this:

Marilyn Deegan and Kathryn Sutherland, Transferred Illusions: Digital Technology and the Forms of Print. Routledge, 2016.

p. 87:

[Image: transferred-illusions-87]

They go on to cite two projects, ARCHway and JITM, that could be useful points of comparison:

[Images: archway, jitm]

They also cite an article with the great title Embedded Markup Considered Harmful:

SGML advocates I have talked to appear to have the belief that everything is either sequential and hierarchical, or can be represented that way. What is not expressible sequentially and hierarchically is deemed to be nonexistent, inconceivable, evil, or mistaken....

You can always force structures into other structures and claim that they're undamaged; another way to say this is that if you smash things up it is easier to make them fit. Enforcing sequence and hierarchy simply restricts the possibilities.

Like a TV dinner, embedded markup nominally contains everything you could want. "What else could you possibly want?" means "It's not on the menu."

benjamingeer commented 8 years ago

What we're doing here is a bit like Just-in-Time Markup. There's a demo of JITM online but it hasn't been updated since 2004 and I can't really tell if it's working.

benjamingeer commented 8 years ago

I did a mockup of how to use HTML to render XML markup with RTL text, and submitted an issue to CodeMirror explaining the problem and the suggested solution.

benjamingeer commented 8 years ago

Why an XML editor isn't a good solution

Anne-Sophie Bories is working with XML files in which every syllable is marked up. The text is thus completely unreadable. I can understand that people do this because there's no better alternative, but it's still awful. It would be much better to have a WYSIWYG editor, even if it was a standalone application.

I'm going to make some mockups and see what's feasible in different programming languages.

benjamingeer commented 8 years ago

Notes on discussion with @tobiasschweizer:

benjamingeer commented 8 years ago

After discussion with @lrosenth and @tobiasschweizer:

To limit the number of standoff resources created while a user is editing a text, we need to offer an editor that stores text with standoff in an intermediate format, e.g. as XML. The user would then be able to 'publish' a text as RDF when ready.

To deal with this issue:

If someone has annotated a standoff resource by linking to its IRI, we would have to worry about the cardinality constraints on the annotation. We could eliminate this problem if we allowed annotations on standoff only using UUIDs, which are not involved in constraint checking.

We can make two subclasses of Resource, called LinkableResource and NonLinkableResource. The property hasLinkTo will only be allowed to point to a LinkableResource.

benjamingeer commented 8 years ago

What to do next:

benjamingeer commented 8 years ago

Much of our discussion today was about how to reduce the storage and performance overhead of RDF/standoff. Here are some ideas on how to make standoff resources more lightweight:

Then I think you could make a minimal standoff resource with just five predicates:

  1. rdf:type
  2. isTagOnText
  3. startHasOffset
  4. endHasOffset
  5. isDeleted

This would require more standoff-specific Scala code (to copy permissions and to read and write the special standoff properties), but probably not much more. What do you think, @lrosenth?

benjamingeer commented 8 years ago

Standoff resources as described above are a special kind of resource:

Let's make this explicit by creating a subclass of Resource called DependentResource that has these characteristics. Knora can refuse to violate these rules when it's dealing with a DependentResource. Then StandoffTag can be a subclass of DependentResource.

benjamingeer commented 8 years ago

Use case: commenting on an article

Someone uses Knora to write an academic paper. A lot of people want to highlight parts of it and comment on them. There's no reason for them to copy the text of the article into a new TextValue, because they're not going to change it. How can they mark it up with their own standoff, while still keeping the principle that standoff resources are dependent on a TextValue?

To solve this, we could allow a derived TextValue to use the string of the base TextValue by reference rather than by copying it. If you create a new TextValue that's derived from an existing TextValue (even in another resource), and its valueHasString is identical, we can avoid storing the string redundantly:

<http://data.knora.org/1/values/1> rdf:type knora-base:TextValue ;
    knora-base:valueHasString "This is the original text" .

<http://data.knora.org/2/values/1> rdf:type knora-base:TextValue ;
    knora-base:isDerivedFrom <http://data.knora.org/1/values/1> ;
    knora-base:valueSharesString true .

When the user creates or edits <http://data.knora.org/2/values/1>, we can check whether they modified the original string. If so, we store the modified copy. If not, we use valueSharesString true.

This allows the derived TextValue to have its own dependent standoff.
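The check described above could be sketched like this, with a plain dictionary standing in for the triplestore. The function names are hypothetical; the keys mirror the properties in the example.

```python
def effective_string(values, iri):
    """Return the string of a TextValue, following isDerivedFrom if it is shared."""
    value = values[iri]
    if value.get("valueSharesString"):
        return effective_string(values, value["isDerivedFrom"])
    return value["valueHasString"]


def save_derived(values, iri, base_iri, string):
    """Store a derived TextValue, copying the string only if it was modified."""
    derived = {"isDerivedFrom": base_iri}
    if string == effective_string(values, base_iri):
        derived["valueSharesString"] = True
    else:
        derived["valueHasString"] = string
    values[iri] = derived


values = {
    "http://data.knora.org/1/values/1": {"valueHasString": "This is the original text"},
}
save_derived(
    values,
    "http://data.knora.org/2/values/1",
    "http://data.knora.org/1/values/1",
    "This is the original text",
)
print(values["http://data.knora.org/2/values/1"])
print(effective_string(values, "http://data.knora.org/2/values/1"))
```

One consequence of sharing by reference is that the base string must stay available for as long as a derived value points to it.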

benjamingeer commented 8 years ago

Use cases: querying standoff tags

Actually, standoff tags shouldn't be resources

For one simple reason: if a text has 1000 standoff tags, querying the entire contents of 1000 resources will be too slow.

benjamingeer commented 8 years ago

Result of conversation with @lrosenth:

A standoff tag will be something more like a Value than a Resource. We will make a set of more abstract data types, like DateData, ColorData, etc. Then DateValue will be a subclass of Value and of DateData, and StandoffDateTag will be a subclass of StandoffTag and of DateData. Then it will be possible to search for a date regardless of whether it's a standoff tag or a value.
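The class relationships can be mimicked with multiple inheritance, as in the sketch below. This only models the idea; in Knora the classes would be defined in the ontology, and the Python classes and `find_dates` are stand-ins.

```python
from dataclasses import dataclass


class DateData:
    """Abstract data type: anything that carries a date."""


class Value:
    """Stand-in for the Value class."""


class StandoffTag:
    """Stand-in for the StandoffTag class."""


@dataclass
class DateValue(Value, DateData):
    date: str


@dataclass
class StandoffDateTag(StandoffTag, DateData):
    date: str
    start: int
    end: int


def find_dates(items, date):
    """Find a date whether it is stored as a value or as a standoff tag."""
    return [x for x in items if isinstance(x, DateData) and x.date == date]


items = [
    DateValue("1905-06-30"),
    StandoffDateTag("1905-06-30", 10, 20),
    StandoffDateTag("1915-11-25", 0, 4),
]
print(find_dates(items, "1905-06-30"))
```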

Instead of having one standoff tag class for rich-text markup, we'll have different classes for bold, italic, strikethrough, etc., to simplify queries.

Each project will be able to define its own subclasses of StandoffTag.

XML attributes will be represented as predicates and objects on standoff tags. The user who uploads an XML file will have to provide a mapping between (XML namespace, attribute name) and the corresponding RDF property.

So we will keep the basic structure we have now: a standoff tag will be an object attached to a TextValue. For now, it will not have permissions, owner, project, or versioning, but we could add these later if necessary.

tobiasschweizer commented 8 years ago

We should decide how to represent line breaks. I thought it would be a good idea to represent them inside the text string itself, so that words on different lines do not get attached to one another.

benjamingeer commented 8 years ago

To avoid the carriage return vs. newline problem, I suggest using the Unicode character INFORMATION SEPARATOR TWO (U+001E), which is available as the constant org.knora.webapi.util.FormatConstants.INFORMATION_SEPARATOR_TWO. Historically it's a record separator, for separating things like rows in a table.
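As a small illustration, a client could normalise line breaks before sending text, along these lines. The Python constant and both function names are stand-ins for this sketch; they are not part of Knora.

```python
# Python stand-in for the Scala constant mentioned above.
INFORMATION_SEPARATOR_TWO = "\u001e"


def encode_line_breaks(text: str) -> str:
    """Replace CRLF, CR and LF line breaks with INFORMATION SEPARATOR TWO."""
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalised.replace("\n", INFORMATION_SEPARATOR_TWO)


def decode_line_breaks(text: str) -> str:
    """Turn the separators back into newlines for display."""
    return text.replace(INFORMATION_SEPARATOR_TWO, "\n")


stored = encode_line_breaks("first line\r\nsecond line\nthird line")
print(repr(stored))
print(decode_line_breaks(stored))
```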

tobiasschweizer commented 8 years ago

Yes, you are right! I remember the problems we had because these characters got converted.

I will have to check if this is easy to do in the GUI, so the text would already be correct.

tobiasschweizer commented 8 years ago

@benjamingeer

This is a first attempt:

###  http://www.knora.org/ontology/knora-base#StandoffTag

:StandoffTag rdf:type owl:Class ;

          rdfs:subClassOf [ rdf:type owl:Restriction ;
                            owl:onProperty :standoffHasStart ;
                            owl:cardinality "1"^^xsd:nonNegativeInteger
                          ] ,
                          [ rdf:type owl:Restriction ;
                            owl:onProperty :standoffHasEnd ;
                            owl:cardinality "1"^^xsd:nonNegativeInteger
                          ] ;

          rdfs:comment "Represents a standoff markup tag in a TextValue"@en .

###  http://www.knora.org/ontology/knora-base#StandoffHrefTag

:StandoffHrefTag rdf:type owl:Class ;

              rdfs:subClassOf :StandoffTag ,
                              [ rdf:type owl:Restriction ;
                                owl:onProperty :standoffHasHref ;
                                owl:cardinality "1"^^xsd:nonNegativeInteger
                              ] ;

              rdfs:comment "Represents a hyperlink in a TextValue"@en .

###  http://www.knora.org/ontology/knora-base#StandoffLinkTag

:StandoffLinkTag rdf:type owl:Class ;

              rdfs:subClassOf :StandoffTag ,
                              [ rdf:type owl:Restriction ;
                                owl:onProperty :standoffHasLink ;
                                owl:cardinality "1"^^xsd:nonNegativeInteger
                              ] ;

              rdfs:comment "Represents a reference to a Knora resource in a TextValue"@en .

###  http://www.knora.org/ontology/knora-base#StandoffVisualTag

:StandoffVisualTag rdf:type owl:Class ;

               rdfs:subClassOf :StandoffTag ;

               rdfs:comment "Represents markup information needed to render a TextValue"@en .

###  http://www.knora.org/ontology/knora-base#StandoffParagraphTag

:StandoffParagraphTag rdf:type owl:Class ;

                rdfs:subClassOf :StandoffVisualTag ;

                rdfs:comment "Represents a paragraph in a TextValue"@en .

###  http://www.knora.org/ontology/knora-base#StandoffDataTypeTag

:StandoffDataTypeTag rdf:type owl:Class ;

                 rdfs:subClassOf :StandoffTag ;

                 rdfs:comment "Represents a value type in a TextValue"@en .

###  http://www.knora.org/ontology/knora-base#StandoffDateValueTag

:StandoffDateValueTag rdf:type owl:Class ;

                        rdfs:subClassOf :StandoffDataTypeTag,
                                        :DateBase ;

                        rdfs:comment "Represents a DateValue in a TextValue"@en .

So now when we query the class StandoffDateValueTag we get both the properties for StandoffDataTypeTag (which inherits from StandoffTag) and DateBase representing a date value type.

Basically, we have two categories: VisualTag and DataTypeTag. Should there be a third for linking tags, so that they would not be direct subclasses of StandoffTag?

And would it make sense to call StandoffVisualTag something else, since a paragraph is also logical information? What about TextualMarkupTag? That would be more neutral, although it would imply that a DataTypeTag is not markup ...

benjamingeer commented 8 years ago

Future steps:

benjamingeer commented 7 years ago

Closing this because the core features have been implemented. Let's use them for a little while, then revisit this for the next iteration of design and implementation.