Redesign Standoff - Githubissues

benjamingeer commented 8 years ago

We've been talking about redesigning how we store standoff in RDF, to make it flexible enough to meet the needs that different projects have described to us.

Requirements

Users can create different standoff perspectives on the same text.
Users can create their own types of standoff nodes with project-specific semantics and custom properties.
Users can annotate standoff nodes.
When a new version of a text is made, any annotations on standoff nodes will also point to the equivalent nodes in the new version, assuming those nodes still exist.
Users can convert an XML document to standoff and back, and get back an equivalent XML document.

Design

This could be done by representing standoff nodes as Knora resources of class StandoffTag rather than as blank RDF nodes.

The TextValue class would not point directly to standoff resources, because this would not make sense if there were multiple standoff perspectives on the same text. However, the TextValue could optionally point to a default StandoffPerspective. When the API received a text value with standoff, it would generate a StandoffPerspective and standoff resources, and create a TextValue with the property hasDefaultStandoffPerspective pointing to the new standoff perspective. In API version 1, when the text value was returned in response to an API GET request, it could be returned along with its default standoff perspective, to match the current API behaviour.

Projects would be able to define their own subclasses of StandoffTag with specific semantics. We would provide a new API route for converting an XML document into text with standoff markup. The text would be uploaded along with parameters specifying how to map XML tags (or combinations of tags and attributes) to standoff resource classes.

Currently, we maintain direct links between a resource containing text and resources that the text's standoff markup refers to. This is useful because these links are easily shown as incoming links to the target resource, and are easily found in searches. This is possible because the resource contains a reference count for each of these links. Maintaining reference counts is possible because we lock each resource when changing its contents. Our resource locks are based on the rule that each API operation locks only one resource at a time, to prevent deadlocks. If standoff is represented by a graph of resources, it would no longer be feasible to maintain these direct standoff links. Instead, we could have direct links from the standoff resources. This could be accomplished by making StandoffLinkTag a subclass of LinkObj. Each StandoffLinkTag would then have a link property pointing to the target resource, as well as a property pointing to the text value itself. As a result, we could eliminate reference counts from LinkValue and simplify some code.

If we decide to go this route, the question is whether we implement it now or wait until after releasing Knora 1.0. I don't think it would be good for a single repository to have both the old kind of standoff and the new kind. We should pick one or the other. If we wait to make the change, we'll have projects containing standoff RDF that will need to be converted to the new design. My feeling is that it would be better to do it now.

@lrosenth

benjamingeer commented 8 years ago

Work in progress is on the wip/standoff branch.

benjamingeer commented 8 years ago

Merging changes from an updated transcription

I've been working on this question: if you make a diplomatic transcription of a manuscript, and then a critical text based on it, what happens when there's a new version of the transcription?

Our original idea was to avoid all redundancy in text storage, by having the transcription and the critical text contain standoff referring to the same Unicode string. However, I now think this isn't a good idea, for two reasons:

If text has been crossed out ("I like ~~spaghetti~~pizza"), the Unicode string of the transcription is likely to contain nonsense ("spaghettipizza") that you don't want in search results, and it won't contain things you actually want to search for ("I like pizza").
To get a readable text out of the Unicode string of the transcription, you would always have to process it first because of deletions and insertions specified in the editorial text, and this would make doing anything else with it more complicated.

Therefore I think it's better if the editorial text has its own Unicode string. To keep the transcription's Unicode string out of full-text searches, we can make it the object of a new property, valueHasNonIndexedString, which would be used instead of valueHasString when the string should not be indexed for full-text search.

The next question is how to maintain the connection between the editorial text and the transcription. The user should be able to click on parts of the editorial text, and be taken to the corresponding parts of the transcription.

I think this can be done with diffs. If you diff the editorial text against the transcription, you get a list of substrings, each of which is marked to indicate whether it's the same in both texts, was deleted from the editorial text, or was added to the editorial text. Each of these substrings can easily be converted into a pair of standoff ranges, one referring to the transcription and one referring to the editorial text. With this, given any position in the editorial text, you can find the corresponding position in the transcription, because there are only two possibilities:

The position is in a substring that's the same in both strings. Take the relative offset of the position from the beginning of that string, and add it to the absolute offset of the beginning of the string in the transcription.
The position is in a substring that was inserted. Use the position in the transcription where it was inserted.
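
The two cases above can be sketched in Python. This is only an illustration: it assumes the standard-library difflib module as a stand-in for whatever diff implementation would actually be used, and the function name map_position is hypothetical.

```python
from difflib import SequenceMatcher


def map_position(editorial: str, transcription: str, pos: int) -> int:
    """Map an offset in the editorial text to an offset in the transcription."""
    matcher = SequenceMatcher(None, editorial, transcription, autojunk=False)
    for tag, e_start, e_end, t_start, _t_end in matcher.get_opcodes():
        if e_start <= pos < e_end:
            if tag == "equal":
                # Same substring in both texts: keep the relative offset.
                return t_start + (pos - e_start)
            # The substring exists only in the editorial text ("delete" or
            # "replace" opcode): use the point in the transcription where
            # it was inserted.
            return t_start
    # pos is at or past the end of the editorial text.
    return len(transcription)


transcription = "I took the bus today because I was was in a hurry."
editorial = "I took the bus because I was in a hurry."
hurry_in_transcription = map_position(
    editorial, transcription, editorial.index("hurry")
)
```

Text that was deleted from the editorial text never contains an editorial position, so it needs no case of its own.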

Now, what happens when someone makes a new version of the transcription? We have three versions to consider: transcription version 1, transcription version 2, and editorial version 1. Here's an example, using XML tags to represent standoff ranges.

Transcription version 1:

<para>I took the bus <strike>today </strike>because I was was in a hurry.</para>

Editorial version 1 keeps the <para>, deletes <strike>today</strike>, and corrects the repeated word was:

<para>I took the bus because I was in a hurry.</para>

Now the transcriber corrects the transcription by changing bus to train, correcting the repeated word (which, it turns out, isn't repeated in the manuscript), and noting a change in ink colour, to produce transcription version 2:

<para>I took the train <strike>today </strike>because I was in a <blue>hurry</blue>.</para>

Now the editor wants to merge the transcriber's changes into the editorial version, to make editorial version 2. This is similar to the three-way merge performed by git and other software version control systems. But git automatically merges changes that it sees as non-conflicting. We can't do that, because any change in a transcription could potentially change the interpretation of everything else. After seeing some particular change introduced in the new version of the transcription, the editor may need to reconsider some of his previous edits. So here's what should happen:

The software should show all the differences between transcription version 2 and editorial version 1, and allow the editor to choose in each case. In other words, it should show the changes that would have to be made to transcription version 2 to turn it into editorial version 1, as if editorial version 1 was actually based on transcription version 2. (Note that the correction to the repeated was is no longer a difference, since the editor's and transcriber's versions now agree on that.)
It should highlight train and <blue> as new changes, and give the editor the option of keeping all the other previous edits from editorial version 1 (namely the deletion of today).

The editor chooses to accept train and <blue> and keep his previous edits. The result is editorial version 2:

<para>I took the train because I was in a <blue>hurry</blue>.</para>

There are two obstacles to implementing this:

When we display the editorial text, we can diff its Unicode string with the Unicode string of the new transcription, and show those differences to the user. But it's much harder to determine which of those differences are the result of new changes in the transcription, and which ones were already there before. (We can't use the command-line program diff3, because it's line-based.) It would be annoying for the user to have to check each difference to figure out which ones are new.
It doesn't seem feasible to diff standoff, so you can't diff the standoff of the editorial text with the standoff of the transcription. So if new standoff was added to the transcription (<blue> in the example above), we can't write an algorithm to determine the position in the editorial text where that new standoff should be merged. Sometimes we could make a reasonable suggestion, but often (especially if text has been moved) we couldn't. So the user has to find the correct position manually.

Instead of solving these two problems algorithmically, I suggest we deal with them by providing a user interface that makes it easier for the user to do the work manually, using two windows (or panes): a read-only window showing the transcription, and an editable window showing the editorial text. In the transcription window, we can show what changed in the transcription, both in the Unicode string and in the standoff. To show the changes in the Unicode string, we just need to diff it with the previous version. To show the changes in the standoff, we give each standoff range a unique ID (e.g. a UUID) when it's created. Then we can easily identify which ones were added and which ones were removed in the transcription. For example, we can tell the user that <blue> was added to the transcription. Then if the user wants to merge <blue> into the editorial text, all he has to do is figure out where to put it.
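
Here is a rough sketch of how comparing by UUID could identify added and removed standoff ranges. The StandoffRange class, the diff_standoff function, and the placeholder UUID strings are all assumptions made for illustration; the offsets correspond to the two transcription versions in the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffRange:
    uuid: str   # assigned once, when the range is first created
    tag: str
    start: int
    end: int


def diff_standoff(old, new):
    """Return (added, removed) ranges, comparing by UUID only."""
    old_ids = {r.uuid for r in old}
    new_ids = {r.uuid for r in new}
    added = [r for r in new if r.uuid not in old_ids]
    removed = [r for r in old if r.uuid not in new_ids]
    return added, removed


# Placeholder UUIDs; real ones would be generated, e.g. with uuid.uuid4().
transcription_v1 = [
    StandoffRange("uuid-para", "para", 0, 50),
    StandoffRange("uuid-strike", "strike", 15, 21),
]
transcription_v2 = [
    StandoffRange("uuid-para", "para", 0, 48),
    StandoffRange("uuid-strike", "strike", 17, 23),
    StandoffRange("uuid-blue", "blue", 42, 47),
]
added, removed = diff_standoff(transcription_v1, transcription_v2)
```

Because only UUIDs are compared, a range whose offsets shifted (like strike here) is not reported as a change; only blue shows up as added.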

Here's how this could work:

The editor looks in the transcription window to see what was changed in the transcription. He clicks on Go to next change and sees that bus was changed to train. He looks for the corresponding change in the editorial text window (he has to do this visually, because we can't automate it), clicks on it to open a context menu, and selects Accept current transcription variant. The GUI replaces bus with train.

In the transcription window, he clicks on Go to next change again and sees that the tag <blue> was added. He drags and drops the tag from the transcription window into the editorial window, onto the word hurry (he has to do this manually because we can't automate it). This adds the standoff tag <blue> to the editorial text.

Now he's finished. He has one diff left in the editorial text, the deletion of today, but he wants to keep that. So he saves the editorial text, creating editorial version 2, which is stored with its standoff and a link to transcription 2. The diffs between editorial version 2 and transcription 2 can be generated again any time they're needed.

benjamingeer commented 8 years ago

Keeping annotations when a text is updated

Suppose a user makes a text with standoff, then adds Annotation resources that annotate the standoff resources. Then the user creates a new version of the text, with a new set of standoff resources. Some of the standoff is the same as in the old version. Can the existing annotations be made to point to the standoff resources that haven't changed?

Here's my suggestion: If every standoff resource gets a UUID when it's created, as described above, then we can make a new kind of annotation property, which points not to an IRI, but to a UUID. When a user edits standoff, we can keep the same UUIDs for any standoff tags that the user doesn't delete, and make new standoff resources with the same UUIDs. So any annotations pointing to those UUIDs will now point to standoff resources referring to the new version of the text (as well as to standoff resources referring to the old version).
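
To illustrate the idea, here is a small sketch in which an annotation stores a UUID and is resolved against every version of the text. The data layout, the example.org IRIs, and the resolve function are hypothetical and not part of Knora.

```python
# Hypothetical data: for each text version, the standoff tags it contains,
# as (UUID, standoff IRI) pairs. The IRIs are invented for illustration.
versions = {
    "v1": [
        ("uuid-a", "http://example.org/text/v1/standoff/0"),
        ("uuid-b", "http://example.org/text/v1/standoff/1"),
    ],
    "v2": [
        # uuid-a was kept by the user; uuid-b was deleted.
        ("uuid-a", "http://example.org/text/v2/standoff/0"),
    ],
}


def resolve(target_uuid, versions):
    """Return {version: standoff IRI} for every version containing the UUID."""
    return {
        version: iri
        for version, tags in versions.items()
        for tag_uuid, iri in tags
        if tag_uuid == target_uuid
    }


kept = resolve("uuid-a", versions)      # found in both versions
deleted = resolve("uuid-b", versions)   # found only in the old version
```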

benjamingeer commented 8 years ago

Converting XML to standoff and back

There's a prototype of this on the wip/standoff branch. Each XML tag or text node becomes a standoff range. If the XML tag has attributes, we store those with the standoff range. Then it's easy to convert back to XML and get an equivalent document.
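
As a rough illustration of the XML-to-standoff direction (not the prototype's actual code, which lives on the branch), here is a Python sketch using the standard-library xml.etree.ElementTree. It is simplified: it creates ranges for elements only, not for text nodes, and the function name and dictionary keys are assumptions.

```python
import xml.etree.ElementTree as ET


def xml_to_standoff(xml: str):
    """Return (text, ranges); each range has tag, attributes, start, end."""
    parts, ranges = [], []
    pos = 0

    def emit(text):
        nonlocal pos
        if text:
            parts.append(text)
            pos += len(text)

    def visit(elem):
        entry = {"tag": elem.tag, "attributes": dict(elem.attrib), "start": pos}
        ranges.append(entry)  # appended before the children: document order
        emit(elem.text)
        for child in elem:
            visit(child)
            emit(child.tail)  # text between this child and its next sibling
        entry["end"] = pos

    visit(ET.fromstring(xml))
    return "".join(parts), ranges


text, ranges = xml_to_standoff(
    "<para>I took the bus <strike>today </strike>"
    "because I was was in a hurry.</para>"
)
```

Since each range keeps its tag name, attributes, and offsets, the tags can be re-inserted into the string at those offsets to rebuild an equivalent document.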

To support custom standoff resource classes, the user will need to supply a mapping between XML names and RDF IRIs, specifically:

between namespaced XML tag names and standoff resource classes
between namespaced XML attribute names and RDF properties

One option would be to upload this mapping with the XML document to be converted. Another option would be to construct the mapping as a Knora resource, so it could be reused.

Standoff UUIDs in XML

An uploaded XML document could provide its own UUIDs in tags, which would become the UUIDs of the standoff nodes created for those tags. Then when the user uploaded a new version of the XML document, any annotations pointing to those UUIDs would also refer to the new version of the text.

Links to existing Knora resources

For example, in a document in which persons are tagged with unique IDs, each <person> tag could refer to a Person resource in Knora. The XML could say something like:

<person person-id="http://data.knora.org/12345">Albert Einstein</person>

The user-supplied XML-to-RDF mapping would specify that the person-id attribute of the person tag should be converted into a Knora link to the specified resource.
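
Such a mapping could look something like the following sketch. The XML namespace, the class and property IRIs, and the function names are all invented for illustration; only the person tag and its person-id attribute come from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping tables, keyed by (XML namespace, local name).
TAG_TO_CLASS = {
    ("http://example.org/ns", "person"):
        "http://example.org/ontology#StandoffPersonTag",
}
ATTRIBUTE_TO_PROPERTY = {
    # Unprefixed XML attributes are in no namespace.
    ("", "person-id"): "http://example.org/ontology#standoffHasLink",
}


def split_name(qname: str):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if qname.startswith("{"):
        namespace, local = qname[1:].split("}", 1)
        return namespace, local
    return "", qname


def map_element(elem):
    """Return (standoff class IRI, {property IRI: attribute value})."""
    standoff_class = TAG_TO_CLASS[split_name(elem.tag)]
    properties = {
        ATTRIBUTE_TO_PROPERTY[split_name(name)]: value
        for name, value in elem.attrib.items()
    }
    return standoff_class, properties


person = ET.fromstring(
    '<person xmlns="http://example.org/ns" '
    'person-id="http://data.knora.org/12345">Albert Einstein</person>'
)
standoff_class, properties = map_element(person)
```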

Hierarchical lists

An XML document could refer to existing hierarchical list nodes in Knora. For example:

<poem type="free-verse">Blah blah blah</poem>

The user-supplied XML-to-RDF mapping would specify that the type attribute of the poem tag should be converted into a ListValue, and that free-verse should be converted to the IRI of a particular ListNode.

benjamingeer commented 8 years ago

Converting non-hierarchical standoff to and from XML

CLIX markup

The implementation on the wip/standoff branch now supports CLIX milestones, which are pairs of empty elements that look like this:

<q who="paris" sID="foo"/>...<q eID="foo"/>

Compared to other XML representations of non-hierarchical markup, CLIX has the advantages of being lightweight, unambiguous, and easy to process (any pair of empty tags with sID and eID attributes can be treated as CLIX tags, and no other information is needed to convert them to standoff). CLIX is recognised in the TEI/XML guidelines as an extension of the TEI, as long as it uses a distinct, non-TEI namespace.

The implementation on wip/standoff can convert an XML document containing a mixture of hierarchical elements and CLIX markup to standoff, and convert the standoff back to the original XML document.
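
To show why CLIX is easy to process, here is a minimal Python sketch (not the code on the branch) that pairs sID and eID milestones into standoff ranges. For brevity it ignores the hierarchical elements and keeps only their text; the function name and dictionary keys are assumptions.

```python
import xml.etree.ElementTree as ET


def clix_to_standoff(xml: str):
    """Return (text, ranges) for the CLIX milestone pairs in the document."""
    parts, ranges, open_ranges = [], [], {}
    pos = 0

    def emit(text):
        nonlocal pos
        if text:
            parts.append(text)
            pos += len(text)

    def visit(elem):
        start_id, end_id = elem.get("sID"), elem.get("eID")
        if start_id is not None:
            # Start milestone: remember where the range begins.
            attributes = {k: v for k, v in elem.attrib.items() if k != "sID"}
            open_ranges[start_id] = {
                "tag": elem.tag, "attributes": attributes, "start": pos,
            }
        elif end_id is not None:
            # End milestone: close the range opened with the same ID.
            entry = open_ranges.pop(end_id)
            entry["end"] = pos
            ranges.append(entry)
        emit(elem.text)
        for child in elem:
            visit(child)
            emit(child.tail)

    visit(ET.fromstring(xml))
    return "".join(parts), ranges


clix_text, clix_ranges = clix_to_standoff(
    '<p><l>First <q who="paris" sID="foo"/>line</l>'
    '<l>second<q eID="foo"/> line</l></p>'
)
```

In this input the q range starts inside the first l element and ends inside the second, which ordinary nested XML elements could not express.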

Other approaches

We will also have to support other XML representations of non-hierarchical markup. In one current use case, poetry is marked up in TEI/XML using empty <milestone> elements to indicate syllable boundaries. Each syllable is preceded by one of three milestone tags:

<milestone type="/"/>
<milestone type="\"/>
<milestone type="="/>

Each milestone is implicitly terminated by the next milestone or by another closing tag (e.g. </div>, </l>, or </p>), whichever comes first.

TODO: Figure out what information we need from the user so we can process milestones like these.

Editing Standoff

After some discussion, @tobiasschweizer and I came to the conclusion that it would probably be too difficult (given the resources we have available) to implement a WYSIWYG text editor in JavaScript that supported overlapping markup. As an alternative, we could offer a non-WYSIWYG markup editor and a preview pane, which is an approach that has worked well for LaTeX. Probably the only well-supported markup syntax that's powerful enough to be used in such an editor is XML. Since we have to support XML import and export anyway, using XML in the editor would also be simpler than introducing some other kind of markup.

However, XML is cumbersome to write by hand. There are several JavaScript-based XML editor components that make things easier for the user by doing one or both of the following:

Using an XML schema to suggest elements according to context.
Hiding the XML syntax behind a more user-friendly tree view.

Here are two such components:

One serious problem with editing XML by hand is the difficulty of mixing it with text in right-to-left (RTL) languages such as Arabic, Hebrew, Farsi, and Urdu. Unless the editor has been specifically designed to support RTL text, the standard Unicode bidirectional text (bidi) algorithm displays punctuation marks in XML (such as < and >) in the wrong places, making it difficult to see where the markup actually is. Here are some descriptions of this nightmare:

The commercial XML editor Oxygen seems to deal with this problem quite well, both in text mode, where you can see the XML syntax, and in author mode, which is a tree view that hides the XML syntax (see their video demo). For Knora users working with RTL text, buying an Oxygen licence would be one option. However, this would be rather cumbersome, because text would have to be exported from Knora into Oxygen and back again, if only by copying and pasting.

In theory, it should be possible for a JavaScript-based XML editor to deal with RTL text as well as Oxygen does, using the improved bidi support introduced in HTML5 and in Unicode 6.3. In practice, there seems to be no existing JavaScript XML editor that does this.

Xonomy prevents the browser from joining Arabic characters together, by putting each character in a separate <span> element, thus rendering the text illegible.
CodeMirror allows punctuation to get mixed up as described above.

Combining transcriptions of different page regions into a critical text

This could be done by copying the transcription of each region into the critical text, and marking the copied text with a standoff tag containing a link to the original region. However, the question is how to handle a merge when the original transcription is updated. Probably the simplest approach would be to open the relevant part of the critical text in a separate editor window, proceed as described above, then copy the merged region text back into the critical text.

benjamingeer commented 8 years ago

Something to cite when we write a paper about this:

Marilyn Deegan and Kathryn Sutherland, Transferred Illusions: Digital Technology and the Forms of Print. Routledge, 2016.

p. 87:

[Image: transferred-illusions-87]

They go on to cite two projects, ARCHway and JITM, that could be useful points of comparison:

[Images: archway, jitm]

They also cite an article with the great title Embedded Markup Considered Harmful:

SGML advocates I have talked to appear to have the belief that everything is either sequential and hierarchical, or can be represented that way. What is not expressible sequentially and hierarchically is deemed to be nonexistent, inconceivable, evil, or mistaken....

You can always force structures into other structures and claim that they're undamaged; another way to say this is that if you smash things up it is easier to make them fit. Enforcing sequence and hierarchy simply restricts the possibilities.

Like a TV dinner, embedded markup nominally contains everything you could want. "What else could you possibly want?" means "It's not on the menu."

benjamingeer commented 8 years ago

What we're doing here is a bit like Just-in-Time Markup. There's a demo of JITM online but it hasn't been updated since 2004 and I can't really tell if it's working.

benjamingeer commented 8 years ago

I did a mockup of how to use HTML to render XML markup with RTL text, and submitted an issue to CodeMirror explaining the problem and the suggested solution.

benjamingeer commented 8 years ago

Why an XML editor isn't a good solution

Anne-Sophie Bories is working with XML files in which every syllable is marked up. The text is thus completely unreadable. I can understand that people do this because there's no better alternative, but it's still awful. It would be much better to have a WYSIWYG editor, even if it was a standalone application.

I'm going to make some mockups and see what's feasible in different programming languages.

benjamingeer commented 8 years ago

Notes on discussion with @tobiasschweizer:

Given the discussion above, there is no reason for a TextValue to have more than one standoff perspective. It's simpler to copy the text into a new TextValue and mark up the copy, which can be linked back to its source and compared with it.
If resource R1 contains a TextValue with a standoff link to R2, it would be useful to show an incoming link between R1 and R2, which is what API v1 currently does. We can't actually create such a link, for the reasons explained above, but we can probably simulate it when we query incoming links in the resources responder.
The user should be able to combine transcriptions of different regions into a single critical text. Portions of the text that came from different regions can be marked as such using a dedicated standoff tag, so the user can click on them and view the corresponding region transcription. To see what was added or deleted in the critical text, a diff will be sufficient. But we also need a way to indicate that something that was in the transcription, and was kept in the critical text, was interpreted as an addition by the author (see this example). This can be done with a semantic standoff tag called something like addition.

benjamingeer commented 8 years ago

After discussion with @lrosenth and @tobiasschweizer:

To limit the number of standoff resources created while a user is editing a text, we need to offer an editor that stores text with standoff in an intermediate format, e.g. as XML. The user would then be able to 'publish' a text as RDF when ready.

To deal with this issue:

If someone has annotated a standoff resource by linking to its IRI, we would have to worry about the cardinality constraints on the annotation. We could eliminate this problem if we allowed annotations on standoff only using UUIDs, which are not involved in constraint checking.

We can make two subclasses of Resource, called LinkableResource and NonLinkableResource. The property hasLinkTo will only be allowed to point to a LinkableResource.

benjamingeer commented 8 years ago

What to do next:

Finish reworking knora-base as described above.
Support StandoffValueV1 in ValueUtilV1.
Finish supporting StandoffValueV1 in the SPARQL templates.
Make a standoff responder that creates standoff resources on behalf of the resources responder and the values responder. It should accept standoff either in the form used in TextValueV1 or in the form used in StandoffUtil.
Ensure that these responders still lock only one resource at a time.

benjamingeer commented 8 years ago

Much of our discussion today was about how to reduce the storage and performance overhead of RDF/standoff. Here are some ideas on how to make standoff resources more lightweight:

They (and their values) could inherit their owner, project, and permissions (if not specified) from the parent TextValue. That would eliminate a lot of redundant triples in typical cases. There is a use case in which you need to share standoff but not the text that it refers to (because it's copyrighted), but this will be exceptional.
We could eliminate StandoffValue and include its contents directly in the resource as datatype properties.

Then I think you could make a minimal standoff resource with just five predicates:

rdf:type
isTagOnText
startHasOffset
endHasOffset
isDeleted

This would require more standoff-specific Scala code (to copy permissions and to read and write the special standoff properties), but probably not much more. What do you think, @lrosenth?

benjamingeer commented 8 years ago

Standoff resources as described above are a special kind of resource:

They depend on a Value, from which they can inherit owner, project, and permissions.
They're immutable once created.
They're automatically deleted when the value that they belong to is deleted or changed.
You can't link to them except by using their UUID.

Let's make this explicit by creating a subclass of Resource called DependentResource that has these characteristics. Knora can refuse to violate these rules when it's dealing with a DependentResource. Then StandoffTag can be a subclass of DependentResource.

benjamingeer commented 8 years ago

Use case: commenting on an article

Someone uses Knora to write an academic paper. A lot of people want to highlight parts of it and comment on them. There's no reason for them to copy the text of the article into a new TextValue, because they're not going to change it. How can they mark it up with their own standoff, while still keeping the principle that standoff resources are dependent on a TextValue?

To solve this, we could allow a derived TextValue to use the string of the base TextValue by reference rather than by copying it. If you create a new TextValue that's derived from an existing TextValue (even in another resource), and its valueHasString is identical, we can avoid storing the string redundantly:

<http://data.knora.org/1/values/1> rdf:type knora-base:TextValue ;
    knora-base:valueHasString "This is the original text" .

<http://data.knora.org/2/values/1> rdf:type knora-base:TextValue ;
    knora-base:isDerivedFrom <http://data.knora.org/1/values/1> ;
    knora-base:valueSharesString true .

When the user creates or edits <http://data.knora.org/2/values/1>, we can check whether they modified the original string. If so, we store the modified copy. If not, we use valueSharesString true.

This allows the derived TextValue to have its own dependent standoff.
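
The check described above could be sketched like this. The function name and the representation of the result as (predicate, object) pairs are assumptions for illustration; the predicate names are the ones used in the Turtle example.

```python
def derived_text_value_statements(base_iri, base_string, new_string):
    """Return the (predicate, object) pairs to store for a derived TextValue."""
    statements = [
        ("rdf:type", "knora-base:TextValue"),
        ("knora-base:isDerivedFrom", base_iri),
    ]
    if new_string == base_string:
        # Unchanged: refer to the base value's string instead of copying it.
        statements.append(("knora-base:valueSharesString", True))
    else:
        # Modified: the derived value needs its own copy of the string.
        statements.append(("knora-base:valueHasString", new_string))
    return statements


base_iri = "<http://data.knora.org/1/values/1>"
shared = derived_text_value_statements(
    base_iri, "This is the original text", "This is the original text"
)
copied = derived_text_value_statements(
    base_iri, "This is the original text", "This is the edited text"
)
```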

benjamingeer commented 8 years ago

Use cases: querying standoff tags

There are tags marking up poem types, lines, words, and syllables. Find all instances of the word "pizza" within the unbrokenline tags where poemType is mixedprose and the number of syllables within the unbroken line is between 10 and 12.
Find all the texts that mention Jacob Bernoulli (identified by a person tag with a reference to his IRI).
Find all the texts that mention a date between 1490 and 1495 (identified by a date tag).

Actually, standoff tags shouldn't be resources

For one simple reason: if a text has 1000 standoff tags, querying the entire contents of 1000 resources will be too slow.

benjamingeer commented 8 years ago

Result of conversation with @lrosenth:

A standoff tag will be something more like a Value than a Resource. We will make a set of more abstract data types, like DateData, ColorData, etc. Then DateValue will be a subclass of Value and of DateData, and StandoffDateTag will be a subclass of StandoffTag and of DateData. Then it will be possible to search for a date regardless of whether it's a standoff tag or a value.

Instead of having one standoff tag class for rich-text markup, we'll have different classes for bold, italic, strikethrough, etc., to simplify queries.

Each project will be able to define its own subclasses of StandoffTag.

XML attributes will be represented as predicates and objects on standoff tags. The user who uploads an XML file will have to provide a mapping between (XML namespace, attribute name) and the corresponding RDF property.

So we will keep the basic structure we have now: a standoff tag will be an object attached to a TextValue. For now, it will not have permissions, owner, project, or versioning, but we could add these later if necessary.
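
The idea of abstract data types such as DateData can be mirrored with multiple inheritance in Python. This is only an analogy to the planned ontology classes: the single year field and the dates_between function are simplifying assumptions, not the actual design.

```python
class DateData:
    """Abstract date content, shared by values and standoff tags."""
    def __init__(self, year: int):
        self.year = year  # simplification: a real date has more structure


class Value:
    pass


class StandoffTag:
    def __init__(self, start: int, end: int):
        self.start = start
        self.end = end


class DateValue(Value, DateData):
    def __init__(self, year: int):
        DateData.__init__(self, year)


class StandoffDateTag(StandoffTag, DateData):
    def __init__(self, start: int, end: int, year: int):
        StandoffTag.__init__(self, start, end)
        DateData.__init__(self, year)


def dates_between(items, first_year: int, last_year: int):
    """Find dates in a range, whether they are values or standoff tags."""
    return [
        item for item in items
        if isinstance(item, DateData) and first_year <= item.year <= last_year
    ]


items = [DateValue(1492), StandoffDateTag(0, 4, 1494), StandoffDateTag(5, 9, 1600)]
matches = dates_between(items, 1490, 1495)
```

The search only asks whether something is a DateData, so it does not need to know whether the date is stored as a value or as markup inside a text.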

tobiasschweizer commented 8 years ago

We should decide how to represent line breaks. I thought it would be a good idea to represent them inside the text string itself, so that words on different lines do not get attached to one another.

benjamingeer commented 8 years ago

To avoid the carriage return vs. newline problem, I suggest using the Unicode character INFORMATION SEPARATOR TWO, which is available as the constant org.knora.webapi.util.FormatConstants.INFORMATION_SEPARATOR_TWO. Historically it's a record separator, for separating things like rows in a table.
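
For illustration, here is what normalising line breaks to this character could look like in Python. INFORMATION SEPARATOR TWO is U+001E; the Python constant and the function name below are made up for this sketch and are unrelated to the Scala constant mentioned above.

```python
import re

# U+001E, INFORMATION SEPARATOR TWO (historically the record separator).
INFORMATION_SEPARATOR_TWO = "\u001e"


def normalise_line_breaks(text: str) -> str:
    """Replace CRLF, CR, and LF line breaks with INFORMATION SEPARATOR TWO."""
    return re.sub(r"\r\n|\r|\n", INFORMATION_SEPARATOR_TWO, text)


normalised = normalise_line_breaks("first line\r\nsecond line\nthird line\rfourth line")
lines = normalised.split(INFORMATION_SEPARATOR_TWO)
```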

tobiasschweizer commented 8 years ago

Yes, you are right! I remember the problems we had because these characters got converted.

I will have to check if this is easy to do in the GUI, so the text would already be correct.

tobiasschweizer commented 8 years ago

@benjamingeer

This is a first attempt:

###  http://www.knora.org/ontology/knora-base#StandoffTag

:StandoffTag rdf:type owl:Class ;

          rdfs:subClassOf [ rdf:type owl:Restriction ;
                            owl:onProperty :standoffHasStart ;
                            owl:cardinality "1"^^xsd:nonNegativeInteger
                          ] ,
                          [ rdf:type owl:Restriction ;
                            owl:onProperty :standoffHasEnd ;
                            owl:cardinality "1"^^xsd:nonNegativeInteger
                          ] ;

          rdfs:comment "Represents a standoff markup tag in a TextValue"@en .

###  http://www.knora.org/ontology/knora-base#StandoffHrefTag

:StandoffHrefTag rdf:type owl:Class ;

              rdfs:subClassOf :StandoffTag ,
                              [ rdf:type owl:Restriction ;
                                owl:onProperty :standoffHasHref ;
                                owl:cardinality "1"^^xsd:nonNegativeInteger
                              ] ;

              rdfs:comment "Represents a hyperlink in a TextValue"@en .

###  http://www.knora.org/ontology/knora-base#StandoffLinkTag

:StandoffLinkTag rdf:type owl:Class ;

              rdfs:subClassOf :StandoffTag ,
                              [ rdf:type owl:Restriction ;
                                owl:onProperty :standoffHasLink ;
                                owl:cardinality "1"^^xsd:nonNegativeInteger
                              ] ;

              rdfs:comment "Represents a reference to a Knora resource in a TextValue"@en .

###  http://www.knora.org/ontology/knora-base#StandoffVisualTag

:StandoffVisualTag rdf:type owl:Class ;

               rdfs:subClassOf :StandoffTag ;

               rdfs:comment "Represents markup information needed to render a TextValue"@en .

###  http://www.knora.org/ontology/knora-base#StandoffParagraphTag

:StandoffParagraphTag rdf:type owl:Class ;

                rdfs:subClassOf :StandoffVisualTag ;

                rdfs:comment "Represents a paragraph in a TextValue"@en .

###  http://www.knora.org/ontology/knora-base#StandoffDataTypeTag

:StandoffDataTypeTag rdf:type owl:Class ;

                 rdfs:subClassOf :StandoffTag ;

                 rdfs:comment "Represents a value type in a TextValue"@en .

###  http://www.knora.org/ontology/knora-base#StandoffDateValueTag

:StandoffDateValueTag rdf:type owl:Class ;

                        rdfs:subClassOf :StandoffDataTypeTag,
                                        :DateBase ;

                        rdfs:comment "Represents a DateValue in a TextValue"@en .

So now when we query the class StandoffDateValueTag, we get the properties of both StandoffDataTypeTag (which inherits from StandoffTag) and DateBase, which represents a date value type.

Basically, we have two categories: VisualTag and DataTypeTag. Should there be a third for linking tags, so they would not be direct subclasses of StandoffTag?

And would it make sense to call StandoffVisualTag something else, since a paragraph is also logical information? What about TextualMarkupTag? That would be more neutral, although it would imply that a DataTypeTag is not markup ...

benjamingeer commented 8 years ago

Future steps:

Change the case classes used in StandoffUtil so they store RDF class and property names rather than namespaced XML names.
Change the SPARQL for querying and updating standoff so it uses those case classes.
In StandoffUtil, implement a way of specifying a mapping between namespaced XML tags and attribute names and standoff RDF class and property names.
Allow a TextValue to share its string with another TextValue.
Add a route that accepts XML and a mapping, and creates a TextValue with standoff.
Add a route that accepts a mapping and a TextValue IRI, and exports the TextValue as XML.
For non-hierarchical markup, in addition to CLIX, support XML milestones that can be terminated by a list of start or end tags specified in the mapping (e.g. syllable milestones that can be terminated by anything except the start or end of a word).

benjamingeer commented 7 years ago

Closing this because the core features have been implemented. Let's use them for a little while, then revisit this for the next iteration of design and implementation.

dasch-swiss / dsp-api

Redesign Standoff #101
