CFI in BFF - Githubissues

dauwhe commented 8 years ago

Mailing list thread:

https://groups.google.com/forum/#!searchin/epub-working-group/Does$20BFF$20deprecate$20CFI$20use$3F/epub-working-group/HqY8EYguiBw/XNA3PSydBQAJ

dauwhe commented 8 years ago

Brady Duga wrote (in the original thread):

Something that was briefly mentioned in another thread got me thinking about CFI in a BFF world. It looks like BFF is not concerning itself with CFI, which seems fine since we are removing them from content documents anyway (I think that was our decision). However, it seems like the structural changes to BFF would invalidate CFI into BFF-based publications. What happens if a reading system uses CFI to locate annotations, but they receive a publication based on BFF? Are we saying there is a class of epubs that CFI won't work for? And if that is true, aren't we then discarding the utility of CFI? That is, the ability to point into arbitrary locations of an epub outside of your control.

dauwhe commented 8 years ago

Hadrien Gardeur wrote (in the original thread):

For BFF we need something more Web-centric than CFI and less dependent on the spine order too.

BFF means that the publication is pretty much alive and can be updated very often, so using the content document's position in the spine is not exactly the most stable choice. Also, since each content document has its own URI, there's no reason not to reference the content document directly.

What we're missing is a good way to reference a specific position and/or range in content documents, but that's a bigger problem than just BFF.

dauwhe commented 8 years ago

Matt Garrish wrote (in the original thread):

We have an open question about compatibility of CFIs and HTML within the package, too:

https://github.com/IDPF/epub-revision/issues/662#issuecomment-176843199

dauwhe commented 8 years ago

yu.khramov wrote (in the original thread):

I still think that the exclusion of CFI is a mistake. They are crucial for social reading and the use of EPUB exactly in the browsers. We, at ActiveTextbook, used to have our own home-brewed "CFI-like" format, and the support for CFI in Readium was a very good thing for us.

Could we at least avoid deprecating CFI before we get something instead?

dauwhe commented 8 years ago

Garth Conboy wrote (in the original thread):

It seems to me there are a couple of JSON variants being proposed that are basically encoding the OPF (where I'd say the heart of EPUB resides) -- one of them has (renamed) structures for both the manifest and spine (clearly nice for round-tripping) -- if that direction were chosen, it seems existing CFIs could be used to index through the JSON-encoding of the ...

Perhaps.

Best, Garth

dauwhe commented 8 years ago

Hadrien Gardeur wrote (in the original thread):

I'd say that the inclusion of CFI was a mistake and that we should've worked on something more aligned with the Web from the start, but that's a different story...

For CFI with BFF, I'm clearly opposed to the idea of going through the spine, but for the fragment identifier that's another story. Even though it would be better to design something together with the W3C, we could discuss using the same syntax for that part.

dauwhe commented 8 years ago

Garth Conboy wrote (in the original thread):

Hi Hadrien,

I don't quite get "opposed to the idea of going through the spine, but for the fragment identifier that's another story."

If the BFF spine is the OPF spine, what's the harm in not breaking CFIs? Seems we might want to avoid a fair amount of collateral damage... It almost seems, regardless of approach, given the round-tripping requirement, there would be a mapping that could retain the requisite indexing.

Though, might be better for discussion over wine in Bordeaux! :-)

Best, Garth

dauwhe commented 8 years ago

Matt Garrish wrote (in the original thread):

I don't think they're even deprecated at this point. You can still insert them in a document; there's just no requirement that reading systems support linking between documents in a publication.

And we haven't removed CFIs from epub entirely, only dropped the required support for CFIs as a linking mechanism between content documents. There's nothing stopping anyone from using them for any purpose within a reading system (they're still an important part of the annotations spec, for example). If you wanted to translate all the tags to use CFIs, that's perfectly okay.

Matt

dauwhe commented 8 years ago

Hadrien Gardeur wrote (in the original thread):

> If the BFF spine is the OPF spine, what's the harm in not breaking CFIs? Seems we might want to avoid a fair amount of collateral damage... It almost seems, regardless of approach, given the round-tripping requirement, there would be a mapping that could retain the requisite indexing.

There are essentially three parts in a CFI:

- a media fragment attribute, which by the way doesn't use the same syntax as most other fragment identifiers (epubcfi(...) instead of epubcfi=...)
- a location (goes through the spine)
- a fragment in the XML tree of the document
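
To make the three parts concrete, here is a minimal Python sketch that splits a CFI fragment into the package-level path (the part that goes through the spine) and the path inside the content document. The function name `split_cfi` and the dictionary keys are hypothetical, the sample CFI follows the pattern used in the EPUB CFI specification, and ranges, multiple indirection steps, and escaped characters are deliberately ignored.

```python
import re

def split_cfi(fragment: str) -> dict:
    """Split 'epubcfi(<package path>!<document path>)' into its parts.

    Simplification (assumption): exactly one '!' indirection, no ranges,
    and no escaped '!' inside bracketed assertions.
    """
    match = re.fullmatch(r"epubcfi\((.+)\)", fragment)
    if match is None:
        raise ValueError("not an epubcfi(...) fragment")
    package_path, separator, document_path = match.group(1).partition("!")
    if not separator:
        raise ValueError("no '!' indirection step found")
    return {
        "scheme": "epubcfi",                 # the fragment identifier scheme
        "spine_location": package_path,      # goes through the spine
        "document_fragment": document_path,  # path in the document's XML tree
    }

parts = split_cfi("epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)")
print(parts["spine_location"])     # /6/4[chap01ref]
print(parts["document_fragment"])  # /4[body01]/10[para05]/3:10
```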

I see multiple problems with that approach:

- the EPUB CFI media fragment is used over a URI pointing to an EPUB container (http://location.epub#epubcfi(...)) and then has to dedicate part of the media fragment value to indicate precisely which content document is meant. On the Web, where every content document has its own URI, this indirection through the manifest seems highly unnecessary. It's actually even worse than that: since such publications are more likely to be frequently updated, using the order in the spine means that my fragment is much more likely to become useless over time, since I won't even be pointing to the right content document.
- I should be able to share the same annotation no matter where it was initially created. If a resource is included in publication A and publication B, the location and fragment identifier used in the annotation should be the same, no matter if I actually created the annotation in my browser while viewing the resource alone, or in a more dedicated reading environment while reading publication A or B.
- I haven't explored in detail the implications of using HTML instead of XHTML for the fragment itself, but this could also be problematic with CFI
- CFI alone won't be enough in an environment where content documents are much more likely to change; we'll always need to combine robust mechanisms with fuzzy ones
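
The spine-order concern can be illustrated with a small Python sketch. It assumes the usual CFI convention that element steps are even numbers (so the `4` in `/6/4` selects the second spine item); the file names and the inserted preface are hypothetical.

```python
def spine_index(step: int) -> int:
    """Map an even CFI step such as the 4 in /6/4 to a 0-based spine index."""
    if step < 2 or step % 2:
        raise ValueError("element steps are even and start at 2")
    return step // 2 - 1

spine_v1 = ["cover.xhtml", "chapter1.xhtml", "chapter2.xhtml"]
# Hypothetical update: the publisher inserts a preface after the cover.
spine_v2 = ["cover.xhtml", "preface.xhtml", "chapter1.xhtml", "chapter2.xhtml"]

step = 4  # taken from a CFI that begins with /6/4!...
print(spine_v1[spine_index(step)])  # chapter1.xhtml
print(spine_v2[spine_index(step)])  # preface.xhtml
```

The same CFI now resolves to a different content document, whereas a reference by the document's own URI would be unaffected by the reordering.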

While a CFI might work given that we still have the equivalent of a spine (well that actually depends on our other discussion about linearity in BFF), it's really not adapted to our new environment.

There are plenty of organizations that have tackled this problem in the last few years (New York Times or Hypothes.is for example) and this is typically the kind of work where our scope shouldn't be limited to the IDPF.

> Though, might be better for discussion over wine in Bordeaux! :-)

There's plenty to discuss in Bordeaux, we'll have to be careful that we don't have each and every one of them while sipping Bordeaux, otherwise we might end up with some rather "creative" solutions ;-)

Hadrien

dauwhe commented 8 years ago

Daniel Weck wrote (in the original thread):

> a media fragment attribute, which by the way doesn't use the same syntax as most other fragment identifiers (epubcfi(...) instead of epubcfi=...)

Inspired by our beloved XPointer scheme :)

<button 
   xlink:type="simple" 
   xlink:href="#xpointer(here()/ancestor::slide[1]/preceding::slide[1])"> 
Previous 
</button>

https://www.w3.org/TR/xptr-xpointer/

dauwhe commented 8 years ago

Makoto wrote (in the original thread):

Here is why I think something like EPUBCFI is needed.

In EPUB 3, package documents provide contexts necessary for interpreting content documents. The most notable example is navigation to the next content document in the spine. To find the next content document, reading systems are required to read package documents in advance. The same thing applies to navigation between multiple renditions of a single publication.

But exploding EPUB publications allows content documents to have their own URIs. Browsers can bypass manifestations and directly access content documents. This is not necessarily harmful, since some users might not be interested in navigation within a publication.

But when users are interested in navigation within a publication, what is our scenario? One scenario is that reading systems should always begin with manifestations, and they should access content documents using URIs in manifestations. In other words, content documents should not be directly accessed by URIs. But this scenario looks unsatisfactory to me, since content documents become second-class citizens.

I would like to make manifestations discoverable from some URIs of content documents. Such URIs allow content documents to become first-class citizens and also allow navigation within publications. Although EPUBCFIs are not designed for exploded publications, we can easily get URIs of EPUB publications from EPUBCFIs.

Regards, Makoto

dauwhe commented 8 years ago

Makoto wrote (in the original thread):

Nevertheless, I have not been a fan of EPUBCFIs from the beginning.

Some considerations strongly influenced the design of EPUBCFI [1]. I would like to revisit them. For example:

All fragment identifiers that reference the same logical location should be equal when compared.

Comparison operations, including tests for sorting and comparison, should be able to be performed without accessing the referenced files.
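
As a rough illustration of the second consideration, the Python sketch below orders CFI paths purely from their text, without opening any referenced file. It is a simplification rather than the sorting rules of the specification: the helper name `cfi_sort_key` is hypothetical, ranges and temporal or spatial offsets are not handled, and bracketed assertions are dropped on the assumption that they do not affect order.

```python
import re

def cfi_sort_key(cfi_path: str) -> tuple:
    """Build a sortable key from the numeric steps and the character offset.

    Assumptions: no ranges, no temporal (~) or spatial (@) offsets.
    """
    # Remove assertions such as [para05]; '^' escapes the next character.
    without_assertions = re.sub(r"\[(?:\^.|[^\]^])*\]", "", cfi_path)
    path, _, offset = without_assertions.partition(":")
    steps = tuple(int(number) for number in re.findall(r"/(\d+)", path))
    return (steps, int(offset) if offset else 0)

cfis = [
    "/6/4[chap01ref]!/4[body01]/10[para05]/3:10",
    "/6/4[chap01ref]!/4[body01]/10[para05]/1:0",
    "/6/2[cover]!/4/2/1:5",
]
for cfi in sorted(cfis, key=cfi_sort_key):
    print(cfi)
```

Because the key is derived only from the string, two locations can be ordered before either content document has been parsed.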

[1] http://www.idpf.org/epub/linking/cfi/epub-cfi.html#sec-overview-purpose-and-scope

Regards, Makoto

dauwhe commented 8 years ago

Hadrien Gardeur wrote (in the original thread):

Hello Makoto,

There are several possibilities here:

- first of all, if you point to a content document directly using its URI, you could still discover that it's part of a publication using the link@rel="manifest" that will be included in the HTML header
- I also think that we could provide some additional context, and instead of trying to fit every single bit of information in an IRI, we can express that information in a document (for example an Open Annotation document). A more stable and useful annotation would rely on:
  - a URI to the content document
  - a URI to the manifest (to provide the context)
  - maybe even an identifier for the publication
  - a fragment identifier to point precisely in the content document (robust linking mechanism)
  - text surrounding the target annotation (fuzzy linking)
- media fragments have the track dimension, identified by its name; maybe a similar mechanism could be used to identify the manifest or the publication
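
The first two possibilities can be sketched together in Python: discover the manifest from a content document's `link rel="manifest"`, then record the annotation target as a small document instead of a single IRI. Everything specific here is an assumption for illustration: the URLs, the publication identifier, the CFI-like fragment, and the field names, which are not taken from the Open Annotation vocabulary.

```python
import json
from html.parser import HTMLParser

class ManifestLinkFinder(HTMLParser):
    """Collect the href of the first <link rel="manifest"> element."""

    def __init__(self):
        super().__init__()
        self.manifest_href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "manifest" in rels and self.manifest_href is None:
            self.manifest_href = attributes.get("href")

# Hypothetical content document and URLs, for illustration only.
document_url = "https://example.org/book/chapter1.html"
html = """<html><head>
<link rel="manifest" href="https://example.org/book/manifest.json">
</head><body><p>It was a dark and stormy night.</p></body></html>"""

finder = ManifestLinkFinder()
finder.feed(html)

annotation_target = {
    "document": document_url,           # URI to the content document
    "manifest": finder.manifest_href,   # URI to the manifest (context)
    "publication_id": "urn:uuid:00000000-0000-0000-0000-000000000000",  # placeholder
    "fragment": "/4/2/1:9",             # precise pointer (CFI-like, assumed)
    "text_context": {                   # surrounding text for fuzzy matching
        "prefix": "It was a ",
        "exact": "dark and stormy",
        "suffix": " night.",
    },
}
print(json.dumps(annotation_target, indent=2))
```

Keeping the document URI, the manifest URI, and the fragment in separate fields means the same record works whether the resource is viewed alone or inside a publication.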

You're absolutely right that this is a balancing act between putting the content document and the manifest first.

Hadrien

dauwhe commented 8 years ago

Daniel Weck wrote (in the original thread):

The following key CFI qualities have been very valuable when implementing support in Readium for the Mapping Document of EPUB3 Multiple Renditions, bringing notable benefits in terms of processing / matching performance (compared to when using the most natural alternative in XML/HTML, i.e. fragment identifiers):

- CFI expressions are canonical (a logical location has only one representation, unlike, for example, XPath).
- CFI references are comparable / sortable without having to parse the target DOM.

That being said, the replacement of fragment IDs (which depend on runtime knowledge of the targeted DOMs) with CFI references (which can be compared "offline") is effectively an optimization pass. Arguably, this is better performed once-and-for-all as part of the content production workflow, so that reading systems can execute efficiently. Quite understandably though, this is proving to be an unrealistic expectation on the part of content "authors" / publishers.

So in practice, reading systems must bear the upfront cost of building an optimized cache of DOM-CFI mappings, to enable cheap processing / matching at runtime. I think it's fair to say that due to its (perceived?) inconvenient nature in authoring / production contexts, CFI is relegated to a mere internal implementation detail in reading systems.
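
A minimal sketch of such a cache, assuming well-formed XHTML and using only the Python standard library, might look like the following. It is not how Readium does it; the function name `build_cfi_cache` is hypothetical, and text nodes, character offsets, ranges, and escaping of special characters in ids are left out.

```python
import xml.etree.ElementTree as ET

def build_cfi_cache(root: ET.Element) -> dict:
    """Map each element id to a CFI-style element path inside the document.

    Each element step is 2 * (1-based position among element siblings).
    """
    cache = {}

    def walk(element, path):
        for position, child in enumerate(element, start=1):
            child_id = child.get("id")
            step = f"/{2 * position}" + (f"[{child_id}]" if child_id else "")
            child_path = path + step
            if child_id:
                cache[child_id] = child_path
            walk(child, child_path)

    walk(root, "")
    return cache

xhtml = (
    "<html><head><title>t</title></head>"
    "<body id='body01'><h1>Title</h1><p id='para01'>One</p>"
    "<p id='para02'>Two</p></body></html>"
)
cache = build_cfi_cache(ET.fromstring(xhtml))
print(cache["para02"])  # /4[body01]/6[para02]
```

Once the cache exists, matching a fragment id against a CFI reference is a dictionary lookup and no longer requires walking the DOM.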

To some extent, the same can be said for annotations (selection highlight down to the level of character ranges) and bookmarking (again, at the character level), in the sense that reading systems handle both the creation and consumption of generated CFI expressions. This has little to do with upstream authoring / production tools. The main challenge here is that we need an interoperable persistence format, ideally one that integrates well, syntactically speaking, within URI references.

EPUB3 OA ( http://www.idpf.org/epub/oa/#h.hkfyy9z2lzib ) proposes a model akin to Hadrien's suggested list. Notably: "a fragment identifier to point precisely in the content document (robust linking mechanism)" could be based on the "rightmost" part of CFI expressions, i.e. the canonical syntax for a path / location inside an HTML document. The ordered "spine" level of indirection (denoted by the "!" exclamation mark in a CFI reference) can be loosely replaced with additional data (separate field, as per Hadrien's example). In fact, for practical reasons Readium internally distinguishes the spine item position (in the OPF) from the CFI character range / position. I think that CFI, as we know it now, is bound to be "deprecated" in BFF, but there is no practical equivalent to CFI character locators on the web today, right? (not XPointer, and not Media Fragments)

Daniel

dauwhe commented 8 years ago

Ivan Herman wrote (in the original thread):

This mail was rather triggered by Daniel's mail, and not a direct reply. It may be a bit of a diversion, ie, and a longer term discussion...

On 20 Feb 2016, at 11:28, Daniel Weck danie...@gmail.com wrote:

> EPUB3 OA ( http://www.idpf.org/epub/oa/#h.hkfyy9z2lzib ) proposes a model akin to Hadrien's suggested list. Notably: "a fragment identifier to point precisely in the content document (robust linking mechanism)" could be based on the "rightmost" part of CFI expressions, i.e. the canonical syntax for a path / location inside an HTML document. The ordered "spine" level of indirection (denoted by the "!" exclamation mark in a CFI reference) can be loosely replaced with additional data (separate field, as per Hadrien's example). In fact, for practical reasons Readium internally distinguishes the spine item position (in the OPF) from the CFI character range / position. I think that CFI, as we know it now, is bound to be "deprecated" in BFF, but there is no practical equivalent to CFI character locators on the web today, right? (not XPointer, and not Media Fragments)

My reflection is on the last remark; we are having some discussion currently in the Annotation Working Group and the feedback of this group may be important. My apologies for the diversion and that it may be a bit longish.

The Web Annotation Working Group[1] is working on a general Annotation model (and protocol). Some of you may already know the earlier version, published by a community group[2], which has also been adopted by EDUPUB. The WG is now in a finishing phase of the technical work for the new version of the model which ought to become a Recommendation by the end of the year (provided it goes through all the hurdles of implementations). The latest version of the model is the editor's draft of the model itself[3], described fully in JSON. (There is also an underlying, formal RDF vocabulary in the making which is of interest for RDF heads only; for the aficionados, the JSON version is in fact JSON-LD, relying on that vocabulary. But users may not want to know that.)

The Annotation model has several means to refer to a target, ie, where an annotation can be anchored. The relevant sections in the document are Selectors[4] and States[5]. The former provides a common JSON vocabulary and structure to describe things like "the target is that and that text interval", "the target is what is selected by this and this CSS media query", etc. (The list is not yet final, there will probably be an XPath selector, for example.) States have the same role in terms of, say, the time stamp of a resource. Furthermore, there will be a possibility to define combinations of selectors, ie, something like "find some text via a CSS query, and then select a text portion beginning by this text and ending by that". (The exact syntax of how to express this combination is still under discussion.)

The bottom line is that the selector mechanism gives a powerful way to select a specific portion, target, etc, of a file, let that be SVG or HTML (or possibly others). My personal feeling is that this mechanism is very valuable and useful, regardless of whether it is used for annotations or for something else. As a consequence, I raised an issue in the WG that this mechanism should be made useful beyond annotations[6]. I think there is a general agreement for that; there is a very technical discussion whether the underlying RDF vocabulary should use a different namespace for the selector constructs or not which, in the grand scheme of things, is a detail. There is also an agreement (I think, although the issue is still open) that there is no need for a separate Recommendation, because the specification in [4] (and possibly [5]) is enough but, to make it more palatable to outsiders, it may be worth creating a separate document (a W3C WG Note) that describes only the selection mechanism.

So, here is my first question: would such a facility be useful for this community? My belief is 'yes', but I am obviously biased.

The second, related question is a bit more complicated. A selector is not a URI (as opposed to a fragment ID, for example), but a JSON structure. In some use cases (eg, in RDF) there may be a need to express the selectors as URI-s as well. Defining such a URI by, essentially, copying the syntax of selectors into a fragment may become possible and may not be very complicated; one could imagine something like:

http://www.ex.org/ex.html#selector(type=TextQuoteSelector,exact="annotation",prefix="this is an",suffix="that has some")

(I know, the URI has to be normalized, but let us forget about that now.) Ain't very pretty, it is reminiscent of the complexity of CFI, but is more firmly grounded in something that has a stable specification and implementations. Ie, the fragment ID is "just" a shorthand for something more general.

The problem with this is more stupid, because it is non-technical. At the moment, there is no clear and appropriate way of defining and, mainly, registering a new fragment ID for an already existing media type (in this case HTML). And, as usual, this may lead to much controversy, because if we defined the selector fragments for HTML, that may mean some sort of an implicit requirement that browsers should implement this, something which may be pushed back. Ie, not clear how to handle that.

_But_, if we put that aside, here is my second question: provided your answer to my previous question is 'yes, selectors are useful', is it important for this community to express them as URI-s?

Again, sorry for the diversion…

Ivan

[1] https://www.w3.org/annotation/wiki/Main_Page
[2] http://www.openannotation.org/spec/core/
[3] https://w3c.github.io/web-annotation/model/wd2/
[4] https://w3c.github.io/web-annotation/model/wd2/#selectors
[5] https://w3c.github.io/web-annotation/model/wd2/#states
[6] https://github.com/w3c/web-annotation/issues/110

---
Ivan Herman, W3C Digital Publishing Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
ORCID ID: http://orcid.org/0000-0003-0782-2704
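
To show what the two questions amount to in practice, here is a hedged Python sketch. The `TextQuoteSelector` fields (`exact`, `prefix`, `suffix`) come from the Web Annotation model; the helper names, the sample text, and above all the `selector(...)` fragment serialization are hypothetical, since no such fragment scheme is registered.

```python
from urllib.parse import quote

selector = {
    "type": "TextQuoteSelector",
    "exact": "annotation",
    "prefix": "this is an ",
    "suffix": " that has some",
}

def find_quote(text: str, sel: dict) -> int:
    """Return the start offset of sel['exact'] in text, or -1 if not found.

    The prefix and suffix disambiguate repeated occurrences of the quote.
    """
    prefix = sel.get("prefix", "")
    needle = prefix + sel["exact"] + sel.get("suffix", "")
    position = text.find(needle)
    return -1 if position < 0 else position + len(prefix)

def to_fragment(sel: dict) -> str:
    """Hypothetical 'selector(...)' serialization; not a registered scheme."""
    pairs = [f"type={sel['type']}"]
    pairs += [
        f"{key}={quote(sel[key])}"
        for key in ("exact", "prefix", "suffix")
        if key in sel
    ]
    return "selector(" + ",".join(pairs) + ")"

text = "Note: this is an annotation that has some text."
start = find_quote(text, selector)
print(text[start:start + len(selector["exact"])])  # annotation
print("http://www.ex.org/ex.html#" + to_fragment(selector))
```

The JSON structure is enough to resolve the target; the fragment form only matters where a single URI is required, for example in RDF vocabularies.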

dauwhe commented 8 years ago

Hadrien Gardeur wrote (in the original thread):

I believe that selectors are indeed useful and that we need more than one to have a robust solution, especially with content documents that can be updated on a regular basis.

Regarding the serialization of such selectors as a URI, I'm not entirely convinced that this is a requirement. For bookmarks and annotations, we can structure this info as a document (arbitrary JSON or JSON-LD), and while we won't be able to provide links that point precisely into a document, that's no different from what we have on the Web currently, plus we just decided to deprecate CFI in content documents too.

There are requirements such as rendition mappings for which the situation is different, but the requirements for them are vastly different from annotations/bookmarks. Since rendition mappings are authored by the content creator, there's no need for a selector based system and the same robustness in linking back to the content document.

dauwhe commented 8 years ago

Ivan Herman wrote (in the original thread):

Thanks Hadrien!

On 22 Feb 2016, at 16:55, Hadrien Gardeur hadrien...@feedbooks.com wrote:

> I believe that selectors are indeed useful and that we need more than one to have a robust solution, especially with content documents that can be updated on a regular basis.

> Regarding the serialization of such selectors as a URI, I'm not entirely convinced that this is a requirement. For bookmarks and annotations, we can structure this info as a document (arbitrary JSON or JSON-LD), and while we won't be able to provide links that point precisely into a document, that's no different from what we have on the Web currently, plus we just decided to deprecate CFI in content documents too.

Understood. My worry is more about RDF based vocabularies that are based on URI-s.

Thanks!

Ivan

dauwhe / epub31-bff

CFI in BFF #10