edrlab / thorium-reader

A cross platform desktop reading app, based on the Readium Desktop toolkit
https://www.edrlab.org/software/thorium-reader/
BSD 3-Clause "New" or "Revised" License

[Annotations import/export]: support of the annotation selectors from w3c annotation spec #2625

Open panaC opened 3 weeks ago

panaC commented 3 weeks ago

Currently we can export and import an annotation set using the Readium annotation spec, but annotation selector matching is tied to the r2-navigator-js IRangeInfo model. We need an interface that can accept and parse any annotation selector from the W3C annotation spec.

Support of the w3c annotation data model selectors https://www.w3.org/TR/2017/REC-annotation-model-20170223/#selectors :

Need to update the readium annotator spec https://github.com/readium/annotations?tab=readme-ov-file#111-selector to fully support w3c annotation selector model.

FragmentSelector :

TextFragment :

Conforms to the application/xhtml+xml fragment identifier spec (RFC 3236) and to the scroll-to-text-fragment spec https://wicg.github.io/scroll-to-text-fragment/

{
      "type": "FragmentSelector",
      "conformsTo": "http://tools.ietf.org/rfc/rfc3236",
      "value": "#:~:text=The%20first%20recorded,Williams"
}

Not supported yet; we need to find a robust library to handle this.
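While no library has been chosen, the directive string itself can at least be split into its parts. A minimal sketch, assuming the `text=[prefix-,]start[,end][,-suffix]` syntax described in the scroll-to-text-fragment spec; `TextDirective` and `parseTextDirective` are hypothetical names, only the first text directive is read, and the hard part (finding the text in the DOM) is not covered.

```typescript
// Hypothetical shape for a parsed text directive (not part of any library).
interface TextDirective {
  prefix?: string;
  start: string;
  end?: string;
  suffix?: string;
}

// Parses the first ":~:text=" directive of a fragment such as
// "#:~:text=The%20first%20recorded,Williams".
function parseTextDirective(fragment: string): TextDirective | undefined {
  const marker = ":~:text=";
  const index = fragment.indexOf(marker);
  if (index === -1) {
    return undefined;
  }
  // Additional directives are separated by "&"; only the first is handled here.
  const [raw = ""] = fragment.slice(index + marker.length).split("&");
  const parts = raw.split(",");
  const directive: TextDirective = { start: "" };

  // A leading part that ends with "-" is the prefix.
  if (parts.length > 1 && parts[0].endsWith("-")) {
    directive.prefix = decodeURIComponent(parts.shift()!.slice(0, -1));
  }
  // A trailing part that starts with "-" is the suffix.
  if (parts.length > 1 && parts[parts.length - 1].startsWith("-")) {
    directive.suffix = decodeURIComponent(parts.pop()!.slice(1));
  }
  if (parts.length < 1 || parts.length > 2 || parts[0] === "") {
    return undefined;
  }
  directive.start = decodeURIComponent(parts[0]);
  if (parts.length === 2) {
    directive.end = decodeURIComponent(parts[1]);
  }
  return directive;
}
```

For the selector value above, this yields the start text "The first recorded" and the end text "Williams".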

Audiobook media fragments:

{
    "type": "FragmentSelector",  
    "conformsTo": "http://www.w3.org/TR/media-frags/",
    "value": "t=30,60" 
} 
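A minimal sketch of reading such a temporal fragment, assuming only the plain-seconds form (`t=start,end`, `t=start`, `t=,end`, with an optional `npt:` prefix); the `hh:mm:ss` and other time formats of the Media Fragments spec are not handled, and `TimeRange` and `parseTemporalFragment` are hypothetical names.

```typescript
// Hypothetical result type: times are in seconds, end is undefined when open-ended.
interface TimeRange {
  start: number;
  end: number | undefined;
}

function parseTemporalFragment(value: string): TimeRange | undefined {
  const match = /^#?t=(?:npt:)?(\d+(?:\.\d+)?)?(?:,(\d+(?:\.\d+)?))?$/.exec(value);
  if (!match) {
    return undefined;
  }
  const startText = match[1];
  const endText = match[2];
  if (startText === undefined && endText === undefined) {
    return undefined; // "t=" alone carries no information
  }
  // A missing start means "from the beginning".
  const start = startText === undefined ? 0 : Number(startText);
  const end = endText === undefined ? undefined : Number(endText);
  if (end !== undefined && end <= start) {
    return undefined; // an empty or reversed interval is rejected in this sketch
  }
  return { start, end };
}
```

With the value above, `parseTemporalFragment("t=30,60")` gives a range from 30 to 60 seconds.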

CssSelector :

example :

{
     "type": "CssSelector",
     "value": "tr:nth-child(21) > td:nth-child(1)"
}

refined by a textPositionSelector inside the node container

{
    "type": "CssSelector",
    "value": "#original-content",
    "refinedBy": {
         "type": "TextPositionSelector",
         "start": 58,
         "end": 138
    }
}

Supported on apache-annotator


// CssSelector refined by a TextPositionSelector, resolved to DOM Ranges
import {
  createCssSelectorMatcher,
  createTextPositionSelectorMatcher,
} from "@apache-annotator/dom";
import {
  makeRefinable,
  type CssSelector,
  type Selector,
} from "@apache-annotator/selector";

const createMatcher = makeRefinable((selector: Selector) => {
  const innerCreateMatcher = {
    TextPositionSelector: createTextPositionSelectorMatcher,
    CssSelector: createCssSelectorMatcher,
  }[selector.type];

  if (!innerCreateMatcher) {
    throw new Error(`Unsupported selector type: ${selector.type}`);
  }

  return innerCreateMatcher(selector);
});

const selector = {
  type: "CssSelector",
  value: "#original-content",
  refinedBy: {
    type: "TextPositionSelector",
    start: 58,
    end: 138,
  },
};

// The matcher takes a scope (here the document body) and yields its matches asynchronously
const matchAll = createMatcher(selector);
const ranges = [];
for await (const range of matchAll(document.body)) {
  ranges.push(range);
}
// range to CssSelector

import { describeTextPosition } from "@apache-annotator/dom";

const describeRangeCssSelector = async (range: Range): Promise<CssSelector> => {

  // normalize the range, start and end container will be TEXT_NODE element
  // https://github.com/readium/r2-navigator-js/blob/6cf8c24de79d59ce649c6a38a714ddce13a932c1/src/electron/renderer/webview/selection.ts#L768
  const rangeNormalize = normalizeRange(range);

  // find the nearest common ancestor element
  const commonAncestorHTMLElement = ...

  // https://github.com/edrlab/thorium-reader/blob/b2d06b12809c9eaf140e1a135b769baacf05f07d/src/utils/search/cssSelector.ts#L24
  const selector = getCssSelector(commonAncestorHTMLElement);

  return {
    type: "CssSelector",
    value: selector,
    // the text position is described relative to the common ancestor element
    refinedBy: await describeTextPosition(
      rangeNormalize,
      commonAncestorHTMLElement,
    ),
  };
};

XPathSelector

ex:

{
    "type": "XPathSelector",
    "value": "/html/body/p[2]/table/tr[2]/td[3]/span",
    "refinedBy": {
         "type": "TextPositionSelector",
         "start": 58,
         "end": 138
    }
}

Supported by neither apache-annotator nor r2-navigator-js.

We need to decide how to handle this selector, and whether it will be parsed at all.

Note: used with the hypothesis client https://github.com/hypothesis/client/blob/main/src/annotator/anchoring/xpath.ts

TextQuoteSelector

ex:

 {
    "type": "TextQuoteSelector",
    "exact": "Combien de fois \n\n    ne m’avait-il",
    "prefix": "trouver quelqu’un    \n  \t\t    comme vous. ",
    "suffix": " pas \n\n      reproché de travailler ma"
 }

Supported on apache-annotator.

Do not generate this selector for LCP-protected publications. Note from the W3C spec:

Note: If the content is under copyright or has other rights asserted on its use, then this method of selecting text is potentially dangerous. A user might select the entire text of the document to annotate, which would not be desirable to copy into the Annotation and share. For static texts with access and/or distribution restrictions, the use of the Text Position Selector is perhaps more appropriate.
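One way to apply this rule at export time is to drop every selector that carries quoted text when the publication is protected. A minimal sketch, assuming the caller already knows whether the publication is LCP-protected; `SelectorLike`, `containsTextQuote` and `selectorsForExport` are hypothetical names, not existing Thorium functions.

```typescript
// Loose structural type covering the selector nesting used in this thread.
interface SelectorLike {
  type: string;
  refinedBy?: SelectorLike;
  startSelector?: SelectorLike;
  endSelector?: SelectorLike;
}

// True when the selector, or any selector nested inside it, is a TextQuoteSelector.
function containsTextQuote(selector: SelectorLike | undefined): boolean {
  if (!selector) {
    return false;
  }
  return (
    selector.type === "TextQuoteSelector" ||
    containsTextQuote(selector.refinedBy) ||
    containsTextQuote(selector.startSelector) ||
    containsTextQuote(selector.endSelector)
  );
}

// Keeps all selectors for unprotected publications, and removes the ones
// that would copy publication text when the publication is protected.
function selectorsForExport<T extends SelectorLike>(
  selectors: readonly T[],
  isLcpProtected: boolean,
): T[] {
  return isLcpProtected
    ? selectors.filter((selector) => !containsTextQuote(selector))
    : [...selectors];
}
```

An annotation exported this way still needs at least one other selector (for example a TextPositionSelector) to remain resolvable.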

Implementation with Apache-annotator : https://annotator.apache.org/docs/api/modules/selector.html#textquoteselectormatcher
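To make the semantics concrete without a DOM, the sketch below matches a TextQuoteSelector against a plain string: `prefix` and `suffix` only disambiguate, and the returned positions cover `exact` alone. This is an illustration with exact string comparison and no whitespace normalisation, not the Apache Annotator implementation; `findTextQuote` is a hypothetical name.

```typescript
interface TextQuoteSelector {
  type: "TextQuoteSelector";
  exact: string;
  prefix?: string;
  suffix?: string;
}

interface TextSpan {
  start: number;
  end: number;
}

// Returns every span of `text` where prefix + exact + suffix occurs,
// reporting only the positions of `exact`.
function findTextQuote(text: string, selector: TextQuoteSelector): TextSpan[] {
  const prefix = selector.prefix ?? "";
  const suffix = selector.suffix ?? "";
  const pattern = prefix + selector.exact + suffix;
  const spans: TextSpan[] = [];
  if (pattern === "") {
    return spans;
  }
  let from = 0;
  for (;;) {
    const index = text.indexOf(pattern, from);
    if (index === -1) {
      break;
    }
    const start = index + prefix.length;
    spans.push({ start, end: start + selector.exact.length });
    from = index + 1; // continue after this occurrence to find later matches
  }
  return spans;
}
```

Because the comparison is exact, the runs of spaces, tabs and newlines in the example above must appear identically in the target text for it to match.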

TextPositionSelector

ex:

{
    "type": "TextPositionSelector",
    "start": 1876,
    "end": 1880
}

apache annotator implementation: https://annotator.apache.org/docs/api/modules/selector.html#textpositionselectormatcher

RangeSelector

ex:

{
    "type": "RangeSelector",
    "startSelector": {
        "type": "CssSelector",
        "value": "p:nth-child(24)",
        "refinedBy": {
            "type": "TextPositionSelector",
            "start": 28,
            "end": 32
        }
    },
    "endSelector": {
        "type": "CssSelector",
        "value": "p:nth-child(24)",
        "refinedBy": {
            "type": "TextPositionSelector",
            "start": 32,
            "end": 88
        }
    }
}

supported on apache-annotator

range to RangeSelector :

Just a POC example; it still needs to be tested!

const describeRange = async (range: Range): Promise<RangeSelector | undefined> => {
  const rangeNormalize = normalizeRange(range);

  const startIsElement =
    rangeNormalize.startContainer.nodeType === Node.ELEMENT_NODE;
  if (startIsElement) {
    return undefined;
  }
  const startContainerHTMLElement =
    rangeNormalize.startContainer.parentNode instanceof HTMLElement
      ? rangeNormalize.startContainer.parentNode
      : undefined;
  if (!startContainerHTMLElement) {
    return undefined;
  }

  const endIsElement = rangeNormalize.endContainer.nodeType === Node.ELEMENT_NODE;
  if (endIsElement) {
    return undefined;
  }
  const endContainerHTMLElement =
    rangeNormalize.endContainer.parentNode instanceof HTMLElement
      ? rangeNormalize.endContainer.parentNode
      : undefined;
  if (!endContainerHTMLElement) {
    return undefined;
  }

  const startAndEndEqual =
    startContainerHTMLElement === endContainerHTMLElement;
  const startContainerHTMLElementCssSelector = finder(
    startContainerHTMLElement,
  );
  const endContainerHTMLElementCssSelector = startAndEndEqual
    ? startContainerHTMLElementCssSelector
    : finder(endContainerHTMLElement);

  return {
    type: "RangeSelector",
    startSelector: {
      type: "CssSelector",
      value: startContainerHTMLElementCssSelector,
      refinedBy: {
        type: "TextPositionSelector",
        start: rangeNormalize.startOffset,
        end: startAndEndEqual
          ? rangeNormalize.endOffset
          : (rangeNormalize.startContainer as Text).data.length,
      },
    },
    endSelector: {
      type: "CssSelector",
      value: endContainerHTMLElementCssSelector,
      refinedBy: {
        type: "TextPositionSelector",
        start: rangeNormalize.endOffset,
        end: (rangeNormalize.endContainer as Text).data.length,
      },
    },
  };
};

A RangeSelector can be parsed without the DOM content loaded in memory, using just a mapping to the r2-navigator-js IRangeInfo:

{
    startContainerElementCssSelector: selector.startSelector.value,
    startOffset: selector.startSelector.refinedBy.start,
    endContainerElementCssSelector: selector.endSelector.value,
    endOffset: selector.endSelector.refinedBy.start,
}

A RangeSelector matcher is implemented here with apache-annotator, and is usable like the other selectors.

danielweck commented 3 weeks ago

Not supported yet; we need to find a robust library to handle this.

I ported the relevant parts of the official polyfill from JavaScript to TypeScript, adapted the logic to meet our needs, and fixed some bugs (which I also reported upstream):

...unfortunately at this stage my work remains in a branch, because the DOMRange-to-TextFragment logic fails in a reproducible manner with my go-to test ebook (AccessibleEPUB3) and I ran out of time to troubleshoot further.

https://github.com/readium/r2-navigator-js/compare/develop..feat/text-fragments

danielweck commented 3 weeks ago

refined by a textPositionSelector inside the node container

{
    "type": "CssSelector",
    "value": "#original-content",
    "refinedBy": {
         "type": "TextPositionSelector",
         "start": 58,
         "end": 138
    }
}

This doesn't make sense. A CssSelector references a DOM Element, whereas the start/end integers reference character positions inside a DOM TextNode. A DOM element can contain no children; it can contain multiple sibling child text nodes (although these are typically normalised into a single TextNode unless the CDATA is marked as such); and it can contain mixed / interspersed TextNodes and Elements at the first child layer, and so on recursively, deeper in the descendants.

panaC commented 2 weeks ago

Following an internal discussion with the team.

I will try to explain how DOM Range serialisation works in r2-navigator-js.

A DOM Range represents the span between a start container and an end container. These can be two TEXT_NODEs, with the start and end offsets at character level, or two ELEMENT_NODEs, with the start and end offsets being child indexes.

In r2-navigator-js, a DOM Range is serialised to an object of 6 values:

{
   startContainerElementCssSelector: string;
   startContainerChildTextNodeIndex: number;
   startOffset: number;
   endContainerElementCssSelector: string;
   endContainerChildTextNodeIndex: number;
   endOffset: number;
}

A CSS selector cannot reference a TEXT_NODE, as Daniel said, so we have to specify the index of the TEXT_NODE relative to its parent element. That way the full expressiveness of a DOM Range is preserved and the range can be fully recreated.

When the container is an ELEMENT_NODE, the childTextNodeIndex value is set to -1 and ignored; the cssSelector value and the offset are enough to get the Range back. When the container is something other than an ELEMENT_NODE (a TEXT_NODE most of the time), the cssSelector targets the parent of the TEXT_NODE and childTextNodeIndex is the position of the TEXT_NODE among the parent's children, so we get a complete serialisation of the DOM Range.

This is why we lose information with a CssSelector refined by a TextPositionSelector: the structure of the DOM element has to be reconstructed from only a start and an end character position, without knowing which TEXT_NODE index is targeted. To resolve a TextPositionSelector, we have to traverse the tree and accumulate the length of every text node until we reach the TEXT_NODE that contains the wanted position.
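The traversal described above can be sketched without a DOM by working on the lengths of the element's descendant text nodes, taken in document order (in a DOM context they could be collected with a TreeWalker). `locateOffset` and `TextNodePoint` are hypothetical names, and the resulting index counts flattened descendant text nodes, which is not necessarily the same value as the childTextNodeIndex stored in IRangeInfo.

```typescript
// Position inside one specific text node.
interface TextNodePoint {
  textNodeIndex: number; // index in the flattened list of descendant text nodes
  offset: number; // character offset inside that text node
}

// Converts a TextPositionSelector-style position (relative to the whole text of
// the element) into a text node index plus an offset inside that node.
function locateOffset(
  textNodeLengths: readonly number[],
  position: number,
): TextNodePoint | undefined {
  if (!Number.isInteger(position) || position < 0) {
    return undefined;
  }
  let consumed = 0;
  for (let index = 0; index < textNodeLengths.length; index++) {
    const length = textNodeLengths[index];
    if (position < consumed + length) {
      return { textNodeIndex: index, offset: position - consumed };
    }
    consumed += length;
  }
  // A position equal to the total length points at the end of the last text node.
  const last = textNodeLengths.length - 1;
  if (last >= 0 && position === consumed) {
    return { textNodeIndex: last, offset: textNodeLengths[last] };
  }
  return undefined; // position is beyond the text of the element
}
```

Note the ambiguity at boundaries: a position that falls exactly between two text nodes could mean the end of the first or the start of the second. This sketch always picks the start of the following node, which is one illustration of the information a plain character position cannot carry.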

XPath doesn't have this issue, since it can address any kind of node, such as a TEXT_NODE with text()[index]. For example, with a valid XPath refinedBy:

{
    "type": "RangeSelector",
    "startSelector": {
        "type": "CssSelector",
        "value": "p:nth-child(24)",
        "refinedBy": {
            "type": "XPathSelector",
            "value": "/text()[2]",
            "refinedBy": {
                "type": "TextPositionSelector",
                "start": 28,
                "end": 32
            }
        }
    },
    "endSelector": {
        "type": "CssSelector",
        "value": "p:nth-child(24)",
        "refinedBy": {
            "type": "XPathSelector",
            "value": "/text()[2]",
            "refinedBy": {
                "type": "TextPositionSelector",
                "start": 32,
                "end": 88
            }
        }
    }
}

I hope this is clearer; it is for me, at least.

The question now is: should we trust TextPositionSelector?

panaC commented 2 weeks ago

Another question is how to import these annotation selectors, which need to be converted to IRangeInfo.

Currently we can import a Readium annotation set (a .annotation file) from both the library and reader windows, and it is processed in the main process. If a selector cannot be mapped to IRangeInfo "offline" (without a mounted DOM), it is not imported into the publication's annotation list saved in the Thorium database. So we need an adapter that imports any selector and converts it to DOM Range info, and then to an r2-navigator-js IRangeInfo.
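A first step for such an adapter could be to decide, per selector, where it can be converted. The sketch below is only an illustration: the "offline" rule follows the RangeSelector mapping given earlier in this thread (which the discussion above questions), "needs-dom" covers the selector types reported as supported by apache-annotator, and treating everything else (XPathSelector, FragmentSelector) as "unsupported" is an assumption. `ImportRoute` and `routeSelector` are hypothetical names.

```typescript
type ImportRoute = "offline" | "needs-dom" | "unsupported";

// Loose structural type covering the selector nesting used in this thread.
interface SelectorLike {
  type: string;
  refinedBy?: SelectorLike;
  startSelector?: SelectorLike;
  endSelector?: SelectorLike;
}

function isCssWithTextPosition(selector: SelectorLike | undefined): boolean {
  return (
    selector?.type === "CssSelector" &&
    selector.refinedBy?.type === "TextPositionSelector"
  );
}

function routeSelector(selector: SelectorLike): ImportRoute {
  switch (selector.type) {
    case "RangeSelector":
      // Both boundaries carry a CSS selector plus offsets: mappable in the main process.
      return isCssWithTextPosition(selector.startSelector) &&
        isCssWithTextPosition(selector.endSelector)
        ? "offline"
        : "needs-dom";
    case "CssSelector":
    case "TextQuoteSelector":
    case "TextPositionSelector":
      // Must be resolved against the rendered document in the reader process.
      return "needs-dom";
    default:
      return "unsupported";
  }
}
```

Selectors routed to "needs-dom" would have to wait until the publication is opened in a reader window.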

There are some constraints :

panaC commented 2 weeks ago

The use case for importing an annotation set into Thorium could be:

  1. The user imports an annotation set by dragging and dropping the file into the library or reader,
  2. or by clicking the import annotation button in the library publication catalog menu or in the reader annotation modal.
  3. The main process import routine is triggered.
  4. If no file path was provided, the user is asked for the path of the annotation set via a system dialog.
  5. The annotation set content is read, parsed and validated.
  6. Check that the annotation set belongs to the chosen publication.
  7. Check for date/uuid conflicts between the incoming annotations and the annotations already owned by the publication.
  8. The import modal is shown, letting the user cancel the import, import all incoming annotations (including those in conflict), or import only the annotations without conflict.
  9. The user's choice triggers an action dispatched back to the main process.
  10. The main process builds a queue of incoming annotations ready to be imported in the publication reader.
  11. When the user opens the publication reader, the import routine is launched in the DOM context.
  12. In the reader process, loop over the previously queued incoming annotations and run the "convert Selector to Range" routine on each.
  13. Once an annotation is converted to an r2-navigator-js IRangeInfo, it is pushed to the reader annotations list and persisted on disk.
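Steps 7 to 10 above can be sketched as two small pure functions. This is an illustration under assumptions: a conflict is taken to mean "an annotation with the same uuid already exists" (the date comparison is left out because its rule is not defined here), and `AnnotationLike`, `partitionByConflict`, `ImportChoice` and `annotationsToQueue` are hypothetical names.

```typescript
interface AnnotationLike {
  uuid: string;
}

interface ConflictReport<T> {
  withoutConflict: T[];
  inConflict: T[];
}

// Step 7: split the incoming annotations by whether their uuid is already known.
function partitionByConflict<T extends AnnotationLike>(
  existing: readonly AnnotationLike[],
  incoming: readonly T[],
): ConflictReport<T> {
  const known = new Set(existing.map((annotation) => annotation.uuid));
  const report: ConflictReport<T> = { withoutConflict: [], inConflict: [] };
  for (const annotation of incoming) {
    if (known.has(annotation.uuid)) {
      report.inConflict.push(annotation);
    } else {
      report.withoutConflict.push(annotation);
    }
  }
  return report;
}

// Steps 8 to 10: the three choices offered by the import modal.
type ImportChoice = "cancel" | "all" | "no-conflict";

function annotationsToQueue<T>(report: ConflictReport<T>, choice: ImportChoice): T[] {
  switch (choice) {
    case "cancel":
      return [];
    case "all":
      return [...report.withoutConflict, ...report.inConflict];
    case "no-conflict":
      return [...report.withoutConflict];
  }
}
```

The returned array is what the main process would queue until the publication is opened in a reader window.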

Currently 1, 10, 11, 12 and even 13 are not implemented in the develop branch.

The top priority for now is the “convert Selector to Range” routine.

panaC commented 2 weeks ago

selectors highlight demonstration : https://github.com/edrlab/w3c-annotation-selector-demo https://edrlab.github.io/w3c-annotation-selector-demo/web/

panaC commented 1 week ago

I propose a selector that can be mapped to IRangeInfo without DOM context :

{
    "type": "RangeSelector",
    "startSelector": {
        "type": "CssSelector",
        "value": "#intro > p:nth-child(2)",
        "refinedBy": {
            "type": "TextNodeIndexSelector",
            "value": 0,
            "refinedBy": {
                "type": "CodeUnitSelector",
                "value": 4
            }
        }
    },
    "endSelector": {
        "type": "CssSelector",
        "value": "#intro > p:nth-child(3)",
        "refinedBy": {
            "type": "TextNodeIndexSelector",
            "value": 2,
            "refinedBy": {
                "type": "CodeUnitSelector",
                "value": 11
            }
        }
    }
}

This is a RangeSelector with a CssSelector and two new selectors: one for the textNodeIndex in a normalized range, and one for the character position in code units.

It can easily be mapped to IRangeInfo:

{
    "rangeInfo": {
        "endContainerChildTextNodeIndex": 2,
        "endContainerElementCssSelector": "#intro > p:nth-child(3)",
        "endOffset": 11,
        "startContainerChildTextNodeIndex": 0,
        "startContainerElementCssSelector": "#intro > p:nth-child(2)",
        "startOffset": 4
    },
    "cleanBefore": " Some text. The ",
    "cleanText": "quick brown fox jumps over the lazy dog. The lazy white dog sleeps",
    "cleanAfter": " with the crazy fox. Image wit",
    "rawBefore": " Some text.\n        The ",
    "rawText": "quick brown fox jumps over the lazy dog.\n        The lazy white dog sleeps",
    "rawAfter": " with the crazy fox.\n        Image wit"
}
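The mapping can be written as a small function that needs no DOM. A minimal sketch, assuming the proposed selector shape and the IRangeInfo field names shown above; `toRangeInfo` and its helpers are hypothetical names, and the input is validated at runtime because it comes from an imported file.

```typescript
interface IRangeInfo {
  startContainerElementCssSelector: string;
  startContainerChildTextNodeIndex: number;
  startOffset: number;
  endContainerElementCssSelector: string;
  endContainerChildTextNodeIndex: number;
  endOffset: number;
}

interface Boundary {
  css: string;
  textNodeIndex: number;
  offset: number;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Reads CssSelector > TextNodeIndexSelector > CodeUnitSelector, or returns undefined.
function readBoundary(value: unknown): Boundary | undefined {
  if (!isRecord(value) || value.type !== "CssSelector" || typeof value.value !== "string") {
    return undefined;
  }
  const node = value.refinedBy;
  if (!isRecord(node) || node.type !== "TextNodeIndexSelector" || typeof node.value !== "number") {
    return undefined;
  }
  const unit = node.refinedBy;
  if (!isRecord(unit) || unit.type !== "CodeUnitSelector" || typeof unit.value !== "number") {
    return undefined;
  }
  return { css: value.value, textNodeIndex: node.value, offset: unit.value };
}

function toRangeInfo(selector: unknown): IRangeInfo | undefined {
  if (!isRecord(selector) || selector.type !== "RangeSelector") {
    return undefined;
  }
  const start = readBoundary(selector.startSelector);
  const end = readBoundary(selector.endSelector);
  if (!start || !end) {
    return undefined;
  }
  return {
    startContainerElementCssSelector: start.css,
    startContainerChildTextNodeIndex: start.textNodeIndex,
    startOffset: start.offset,
    endContainerElementCssSelector: end.css,
    endContainerChildTextNodeIndex: end.textNodeIndex,
    endOffset: end.offset,
  };
}
```

Applied to the proposed selector above, this produces the same six `rangeInfo` values; the `clean*` and `raw*` text fields are not derivable from the selector alone and would still require the DOM.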