SciCrunch / scibot

curation workflow automation and coordination
Apache License 2.0

capture TextPositionSelector and/or RangeSelector #29

Open judell opened 5 years ago

judell commented 5 years ago

As per https://github.com/hypothesis/product-backlog/issues/1022, Hypothesis fails to distinguish among targets that share a common prefix and exact but differ in suffix. Annotations for multiple such targets pile up on a single highlight, preventing human curators from navigating to, and responding to, each target.
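To make the kind of target concrete, here is a minimal sketch in Python. It is not the Hypothesis anchoring code. The sample text, the placeholder ID `RRID:AB_123456`, the helper name `quote_contexts`, and the 32-character context length are all assumptions chosen for illustration.

```python
import re

CONTEXT = 32  # context length in characters; an assumption for this sketch


def quote_contexts(text, exact, context=CONTEXT):
    """Return (start, prefix, suffix) for every occurrence of `exact` in `text`."""
    found = []
    for m in re.finditer(re.escape(exact), text):
        prefix = text[max(0, m.start() - context):m.start()]
        suffix = text[m.end():m.end() + context]
        found.append((m.start(), prefix, suffix))
    return found


# Made-up sample: the same placeholder ID is cited twice with an identical lead-in.
sample = (
    "Cells were stained with anti-GFP (RRID:AB_123456) at 1:500. "
    "Cells were stained with anti-GFP (RRID:AB_123456) at 1:1000."
)
contexts = quote_contexts(sample, "RRID:AB_123456")
```

Both entries in `contexts` carry the same prefix and the same exact text. Only the suffix and the start offset tell them apart, which is why a position-based selector looks like a useful tiebreaker.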

One solution would be to run SciBot in the web page, where it would have DOM access and could reuse the Hypothesis anchoring libraries. In the near term that would require a rewrite to JavaScript, which makes it a nonstarter. In the longer term it's possible that WebAssembly will enable packaging the existing Python-based code into a form usable in the browser, and that's worth bearing in mind.

The other solution would be to replicate, in the Python-based SciBot code, the selectors produced by the Hypothesis JS-based anchoring machinery. There are two possibilities here: match the TextPositionSelector that Hypothesis produces, or match the RangeSelector (XPath) that Hypothesis produces. I'd be willing to investigate the feasibility of these strategies.

tgbugs commented 5 years ago

Huh, 1022 explains a lot.

How much infrastructure would we need to have the bookmarklet load a helper script from a static url, so that the bookmarklet stays the same but we can add functionality like this? I'm thinking a single additional endpoint? Any known CORS issues with loading a remote script from a bookmarklet? Also, do we need the full rendered DOM to be able to get the xpaths or can we extract them from document.innerHtml? A problem I see with that approach would be mapping the ids found in the inner text back onto the innerHtml in cases where some markup splits an id (which is now quite frequent due to journals having completely whiffed on the typesetting ...).

WebAssembly is on my radar, though taking a look around I found https://github.com/iodide-project/pyodide, which is ... not reassuring with regard to the current complexity of the setup required. We would have to evaluate the time tradeoffs between working on that and a complete rewrite.

judell commented 5 years ago

> How much infrastructure would we need to have the bookmarklet load a helper script from a static url, so that the bookmarklet stays the same but we can add functionality like this?

That's a separate question to which the answer I think is "just do it" :-) There are only a handful of curators who have installed the bookmarklet, right? A one-time upgrade to a bookmarklet that's a stub pointing to malleable code is a pretty small intervention.

> Any known CORS issues with loading a remote script from a bookmarklet?

The possible issue is CSP (Content Security Policy). I'm not aware that any of our target sites enforce CSP. If any do, the fallback would be to package the thing in a dirt-simple Chrome extension.

> do we need the full rendered DOM to be able to get the xpaths or can we extract them from document.innerHtml?

It's ideal to operate in DOM context using the same code Hypothesis (and compatible clients) use, all based on the common anchoring libraries.

That said, it may be possible to match TextPositionSelector by stripping markup from the innerText you get and marking positions in the stream of characters. In principle that should reproduce the TextPositionSelectors that the Hypothesis client produces; in practice we'll just have to try and see what happens.
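As a rough sketch of "strip the markup and mark positions", the following uses Python's standard `html.parser` to turn an HTML string into one stream of characters. The names `TextStream` and `text_stream` are hypothetical, and the input is assumed to be the body's inner HTML. A browser's parser recovers from broken markup differently, so offsets from this sketch are not guaranteed to match what the Hypothesis client computes.

```python
from html.parser import HTMLParser


class TextStream(HTMLParser):
    """Collect character data in document order, discarding tags and comments."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_stream(html):
    """Return the concatenated text of an HTML fragment (hypothetical helper)."""
    parser = TextStream()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.chunks)


# An ID split by inline markup becomes contiguous again in the text stream.
fragment = "<p>See RRID:<b>AB_</b>123456 &amp; more</p>"
stream = text_stream(fragment)
offset = stream.find("RRID:AB_123456")
```

Because the tags are dropped before matching, an ID that the typesetting split across elements can still be found, and `offset` is its position in the character stream.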

judell commented 5 years ago

It looks like the following will work.

  1. Send document.body.textContent instead of document.body.innerText

  2. Use the start of the RRID match in the textContent stream as TextPosition.start

  3. Use TextPosition.start + length of RRID match as TextPosition.end
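The three steps above can be sketched in Python as follows. This is a minimal sketch, not SciBot's implementation: the regular expression, the 32-character context length, and the function name `selectors_for_matches` are assumptions. The selector field names (`type`, `exact`, `prefix`, `suffix`, `start`, `end`) follow the W3C Web Annotation Data Model.

```python
import re

# Assumed pattern for this sketch; SciBot's real RRID matching is more involved.
RRID_PATTERN = re.compile(r"RRID:\s*[A-Za-z]+_[A-Za-z0-9_:-]+")
CONTEXT = 32  # prefix/suffix length; an assumption, not a documented constant


def selectors_for_matches(text_content):
    """Yield a [TextQuoteSelector, TextPositionSelector] pair per RRID match."""
    for m in RRID_PATTERN.finditer(text_content):
        start, end = m.start(), m.end()  # end == start + len(match)
        yield [
            {
                "type": "TextQuoteSelector",
                "exact": m.group(0),
                "prefix": text_content[max(0, start - CONTEXT):start],
                "suffix": text_content[end:end + CONTEXT],
            },
            {"type": "TextPositionSelector", "start": start, "end": end},
        ]


page_text = "x (RRID:AB_123456) y"  # stand-in for document.body.textContent
results = list(selectors_for_matches(page_text))
```

One caveat: Python string offsets count code points, while JavaScript string offsets count UTF-16 code units, so the two can diverge on pages that contain characters outside the Basic Multilingual Plane. That would need checking before relying on these positions.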

I have verified that:

a) With Range (XPath) anchoring turned off, the Hypothesis client will anchor a case like hypothesis/product-backlog#1022 when it has both TextQuote and TextPosition selectors.

b) The start of an RRID match in the textContent stream does match the TextPosition.start created by the Hypothesis client.

It would, of course, be a major change for SciBot to be looking at document.body.textContent (the text of every node, including hidden elements and the contents of script and style tags) vs document.body.innerText (just the rendered text), so this would require some testing and sanity-checking.

I'll take a crack at making a demo that illustrates how, given the textContent of a web page, to create Hypothesis-compatible selectors for both TextQuote and TextPosition.