Should the manifest have a field that defines the level of granularity an institution's text extractor supports

mjhawkins commented 12 months ago

I was wondering whether it would be useful for the manifest to contain a field that defines what level of granularity an institution's text extractor is capable of supporting (either in general or perhaps even per collection).

My rationale was this: 1) Not every institution might wish to support the ability to extract text from arbitrary positions on a page. They might decide, for example, that page by page is all the granularity they want to support. 2) I like the idea of client ITF viewers being able to offer readers the ability to 'peek' on either side of the extract they're seeing to view its wider context.

The ability to peek on either side of an excerpt could help to counter concerns regarding the malicious selection of text to invert its original meaning. For example, around the turn of the century several studios started selectively putting quotes from credible critics on movie posters, saying things like 'A cinematic masterpiece' and 'Best Picture of the Year' - when the original review said, 'This is far from the best picture of the year and could be considered a cinematic masterpiece for insulting viewers' intelligence.'

For a peek around function to work, the client viewer would need to know what sort of request to generate. There's no point trying to select a dozen words more on either side when the material can only be delivered by the page.

It likely wouldn't take much to provide this information. We could, for example, just go for the old school approach of using various bits of an integer: Bit 1 indicates whether character co-ords are supported; Bit 2 - page references; Bit 3, etc. One 8-bit number would likely be able to convey all the meaning that we'd need. I don't think making extraction from semantic chunks machine actionable is worth attempting - just noting that it's available would be enough for me.
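A minimal sketch of the bit-flag idea, in Python. The class name, the flag names and the assignment of bit 3 to semantic chunks are all assumptions made for illustration; nothing here is defined by the spec.

```python
from enum import IntFlag


class Granularity(IntFlag):
    """Hypothetical flags; the bit assignments are illustrative only."""
    CHARACTER_COORDS = 1 << 0  # "bit 1": character co-ords supported
    PAGE = 1 << 1              # "bit 2": page references supported
    SEMANTIC_CHUNKS = 1 << 2   # "bit 3": assumed here; advertised only, not machine actionable


def decode(value: int) -> Granularity:
    """Turn the integer from the (hypothetical) manifest field into a flag set."""
    return Granularity(value)


# An institution that only serves whole pages would publish 2;
# one that supports all three levels would publish 7.
page_only = decode(2)
everything = decode(7)

print(Granularity.CHARACTER_COORDS in page_only)
print(Granularity.PAGE in everything)
```

A viewer would test membership (`Granularity.PAGE in flags`) before deciding which kind of request to build, and five bits of the 8-bit number would remain free for later use.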

This also raises one additional feature that would be useful - a recommended field to record a hyperlink to a page that outlines how the institution's service works and what the terms and conditions are for using it. This is distinct from the legal status of the text itself.

rralley commented 12 months ago

Would this (item 1) not mean that someone trying to annotate a text from that institution's repository couldn't create an annotation for any fragment shorter than a page? So you wouldn't be able to link the name of a person mentioned in the text to their biographical entry in a reference work, for instance.

FWIW, I like the idea of 2 - I think it would be very worthwhile to be able to see the fragment in context, even if that simply means somehow highlighted within the whole text or section of text, rather than seeing a few words or lines either side. I seem to remember that the general view at the Cambridge workshop was that that was something that could be done in implementation without having to be covered specifically in the spec.

mjhawkins commented 12 months ago

I don't think they would be able to add annotations.

Choosing to only offer page level texts would certainly hinder usefulness, but it feels like something that one provider might want to do - even if only in the first instance. Page level support would be something that I could support quicker on my other projects. It's not as useful as full support but I'd settle for releasing an acceptable service quicker rather than having to wait much longer for a 'perfect' one.

I like the peek around feature too. Thankfully, that's just on the viewer/UI side and it would be easy for me to add it onto a sample ITF viewer. The viewer just needs to know what sort of requests it can make to provide this context - say a few tweets' worth of characters on either side (if supported) or a page on either side (if that's all that's permitted).
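A rough sketch of how a viewer might choose the wider request, in Python. The `Extract` type, the `peek_request` function, the shape of the returned dict and the 560-character default (two 280-character tweets) are all hypothetical; the real request format would come from the API section of the spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Extract:
    """Hypothetical description of what the viewer is currently showing."""
    page: int   # 1-based page number
    start: int  # character offset of the excerpt within the page
    end: int    # character offset one past the end of the excerpt


def peek_request(extract: Extract, supports_chars: bool,
                 context_chars: int = 560,
                 last_page: Optional[int] = None) -> dict:
    """Describe the wider request to make, based on what the provider supports."""
    if supports_chars:
        # Widen the character range on both sides. The end is not clamped,
        # because the page length is not known in this sketch.
        return {
            "kind": "chars",
            "page": extract.page,
            "start": max(0, extract.start - context_chars),
            "end": extract.end + context_chars,
        }
    # Page-only provider: ask for one page on either side instead.
    first = max(1, extract.page - 1)
    last = extract.page + 1
    if last_page is not None:
        last = min(last, last_page)
    return {"kind": "pages", "first": first, "last": last}


print(peek_request(Extract(page=3, start=100, end=150), supports_chars=True))
print(peek_request(Extract(page=3, start=100, end=150), supports_chars=False))
```

The point is only that the branch depends on a single piece of information from the manifest: whether character-level requests are available.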

neilsjefferies commented 11 months ago

There are two different things here: what the API does and how Fragment References might be used. At the moment, only the former is defined. So a reference to an insertion point, for example, isn't discussed, since an API call couldn't return anything for that. That's another section of the spec altogether. Thus it would be possible to annotate at a finer granularity than the API can provide if the annotation software/viewer can handle finer grained access to the returned text.

rralley commented 11 months ago

Ah, understood. I will eventually stop asking questions about bits of the spec that aren't written yet.

UDT-ITF / website

Should the manifest have a field that defines the level of granularity an institution's text extractor supports #56