IIIF / iiif-stories

Community repository for documenting stories and use cases related to uses of the International Image Interoperability Framework.
As an aggregator of full-text, I want to represent, with a vocabulary, the types of blocks of text when using IIIF Annotations (such as word, line, sentence, paragraph, block) #105

Open nfreire opened 6 years ago

nfreire commented 6 years ago

Description

Europeana aggregates full-text produced by OCR from data providers that apply different OCR practices. Post-OCR processing also differs across data providers. The aggregated full-text in Europeana has also been the subject of research on making it processable in research infrastructures for language resources (most importantly CLARIN). In the near future, researchers from these infrastructures may provide Europeana with the results of language tools that improve the structure of the full-text.

The Europeana Data Model is being extended to allow the representation of full-text in a way that is compatible with the IIIF Presentation API v3, using Web Annotations. This creates the need for a common vocabulary for representing the type of text blocks, compatible with both specifications.
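
To make the need concrete, here is a minimal sketch of what a typed text-block annotation might look like. It is an illustration only, not part of the Europeana Data Model or of any agreed specification: the property name `textGranularity`, the set of allowed values (taken from the block types named in this issue), the helper `make_text_annotation`, and the `example.org` URIs are all assumptions.

```python
import json

# Hypothetical vocabulary: the block types mentioned in this issue.
TEXT_BLOCK_TYPES = {"word", "line", "sentence", "paragraph", "block", "page"}


def make_text_annotation(anno_id, canvas_id, xywh, text, block_type):
    """Build a Web Annotation (as a dict) for one OCR text block.

    `block_type` must come from the hypothetical vocabulary above.
    """
    if block_type not in TEXT_BLOCK_TYPES:
        raise ValueError(f"unknown text block type: {block_type!r}")
    x, y, w, h = xywh
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": anno_id,
        "type": "Annotation",
        "motivation": "supplementing",
        # Assumed property name; it is not defined by the context above.
        "textGranularity": block_type,
        "body": {"type": "TextualBody", "value": text, "format": "text/plain"},
        # Region of the canvas covered by this block, as a media fragment.
        "target": f"{canvas_id}#xywh={x},{y},{w},{h}",
    }


if __name__ == "__main__":
    anno = make_text_annotation(
        "https://example.org/anno/1",
        "https://example.org/canvas/1",
        (10, 20, 300, 40),
        "An example line of OCR text",
        "line",
    )
    print(json.dumps(anno, indent=2))
```

With a shared vocabulary, a consumer could filter annotations by block type (for example, only lines) regardless of which data provider produced the OCR.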

aisaac commented 6 years ago

@nfreire thanks for creating this. I am puzzled, though, by the need expressed at the end of the description and in the title: our draft spec at https://docs.google.com/document/d/1t5yGEzQ0KV2rqU0sFDoKnI2bIDBGrmj0f1gSOCRUgJ4/ mentions that we should have "word", "line", "paragraph" and "page".