IIIF / iiif-stories

Community repository for documenting stories and use cases related to uses of the International Image Interoperability Framework.

OCR Word confidence in Annotations #68

Open glenrobson opened 7 years ago

glenrobson commented 7 years ago

Description

I am a harvester of IIIF content who would like to use the OCR word confidence in my index.

Variation(s)

Proposed Solutions

Some way of adding OCR word confidence from ALTO to IIIF Annotations.
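
One possible shape for this, purely as a sketch: a word-level annotation in the IIIF Presentation 2.x / Open Annotation style that carries the confidence next to the text. The `confidence` key is hypothetical and not defined by IIIF or Open Annotation. The function name, the example canvas URI and the sample values are also made up for illustration, and the coordinates are assumed to already be in canvas pixels.

```python
import json


def word_annotation(canvas_uri, text, x, y, w, h, confidence):
    """Build an Open Annotation style dict for a single OCR word.

    NOTE: the "confidence" key is a hypothetical extension used only to
    illustrate the use case; no IIIF specification defines it.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence is assumed to be in the range 0..1")
    return {
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "resource": {
            "@type": "cnt:ContentAsText",
            "format": "text/plain",
            "chars": text,
            "confidence": confidence,  # hypothetical property
        },
        # Region of the canvas covered by the word, as a media fragment.
        "on": f"{canvas_uri}#xywh={x},{y},{w},{h}",
    }


# Illustrative values only; the canvas URI is a placeholder.
anno = word_annotation("https://example.org/canvas/1", "word", 10, 20, 30, 40, 0.9)
print(json.dumps(anno, indent=2))
```

A harvester could then read `resource.confidence` while indexing and, for example, down-weight low-confidence words. Whether the value belongs on the resource, on the annotation, or in a separate service is exactly the open question here.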

Additional Background

This use case came up for newspapers, but I believe it is more widely applicable. Example ALTO:

http://dams.llgc.org.uk/behaviour/llgc-id:3100022/fedora-sdef:alto/getAlto

and IIIF annotation list:

http://dams.llgc.org.uk/iiif/3100022/annotation/list/ART1.json

I believe WC is word confidence:

<String ID="PAG_1_ST1" STYLEREFS="TXT_2" HPOS="921" VPOS="2937" HEIGHT="123" WIDTH="246" WC="0.99" CONTENT="Just"/>
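
For harvesters working from the ALTO directly, a minimal sketch of pulling `WC` out of `String` elements with the Python standard library might look like this. The `TextLine` wrapper in the sample and the helper names are assumptions for illustration. Matching on the local tag name is a shortcut so the same code works whichever ALTO namespace version a file declares.

```python
import xml.etree.ElementTree as ET

# The String element from the example above, wrapped in an illustrative
# TextLine so that it parses as a standalone fragment.
SAMPLE = (
    '<TextLine>'
    '<String ID="PAG_1_ST1" STYLEREFS="TXT_2" HPOS="921" VPOS="2937" '
    'HEIGHT="123" WIDTH="246" WC="0.99" CONTENT="Just"/>'
    '</TextLine>'
)


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def words_with_confidence(alto_xml):
    """Return one dict per ALTO String element, with WC as a float or None."""
    root = ET.fromstring(alto_xml)
    words = []
    for el in root.iter():
        if local_name(el.tag) != "String":
            continue
        wc = el.get("WC")
        words.append({
            "text": el.get("CONTENT", ""),
            "hpos": float(el.get("HPOS", 0)),
            "vpos": float(el.get("VPOS", 0)),
            "width": float(el.get("WIDTH", 0)),
            "height": float(el.get("HEIGHT", 0)),
            # WC is optional in ALTO, so a missing value stays None.
            "confidence": float(wc) if wc is not None else None,
        })
    return words


words = words_with_confidence(SAMPLE)
print(words[0]["text"], words[0]["confidence"])
```

The positions are kept in whatever unit the ALTO file declares. Converting them to canvas pixels is a separate step that this sketch does not attempt.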

jronallo commented 7 years ago

hOCR also provides word confidence, in the x_wconf property of a word's title attribute.

https://docs.google.com/document/d/1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0/preview
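
As a rough sketch of the hOCR side, `x_wconf` can be read from the `title` attribute of `ocrx_word` elements using only the Python standard library. The sample markup and the class name `WordConfidenceParser` are illustrative assumptions. The scale of `x_wconf` is engine-specific (Tesseract, for instance, emits 0 to 100), so the value is returned unscaled.

```python
import re
from html.parser import HTMLParser

# Matches e.g. "x_wconf 99" inside an hOCR title attribute.
WCONF_RE = re.compile(r"\bx_wconf\s+(-?\d+(?:\.\d+)?)")


class WordConfidenceParser(HTMLParser):
    """Collect the text and x_wconf of each ocrx_word span.

    Simplification: assumes ocrx_word spans contain no nested spans.
    """

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = WCONF_RE.search(attrs.get("title") or "")
            self._current = {
                "text": "",
                "confidence": float(match.group(1)) if match else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


# Illustrative hOCR fragment, not taken from a real file.
SAMPLE_HOCR = (
    "<span class='ocrx_word' "
    "title='bbox 921 2937 1167 3060; x_wconf 99'>Just</span>"
)

parser = WordConfidenceParser()
parser.feed(SAMPLE_HOCR)
print(parser.words)
```

Because ALTO `WC` in the example above is on a 0 to 1 scale and hOCR values may not be, a harvester indexing both would need to normalise them before comparing.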

cneud commented 7 years ago

@glenrobson Yes, "WC" is used for "word confidence" in ALTO. Please note that there is an ongoing discussion with regard to how confidence values should be derived and expressed in future ALTO versions: https://github.com/altoxml/schema/issues/23.