Living-with-machines / alto2txt

Convert ALTO XML to plain text + minimal metadata
MIT License

Added 'digitised' to the heading #55

Closed: mialondon closed 1 year ago

mialondon commented 1 year ago

METS and ALTO are artefacts of the digitisation process that link metadata and transcribed text to images of the physical page; born-digital newspapers don't use the same formats. So adding 'digitised' to the heading makes the scope a little clearer, and it might also aid discoverability.

mialondon commented 1 year ago

@griff-rees yes, technically any digitised/digital text documents could use the formats.

There's a nice explanation at https://www.coloradohistoricnewspapers.org/forum/what-is-metsalto/

'METS describes the structure of the object but does not encode the actual textual content of the object. The ALTO standard fills this void by encoding the textual content of a digitized page in great detail, including styles and layouts. As well as encoding the digitized text itself ALTO encodes the spatial coordinates of every column, line, and word as it appears on the page.'

IIRC tools like ABBYY FineReader can output ALTO, so you'd find transcriptions in ALTO without associated METS records.
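
To make the quoted description concrete, here is a minimal sketch of reading word text and coordinates from a standalone ALTO file with no METS record, using only the Python standard library. The sample document and the `extract_lines` helper are invented for illustration and are not how alto2txt itself does the conversion; the sketch assumes the common `TextLine` / `String` layout with `CONTENT`, `HPOS` and `VPOS` attributes, and matches on local tag names so it does not depend on a particular ALTO namespace version.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO fragment (not taken from a real newspaper).
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1" HPOS="100" VPOS="200" WIDTH="800" HEIGHT="120">
          <TextLine ID="TL1" HPOS="100" VPOS="200" WIDTH="800" HEIGHT="50">
            <String CONTENT="Digitised" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="50"/>
            <SP HPOS="400" VPOS="200" WIDTH="20"/>
            <String CONTENT="newspapers" HPOS="420" VPOS="200" WIDTH="380" HEIGHT="50"/>
          </TextLine>
          <TextLine ID="TL2" HPOS="100" VPOS="270" WIDTH="800" HEIGHT="50">
            <String CONTENT="are" HPOS="100" VPOS="270" WIDTH="90" HEIGHT="50"/>
            <SP HPOS="190" VPOS="270" WIDTH="20"/>
            <String CONTENT="searchable" HPOS="210" VPOS="270" WIDTH="340" HEIGHT="50"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_lines(alto_xml):
    """Return one list per TextLine of (word, hpos, vpos) tuples."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [
            (child.get("CONTENT"), float(child.get("HPOS")), float(child.get("VPOS")))
            for child in element
            if local_name(child.tag) == "String"  # skip SP (space) elements
        ]
        lines.append(words)
    return lines


lines = extract_lines(SAMPLE_ALTO)

# Plain text: join the words of each line with spaces, one line per TextLine.
text = "\n".join(" ".join(word for word, _, _ in line) for line in lines)
print(text)

# Coordinates of each word, in the units the ALTO file declares.
for line in lines:
    for word, hpos, vpos in line:
        print(f"{word!r} at HPOS={hpos}, VPOS={vpos}")
```

Nothing in the sketch needs a METS file, which matches the point above: the text and its page coordinates live entirely in the ALTO document, while METS only adds the structural metadata around it.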