Git-Lit / git-lit

Scripts to create git repositories for ALTO XML texts, like those from the British Library's scanned documents.

Identify signature numbers & stray headers #33

Open tfmorris opened 8 years ago

tfmorris commented 8 years ago

Older books may have signature numbers on pages which need to be removed or moved out of line. It doesn't appear that ABBYY's layout analysis reliably identifies them as being part of the bottom margin.
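As a rough illustration of what detecting these might look like, here is a minimal sketch that checks whether the last non-blank line of a page looks like a signature mark. It assumes the page text has already been pulled out of the ALTO XML into a plain list of lines. The names `SIGNATURE_RE`, `is_signature_mark` and `split_signature` are hypothetical, and the pattern is only a guess at common forms ("B", "B2", "C 3", "Aa4"); real signature schemes vary more than this.

```python
import re

# Assumed pattern: one or two letters, optionally followed by a small number.
# This is a guess at common forms, not a complete description of signature marks.
SIGNATURE_RE = re.compile(r"^[A-Za-z]{1,2}\s?\d{0,2}$")


def is_signature_mark(line):
    """Return True if the line is short enough and shaped like a signature mark."""
    text = line.strip().strip(".*[]()")
    return bool(SIGNATURE_RE.match(text))


def split_signature(page_lines):
    """Split a page into (body_lines, signature).

    Only the last non-blank line is examined, since signature marks sit in
    the bottom margin. signature is None when nothing matches.
    """
    for i in range(len(page_lines) - 1, -1, -1):
        if page_lines[i].strip():
            if is_signature_mark(page_lines[i]):
                return page_lines[:i], page_lines[i].strip()
            break
    return list(page_lines), None


body, signature = split_signature(["and so he went", "to the town.", "B 2"])
print(body, signature)
```

A page whose text really ends in a one- or two-letter word would be a false positive, so in practice this would probably need to be combined with position information from the ALTO layout.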

ABBYY is pretty good about tagging headers in the example I looked at (sample size = 1), but 2 or 3 did slip through (out of 150), so we'll probably need to be prepared to look for stray headers as well.
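One possible way to catch the stragglers is to compare the first body line of each page against the headers that were tagged correctly on other pages. The sketch below assumes both have already been extracted as plain strings. `normalize`, `find_stray_headers` and the 0.85 threshold are hypothetical choices for illustration, and the fuzzy match uses Python's standard `difflib.SequenceMatcher` to tolerate small OCR errors.

```python
import re
from difflib import SequenceMatcher


def normalize(line):
    """Drop digits and punctuation, collapse whitespace, and casefold."""
    text = re.sub(r"[\d\W_]+", " ", line)
    return " ".join(text.split()).casefold()


def find_stray_headers(first_lines, tagged_headers, threshold=0.85):
    """Return indices of pages whose first body line resembles a known header.

    first_lines: first body line of each page (one string per page).
    tagged_headers: header strings the layout analysis did tag.
    threshold: assumed similarity cutoff; would need tuning on real data.
    """
    known = {normalize(h) for h in tagged_headers}
    known.discard("")
    stray = []
    for page_no, line in enumerate(first_lines):
        candidate = normalize(line)
        if candidate and any(
            SequenceMatcher(None, candidate, k).ratio() >= threshold
            for k in known
        ):
            stray.append(page_no)
    return stray


tagged = ["THE HISTORY OF ROME. 12"]
firsts = ["It was a dark night", "14 THE HISTOIY OF ROME", "The history of Rome"]
print(find_stray_headers(firsts, tagged))
```

Stripping digits before comparing means the page number attached to a running header does not prevent a match.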

tfmorris commented 8 years ago

In addition to signature numbers/marks, catchwords will also need to be identified and removed. Page footers may occur in addition to or instead of headers.
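Since a catchword repeats the first word of the following page, one heuristic is to look for pages whose last non-blank line is a single token that matches the start of the next page. This is only a sketch under the assumption that pages are available as lists of text lines with headers and signature marks already removed; `find_catchwords` is a hypothetical name.

```python
def find_catchwords(pages):
    """Return (page_index, catchword) pairs for likely catchwords.

    pages: list of pages, each a list of text lines.
    A hit is a page whose last non-blank line is a single token that the
    first word of the next page starts with (catchwords are sometimes
    only the first part of a word).
    """
    hits = []
    for i in range(len(pages) - 1):
        current = [line for line in pages[i] if line.strip()]
        following = [line for line in pages[i + 1] if line.strip()]
        if not current or not following:
            continue
        tokens = current[-1].split()
        if len(tokens) != 1:
            continue
        catch = tokens[0].strip(".,;:-").casefold()
        first_word = following[0].split()[0].strip(".,;:-").casefold()
        if catch and first_word.startswith(catch):
            hits.append((i, tokens[0]))
    return hits


pages = [
    ["he went to the", "market and", "bought"],
    ["bought a loaf of bread", "and"],
    ["then he left"],
]
print(find_catchwords(pages))
```

A short final line that is ordinary text (the "and" on the second page above) is not flagged because the next page does not begin with it.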

Footnotes & endnotes are related, but require separate treatment, so I'll create a separate issue for them.