Git-Lit / git-lit

Scripts to create git repositories for ALTO XML texts, like those from the British Library's scanned documents.

extract plain text files from ALTO XML #4

Closed JonathanReeve closed 8 years ago

tfmorris commented 8 years ago

I believe this has already been done in the source files. For example,

https://github.com/Git-Lit/000624240/blob/master/ALTO/000624240_000003.txt

is the extracted text from the page 3 ALTO file:

https://github.com/Git-Lit/000624240/blob/master/ALTO/000624240_000003.xml

The two things which could use improving are:

JonathanReeve commented 8 years ago

Since only three of the four sample texts have extracted plaintext files, I suspect that not all of the British Library texts will come with extracted plaintext. And since the existing .txt files don't look great, as you noticed, it might be a good idea to extract plaintext directly from the ALTO XML. I imagine this could be done fairly easily with a little XPath magic.
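
As a rough illustration of the extraction idea above, here is a minimal Python sketch, not the project's actual code. It assumes the usual ALTO layout, where each `TextLine` element holds `String` children whose `CONTENT` attribute carries one word. The function names `local_name` and `alto_to_lines` and the sample XML are made up for illustration. It matches on local tag names instead of a full XPath expression so that it does not depend on which ALTO namespace version a file declares.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def alto_to_lines(xml_text):
    """Return one plain-text line per ALTO TextLine element."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each String child is assumed to hold one word in CONTENT.
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


# Hypothetical two-line page, for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="THE"/><SP/><String CONTENT="HISTORY"/></TextLine>
    <TextLine><String CONTENT="OF"/><SP/><String CONTENT="LONDON"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

print("\n".join(alto_to_lines(SAMPLE)))
```

The sketch skips `SP` and `HYP` elements, so words hyphenated at a line end are not rejoined, and it makes no attempt to group lines into paragraphs.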

tfmorris commented 8 years ago

I didn't realize that the volumes don't come with a consistent set of files. (Re)Extracting the text from the original source makes sense.

hOCR (i.e. HTML) might be another useful format, since it can be displayed directly in web browsers.

tfmorris commented 8 years ago

A first pass of this is available in ea401dd. It puts a big ugly "--- Page Marker ---" between every pair of pages and doesn't make any attempt to identify paragraphs or merge words/paragraphs which span page boundaries, but it does get all the raw text out.
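
The page-joining behaviour described above might look roughly like the following Python sketch. This is an assumption-laden illustration and not the code in that commit: the marker string is taken from the comment, while the function name `join_pages` and the exact newline handling are hypothetical.

```python
PAGE_MARKER = "--- Page Marker ---"


def join_pages(pages):
    """Concatenate page texts, placing the marker between each pair of pages.

    No paragraph detection is attempted, and words or paragraphs that
    span a page boundary are left split.
    """
    separator = "\n" + PAGE_MARKER + "\n"
    # Trim stray newlines at page edges so the marker sits on its own line.
    return separator.join(page.strip("\n") for page in pages)


print(join_pages(["text of page one", "text of page two"]))
```

Because the marker is only a separator, a volume with N pages gets N - 1 markers, and a single-page volume gets none.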