extract plain text files from ALTO XML

tfmorris commented 8 years ago

I believe this has already been done in the source files. For example,

https://github.com/Git-Lit/000624240/blob/master/ALTO/000624240_000003.txt

is the extracted text from the page 3 ALTO file:

https://github.com/Git-Lit/000624240/blob/master/ALTO/000624240_000003.xml

The two things which could use improving are:

1. Concatenating all pages into a single text file (the likely looking candidate https://github.com/Git-Lit/000624240/blob/master/ALTO/000624240.txt is actually a mix of file names and extracted text -- and not even in page order).
2. The extracted text files appear to be a single long line of text per page, with no line or paragraph breaks. Line breaks (especially if hyphenation processing hasn't been done) and paragraph breaks would be useful additions.

JonathanReeve commented 8 years ago

Since only three out of the four sample texts have extracted plaintext files, I'm inclined to think that not all of the British Library texts will have extracted plaintext. And since the existing .txt files don't look great, as you noticed, it might be a good idea to extract plaintext directly from the ALTO XML. I imagine this could be done fairly easily with a little XPath magic.
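
As a rough illustration of what that extraction might look like, here is a minimal sketch in Python. It relies on the standard ALTO element names (TextBlock, TextLine, String with a CONTENT attribute) and treats each TextBlock as a paragraph, which is an assumption about how these particular files are structured. It walks the tree with ElementTree and compares local tag names rather than using real XPath expressions, so it doesn't depend on which ALTO namespace (if any) a file declares. The function name `alto_page_text` is made up for this example, and hyphenation is not handled.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Return an element's tag name without any '{namespace}' prefix."""
    return tag.rsplit('}', 1)[-1]


def alto_page_text(xml_string):
    """Extract the text of one ALTO page (hypothetical helper).

    Assumption: each TextBlock is a paragraph and each TextLine a line.
    Words on a line are joined with single spaces; SP and HYP elements
    are ignored, so hyphenated words are not rejoined.
    """
    root = ET.fromstring(xml_string)
    paragraphs = []
    for block in root.iter():
        if local(block.tag) != 'TextBlock':
            continue
        lines = []
        for line in block.iter():
            if local(line.tag) != 'TextLine':
                continue
            # String elements are direct children of TextLine in ALTO.
            words = [el.get('CONTENT', '') for el in line
                     if local(el.tag) == 'String']
            lines.append(' '.join(words))
        paragraphs.append('\n'.join(lines))
    # A blank line separates paragraphs.
    return '\n\n'.join(paragraphs)
```

Calling `alto_page_text(open(path, encoding='utf-8').read())` on a page file would give one line of output per TextLine, with a blank line between blocks.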

tfmorris commented 8 years ago

I didn't realize that the volumes don't come with a consistent set of files. (Re)Extracting the text from the original source makes sense.

hOCR (i.e., HTML) might be another useful format since it's displayable directly in web browsers.

tfmorris commented 8 years ago

A first pass of this is available in ea401dd. It puts a big ugly "--- Page Marker ---" between every pair of pages and doesn't make any attempt to identify paragraphs or merge words/paragraphs which span page boundaries, but it does get all the raw text out.
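
For illustration only (this is not the code in that commit), joining pages in page order with a marker between them could look something like the sketch below. It assumes page files are named like 000624240_000003.xml, with the page number after the underscore, as in the URLs earlier in this thread. The names `page_number` and `concatenate_pages` are hypothetical.

```python
import re

PAGE_MARKER = "--- Page Marker ---"


def page_number(filename):
    """Pull the numeric page suffix out of a name like 000624240_000003.xml.

    Assumption: every page file follows the <volume>_<page>.xml pattern.
    """
    match = re.search(r'_(\d+)\.xml$', filename)
    if match is None:
        raise ValueError("unexpected page file name: " + filename)
    return int(match.group(1))


def concatenate_pages(pages):
    """Join page texts in page order with a marker line between pages.

    `pages` maps file names to the text already extracted for that page.
    """
    ordered = sorted(pages, key=page_number)
    separator = "\n" + PAGE_MARKER + "\n"
    return separator.join(pages[name] for name in ordered)
```

Sorting on the parsed page number rather than the raw file name keeps the order right even if the zero padding is inconsistent between volumes.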

Git-Lit / git-lit

extract plain text files from ALTO XML #4