Closed JonathanReeve closed 8 years ago
Since only three of the four sample texts have extracted plaintext files, I'm inclined to think that not all of the British Library texts will have extracted plaintext. And since the existing .txt files don't look great, as you noticed, it might be a good idea to extract plaintext directly from the ALTO XML. I imagine this could be done fairly easily with a little XPath magic.
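To make the XPath idea concrete, here is a minimal sketch in Python using only the standard library. It assumes the usual ALTO layout, where each `TextLine` element holds `String` elements whose `CONTENT` attribute carries one word; the function name `alto_page_text` is made up for illustration, and the `{*}` namespace wildcard needs Python 3.8 or later. I haven't checked it against the British Library files themselves.

```python
import xml.etree.ElementTree as ET


def alto_page_text(xml_source: str) -> str:
    """Return the plain text of one ALTO page, one output line per TextLine.

    Assumption: words live in the CONTENT attribute of String elements.
    SP (space) and HYP (hyphen) elements are ignored, so end-of-line
    hyphens are dropped and words are simply joined with single spaces.
    """
    root = ET.fromstring(xml_source)
    lines = []
    # "{*}" matches any namespace, so the ALTO schema version doesn't matter.
    for text_line in root.iterfind(".//{*}TextLine"):
        words = [s.get("CONTENT", "") for s in text_line.iterfind("{*}String")]
        lines.append(" ".join(words))
    return "\n".join(lines)
```

Calling `alto_page_text(open(path, encoding="utf-8").read())` on a single page file should give that page's text, where `path` is whichever ALTO file you want to read.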
I didn't realize that the volumes don't come with a consistent set of files. (Re)Extracting the text from the original source makes sense.
hOCR (i.e. HTML) might be another useful format, since it's displayable directly in web browsers.
A first pass of this is available in ea401dd. It puts a big ugly "--- Page Marker ---" between every pair of pages and doesn't make any attempt to identify paragraphs or merge words/paragraphs which span page boundaries, but it does get all the raw text out.
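For anyone skimming, the page-joining behaviour described above amounts to roughly the following. This is an illustrative sketch, not the code in ea401dd; `join_pages` and `PAGE_MARKER` are hypothetical names, and putting the marker on its own line is my assumption.

```python
PAGE_MARKER = "--- Page Marker ---"


def join_pages(pages: list[str]) -> str:
    """Concatenate per-page texts, with the marker on its own line between pages.

    No paragraph detection and no merging of words or paragraphs that
    span a page boundary: the raw page texts are kept exactly as given.
    """
    return f"\n{PAGE_MARKER}\n".join(pages)
```

Keeping the marker on its own line makes it easy to split the volume back into pages later with `text.split(f"\n{PAGE_MARKER}\n")`.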
I believe this has already been done in the source files. For example,
is the extracted text from the page 3 ALTO file:
The two things which could use improving are: