Decouple object model from storage format

mikej888 commented 5 years ago

defoe/alto|books|fmp/archive.py each define Archive, though Archive is not part of the data model (books/pages) but part of how the data is bundled. It should be possible, for example, to run queries over ALTO-compliant books that are not in ZIP files. Conversely, it should be possible to run queries over British Library Newspapers that are in ZIP files.

mikej888 commented 5 years ago

Getting BLN/TDA article text requires pulling content out of the article's element in a single XML document.

In the Papers Past New Zealand newspapers data set, each issue is spread across multiple XML files and the results are at the article level. Reconstructing issues would require parsing multiple XML documents and rebuilding each issue from its articles, using newspaper names and publication dates/times as the linking criteria. A better understanding of how these XML documents were accessed from the PP NZ API (i.e. what queries were run over that API) would help, but I suspect it would be non-trivial.
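As a rough sketch of the linking step only, assuming each article has already been parsed into a dict: the field names `newspaper` and `date` and the function `group_articles_into_issues` are hypothetical and do not come from the PP NZ data or from defoe.

```python
from collections import defaultdict


def group_articles_into_issues(articles):
    """Group parsed article records into issues.

    Each record is assumed to be a dict with hypothetical "newspaper"
    and "date" fields; the real PP NZ field names may differ.
    """
    issues = defaultdict(list)
    for article in articles:
        # Newspaper name plus publication date is the linking criterion.
        key = (article["newspaper"], article["date"])
        issues[key].append(article)
    # Return a plain dict so that missing keys raise KeyError.
    return dict(issues)
```

This only works if newspaper names and dates are recorded consistently across the XML files. Any variation in spelling or date format would need normalising before grouping.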

The challenge arising from this data set is similar to that for LWM and the Find My Past newspapers ALTO subset. In that case, articles are spread across multiple XML files, each representing a page of a specific issue of a newspaper. While the notion of an issue is present, an article may span multiple XML documents and an XML document may contain multiple articles. Getting article text may require reading, from the metadata document, which pages (and so which XML documents) contain the article's text, then parsing each of these in turn. This may be complicated by the fact that the page XML documents don't seem to have any metadata specifically identifying articles, which may in turn mean searching the page text for the article title as recorded in the metadata document.

There seems to be the need for two layers:

One that reflects how the data is physically stored e.g. ZIP files, XML files.
One that exposes this in terms of a document's structure e.g. books/chapters or issues/articles.
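A minimal sketch of how the two layers might be separated, offered as an illustration only. The names `Storage`, `ZipStorage`, `DirectoryStorage` and `Book` are hypothetical and are not the existing defoe classes, and the sketch assumes one XML file per page.

```python
import os
import zipfile
from abc import ABC, abstractmethod


class Storage(ABC):
    """Physical layer: knows where the bytes live, nothing about books."""

    @abstractmethod
    def filenames(self):
        """Return the names of all stored files."""

    @abstractmethod
    def open(self, name):
        """Return a binary file-like object for one stored file."""


class ZipStorage(Storage):
    """Files bundled in a ZIP archive."""

    def __init__(self, path):
        self._zip = zipfile.ZipFile(path)

    def filenames(self):
        return sorted(self._zip.namelist())

    def open(self, name):
        return self._zip.open(name)


class DirectoryStorage(Storage):
    """Files sitting unbundled in a single directory."""

    def __init__(self, root):
        self._root = root

    def filenames(self):
        return sorted(
            name for name in os.listdir(self._root)
            if os.path.isfile(os.path.join(self._root, name))
        )

    def open(self, name):
        return open(os.path.join(self._root, name), "rb")


class Book:
    """Logical layer: exposes pages, whatever the storage is."""

    def __init__(self, storage, suffix=".xml"):
        self._storage = storage
        self._suffix = suffix

    def page_names(self):
        # Assumption: every file with the suffix is one page.
        return [n for n in self._storage.filenames()
                if n.endswith(self._suffix)]

    def page_bytes(self, name):
        with self._storage.open(name) as handle:
            return handle.read()
```

With this split, `Book(ZipStorage("book.zip"))` and `Book(DirectoryStorage("book/"))` would expose the same pages to a query, so the query code would not need to know how the data is bundled.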

mikej888 commented 5 years ago

This may then make #10 a bit easier.
