alan-turing-institute / defoe

Code to analyse books and newspapers data using Apache Spark.
MIT License
17 stars, 3 forks

Decouple object model from storage format #11

Open mikej888 opened 5 years ago

mikej888 commented 5 years ago

defoe/alto|books|fmp/archive.py each define Archive, even though Archive is not part of the data model (books/pages) but part of how the data is bundled. It should be possible, for example, to run queries over ALTO-compliant books that are not in ZIP files. Conversely, it should be possible to run queries over British Library Newspapers that are in ZIP files.
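
As a rough illustration of the separation meant here, the sketch below keeps "how the data is bundled" behind a small interface so that the same code can walk a ZIP file or a plain directory. The names Container, ZipContainer, DirectoryContainer and xml_documents are hypothetical and are not taken from the defoe code base; only the Python standard library is used.

```python
import os
import tempfile
import zipfile
from abc import ABC, abstractmethod


class Container(ABC):
    """Hypothetical storage layer: knows how documents are bundled,
    and nothing about books, pages, issues or articles."""

    @abstractmethod
    def filenames(self):
        """Return the sorted names of the documents in this container."""

    @abstractmethod
    def read(self, name):
        """Return the raw bytes of the named document."""


class ZipContainer(Container):
    """Documents bundled in a single ZIP file."""

    def __init__(self, path):
        self.path = path

    def filenames(self):
        with zipfile.ZipFile(self.path) as archive:
            # Skip directory entries, which end with a slash.
            return sorted(n for n in archive.namelist() if not n.endswith("/"))

    def read(self, name):
        with zipfile.ZipFile(self.path) as archive:
            return archive.read(name)


class DirectoryContainer(Container):
    """Documents stored as ordinary files under a directory."""

    def __init__(self, path):
        self.path = path

    def filenames(self):
        names = []
        for root, _dirs, files in os.walk(self.path):
            for filename in files:
                full = os.path.join(root, filename)
                relative = os.path.relpath(full, self.path)
                # Use forward slashes so names match ZIP member names.
                names.append(relative.replace(os.sep, "/"))
        return sorted(names)

    def read(self, name):
        with open(os.path.join(self.path, *name.split("/")), "rb") as stream:
            return stream.read()


def xml_documents(container):
    """Yield (name, bytes) for each XML document, whatever the storage format."""
    for name in container.filenames():
        if name.lower().endswith(".xml"):
            yield name, container.read(name)
```

With something along these lines, an object model for books or newspapers would be built from the bytes yielded by xml_documents and would not need to know whether they came from a ZIP file.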

mikej888 commented 5 years ago

Getting BLN/TDA article text requires pulling content out of the article's element in a single XML document.

In the Papers Past New Zealand newspapers data set, each issue is spread across multiple XML files and the results are at the article level. Reconstructing issues would require parsing multiple XML documents and rebuilding each issue from its articles, using newspaper names and publication dates/times as the linking criteria. A better understanding of how these XML documents were retrieved from the PP NZ API (i.e. what queries were run over that API) would help, but I suspect it would still be non-trivial.
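
A minimal sketch of the linking step described above, assuming the articles have already been parsed into simple records. The Article class and its field names are hypothetical and do not reflect the actual PP NZ XML schema; the only idea taken from the comment is grouping by newspaper name and publication date.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    """Hypothetical parsed article record; field names are assumptions."""
    newspaper: str
    published: date
    title: str
    text: str


def group_into_issues(articles):
    """Group articles into issues keyed by (newspaper name, publication date)."""
    issues = defaultdict(list)
    for article in articles:
        issues[(article.newspaper, article.published)].append(article)
    return dict(issues)
```

This treats every article with the same newspaper name and date as one issue, which would need revisiting if a newspaper published more than one edition per day or if names are recorded inconsistently.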

The challenge arising from this data set is similar to that for LWM and the Find My Past newspapers ALTO subset. In that case, articles are spread across multiple XML files, each representing a page of a specific issue of a newspaper. While the notion of an issue is present, an article may span multiple XML documents and an XML document may contain multiple articles. Getting article text may require first reading, from the metadata document, which pages (and so which XML documents) contain the article's text, then parsing each of these in turn. This may be complicated by the fact that the page XML documents don't seem to have any metadata specifically identifying articles, which may in turn mean having to search the page text for the article title as recorded in the metadata document.
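
The title search mentioned above could look roughly like this, assuming the page text has already been extracted into plain strings and the metadata document has supplied a list of candidate page identifiers. The function names and the dict-based inputs are hypothetical. Exact matching after whitespace and case normalisation is an assumption that OCR errors would likely break, so a real implementation would probably need fuzzy matching.

```python
def normalise(text):
    """Lower-case the text and collapse runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def pages_containing_title(title, page_texts, candidate_pages):
    """Return the candidate page ids whose text contains the article title.

    page_texts maps a page id to that page's plain text;
    candidate_pages lists the page ids named in the article's metadata.
    """
    wanted = normalise(title)
    return [
        page_id
        for page_id in candidate_pages
        if page_id in page_texts and wanted in normalise(page_texts[page_id])
    ]
```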

There seems to be a need for two layers:

mikej888 commented 5 years ago

This may then make #10 a bit easier.