UCL / i_newspaper_rods

Working with iRods to analyse the Times Digital Archive
MIT License

Return results in yaml with article title #13

Open raquelalegre opened 6 years ago

raquelalegre commented 6 years ago

For example, for this Xpath:

/GALENP/Newspaper/issue/page[4]/article[2]

we opened file:

rd009s/2TB-Drive-Transfer-06-07-q2016/TDA_GDA_1785-2009/1990/19900201/0FFO-1990-0201.xml

We searched pages 4 and 5 and articles 2 and 3 (because we are unsure whether the indices start at 0 or at 1), but we couldn't find the words the yaml says are mentioned, in this case:

- liverpool
- london
- manchester

@klapaukh any chance you can regenerate the article matches with the article title as well? That'd help locating where the mentions are.

raquelalegre commented 6 years ago

We also found some XMLs have more than one GALENP element in them. It might be a supplement or something, I haven't had the time to check. This problem is also happening in the XML above.

The yaml right now includes this XPath to an article:

01-02-1990:
[...]
  - /GALENP/Newspaper/issue/page[56]/article[3]

The XML for that issue doesn't have a page 56. Instead it has 2 GALENP elements, one with 39 pages and another with 22, so I guess your XPath should instead be:

  - /GALENP/Newspaper[2]/issue/page[17]/article[3]

(Or page 16 if the indices start at 0 and not 1). I think that instead of changing this in the model, which could be a pain, we could do with just having the unique ID of the article, which I just found to be in this XPath:

  /GALENP/Newspaper/issue/page/article/id

Values for this look like:

0FFO-YYYY-MMDD-PPPP-AAA

with PPPP being the 4-digit page number, starting at 1, and AAA being the 3-digit article ID, also starting at 1. E.g. 0FFO-1990-0201-0021-001. Tessa already knows how to find this, and that'd help her locate the articles in the XML where the title is.
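As a rough illustration of the ID layout described above, here is a minimal Python sketch that splits an ID into its parts. The pattern is inferred only from the 0FFO-YYYY-MMDD-PPPP-AAA description in this comment; the names `ARTICLE_ID_RE`, `ArticleId` and `parse_article_id` are made up for the example and are not part of the project.

```python
import re
from typing import NamedTuple

# Assumed layout, taken from the description above: 0FFO-YYYY-MMDD-PPPP-AAA
ARTICLE_ID_RE = re.compile(r"^0FFO-(\d{4})-(\d{2})(\d{2})-(\d{4})-(\d{3})$")


class ArticleId(NamedTuple):
    year: int
    month: int
    day: int
    page: int     # 1-based page number
    article: int  # 1-based article number within the page


def parse_article_id(value: str) -> ArticleId:
    """Split an article ID into its numeric parts, or raise ValueError."""
    match = ARTICLE_ID_RE.match(value.strip())
    if match is None:
        raise ValueError(f"not a recognised article ID: {value!r}")
    return ArticleId(*(int(part) for part in match.groups()))


print(parse_article_id("0FFO-1990-0201-0021-001"))
```

An ID that does not match the assumed layout (for instance one with a 2-digit article part) raises `ValueError` rather than being silently accepted.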

klapaukh commented 6 years ago

I cannot reproduce this error. I tried getting the file from WOS using the oids and via Cyberduck through the GUI. In both cases I got XML containing only a single GALENP / newspaper / issue, and the XPath expressions (which are 1-indexed) return sensible articles for me. I tested the XPath using Firefox's XPath engine (at the console you can run $x("/some/xpath/")).
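For anyone who would rather check the 1-indexing outside the browser, here is a small sketch using Python's standard `xml.etree.ElementTree`. The sample document is invented for the example and only mimics the GALENP / Newspaper / issue / page / article nesting; the ID values in it are placeholders, not real archive IDs.

```python
import xml.etree.ElementTree as ET

# Tiny made-up document that mimics the element nesting used in this thread.
SAMPLE = """
<GALENP><Newspaper><issue>
  <page><article><id>P1-A1</id></article></page>
  <page>
    <article><id>P2-A1</id></article>
    <article><id>P2-A2</id></article>
  </page>
</issue></Newspaper></GALENP>
""".strip()

root = ET.fromstring(SAMPLE)  # root is the GALENP element

# ElementTree paths are relative to the element they are called on,
# so the leading "/GALENP" of the absolute XPath is dropped here.
first = root.find("Newspaper/issue/page[1]/article[1]/id")
second = root.find("Newspaper/issue/page[2]/article[2]/id")

print(first.text)   # first article of the first page
print(second.text)  # second article of the second page
```

Positional predicates such as `page[1]` select the first matching child, which is the same 1-based convention XPath uses.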

The first article's id is 0FFO-1990-0201-0004-00 and the second one's is 0FFO-1990-0201-0056-003.

raquelalegre commented 6 years ago

Something is wrong here; it sounds like we are looking at different files. I can't find any 0FFO-1990-0201-0056-003 ID in the 0FFO-1990-0201.xml file. There is nothing starting with 0FFO-1990-0201-0056 because there is no page marked as 56. And there are two GALENP elements. I've sent the file to you on Slack. Can you send us the title of the article(s), so we can try to find it on page 56 of the scans for the day?
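To make the page arithmetic from my earlier comment concrete, here is a sketch of how a page index counted across the whole file would map onto a section and a page within it. It assumes the 39 and 22 page counts I saw in my copy of the file and that page[56] was counted across both sections, which is only my guess. `locate_page` is a hypothetical helper, not something in the repo.

```python
from typing import List, Tuple


def locate_page(global_page: int, pages_per_section: List[int]) -> Tuple[int, int]:
    """Map a 1-based page index counted across all sections to
    (1-based section index, 1-based page index within that section)."""
    if global_page < 1:
        raise ValueError("page indices are 1-based")
    remaining = global_page
    for section, count in enumerate(pages_per_section, start=1):
        if remaining <= count:
            return section, remaining
        remaining -= count  # skip past this section's pages
    total = sum(pages_per_section)
    raise ValueError(f"page {global_page} exceeds the {total} pages available")


# Page counts assumed from the file described in this thread: 39 and 22.
section, page = locate_page(56, [39, 22])
print(f"/GALENP/Newspaper[{section}]/issue/page[{page}]/article[3]")
```

With those counts, page 56 falls in the second section as page 17, which is where the XPath I suggested earlier comes from.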