Conal-Tuohy / TroveProxy

A transforming proxy and harvester for the National Library of Australia's Trove API
Apache License 2.0

Pipeline People Data from SRU from API results #7

Closed mraadgev closed 10 months ago

mraadgev commented 10 months ago

From the Trove API people results, take the people@id value and push it to http://www.nla.gov.au/apps/srw/search/peopleaustralia?query=oai.identifier+%253D+**[people@id]**&version=1.1&operation=searchRetrieve&recordSchema=urn%3Aisbn%3A1-931666-33-4&maximumRecords=10&startRecord=1&resultSetTTL=300&recordPacking=xml&recordXPath=&sortKeys=

Then we can parse the output.

Conal-Tuohy commented 10 months ago

I added a new pipeline step z:enhance-people-data, and I'm piping every Trove response document through that step.

Inside the step, if the response document contains at least one people record, then the make-people-australia-http-request.xsl stylesheet converts the response into a c:request document that specifies an SRU request. The href contains a long CQL query that uses OR to query for records whose oai.identifier equals any of the person/@id values in the document.
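To make the shape of that SRU request concrete, here is a minimal Python sketch of the URL construction. It is not the stylesheet's actual logic, only an illustration of the OR-joined CQL query. The function names are hypothetical, quoting the identifier values in the CQL is an assumption, and setting maximumRecords to the number of ids is likewise an assumption; the base URL and record schema come from the issue description.

```python
from urllib.parse import urlencode

# Base URL of the People Australia SRU service, taken from the issue description.
SRU_BASE = "http://www.nla.gov.au/apps/srw/search/peopleaustralia"
# Record schema identifier used in the issue description's example URL.
EAC_CPF_SCHEMA = "urn:isbn:1-931666-33-4"


def build_cql_query(ids):
    """Join one 'oai.identifier = "<id>"' clause per id with OR.

    ASSUMPTION: the values are quoted; the real stylesheet may not quote them.
    """
    return " OR ".join(f'oai.identifier = "{i}"' for i in ids)


def build_sru_url(ids):
    """Return a searchRetrieve URL that asks for every id in one request."""
    params = {
        "query": build_cql_query(ids),
        "version": "1.1",
        "operation": "searchRetrieve",
        "recordSchema": EAC_CPF_SCHEMA,
        # ASSUMPTION: ask for as many records as there are ids.
        "maximumRecords": str(len(ids)),
        "startRecord": "1",
    }
    return SRU_BASE + "?" + urlencode(params)
```

With 100 ids the query string grows to 100 OR-joined clauses, which is why the n=100 case is the one worth testing.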

Then the Trove response and the People Australia SRU c:response are merged. The Trove XML and the SRU response are wrapped together into a single document, and the merge-eac-cpf-into-people.xsl stylesheet applies an identity transform to the Trove response. One additional template matches and copies each people element, finds its related eac-cpf record using an xsl:key, and inserts that record as the people element's last child.
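The merge can be pictured with a rough Python analogue, in which a dictionary plays the role of the xsl:key. This is a sketch under simplifying assumptions, not a port of the stylesheet: the XML is treated as un-namespaced, each SRU record is assumed to be an eac-cpf element, and the recordId child used for the lookup is a hypothetical stand-in for wherever the identifier really lives in the record.

```python
import copy
import xml.etree.ElementTree as ET


def merge_eac_cpf(trove_root, sru_root):
    """Return a copy of the Trove tree with matching eac-cpf records appended."""
    # Index the SRU records by identifier; the dict stands in for xsl:key.
    # ASSUMPTION: each record exposes its id in a 'recordId' child element.
    by_id = {}
    for record in sru_root.iter("eac-cpf"):
        record_id = record.findtext("recordId")
        if record_id is not None:
            by_id[record_id.strip()] = record

    # Deep copy = identity transform: everything else passes through unchanged.
    merged = copy.deepcopy(trove_root)
    # Materialise the list first so the tree is not modified while iterating.
    for person in list(merged.iter("people")):
        match = by_id.get(person.get("id"))
        if match is not None:
            person.append(copy.deepcopy(match))  # becomes the last child
    return merged
```

People elements with no matching record are copied through unchanged, which mirrors what an identity transform would do when the key lookup returns nothing.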

I've tested it with http://localhost:8080/proxy/v3/result?category=people&q=Mark&n=100 which requests the maximum number of Trove records we can get in one query, and hence generates the longest SRU query we can produce. It does indeed return a list of 100 people named "Mark", each including its eac-cpf record.

Conal-Tuohy commented 10 months ago

The eac-cpf inclusion does slow down processing and increase the size of the response (if responses include people, at least), and one question in my mind is whether we want to make this transclusion functionality optional, i.e. do we want people to be able to query for people records and not have the eac-cpf data transcluded? We could require a proxy-include-eac-cpf=true parameter to make it happen, or a proxy-include-eac-cpf=false to make it not happen.

Something to think about and discuss, both for this particular case but also more generally in terms of what kind of controls we want to give the proxy user.
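For discussion purposes, the two options differ only in the default. A small Python sketch of the decision might look like the following; the function name is hypothetical, and treating any value other than "true" as false is an assumption rather than anything settled here.

```python
from urllib.parse import parse_qs

# Parameter name as proposed in the comment above.
PARAM = "proxy-include-eac-cpf"


def include_eac_cpf(query_string, default=True):
    """Decide whether to transclude eac-cpf for a raw request query string."""
    values = parse_qs(query_string).get(PARAM)
    if not values:
        # Parameter absent (or blank): fall back to the chosen default.
        return default
    # ASSUMPTION: only the literal "true" (case-insensitive) switches it on.
    return values[-1].strip().lower() == "true"
```

Choosing default=True corresponds to the opt-out design (proxy-include-eac-cpf=false turns it off), while default=False corresponds to the opt-in design.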

Conal-Tuohy commented 10 months ago

I'm going to close this now (feel free to reopen if you find a bug), and I'll add a separate issue for the question about making the eac-cpf inclusion optional.

Conal-Tuohy commented 10 months ago

A glitch in the pipeline means that the eac-cpf record isn't imported when accessing an individual record (in fact, an error is returned).

Conal-Tuohy commented 10 months ago

The stylesheet that merged the srw and the v3 Trove API responses was only prepared to accept Trove search results, not individual records. Fixed in 29fc077.