dazza-codes / marc2linkeddata

Utilities for translating MARC21 into linked data
Apache License 2.0

Use cache expiry #1

Open dazza-codes opened 9 years ago

dazza-codes commented 9 years ago

When storing retrieved RDF in a local cache, try to add a triple that represents any cache expiry headers from the original source of the RDF (LOC, VIAF, ISNI, OCLC, etc.). Create a cron job process that can query the local cache to extract entities with cache-control data that has expired, then use a background queue that runs processes with a 'nice' priority to retrieve and update the expired RDF.

If MongoDB is the local cache, create an index on the cache-control data. Where possible, encourage RDF providers to use cache-control headers with reasonable values that correspond to their routine update cycles (if they are routine). If an entire repository has the same expiry, it might be readily stored in only a few triples.
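
The expiry query described above could be sketched roughly as follows. This is only an illustration: the in-memory `cache` dict stands in for the real local cache (MongoDB or otherwise), and the `expires_at` field and `expired_uris` function are hypothetical names, not part of marc2linkeddata.

```python
from datetime import datetime, timezone

# Hypothetical stand-in for the local cache: each record keeps the RDF payload
# plus an `expires_at` timestamp derived from the source's cache headers.
cache = {
    "http://id.loc.gov/authorities/names/n79044798": {
        "rdf": "<placeholder RDF>",
        "expires_at": datetime(2015, 2, 19, 16, 31, 12, tzinfo=timezone.utc),
    },
    "http://id.loc.gov/authorities/subjects/sh85000399": {
        "rdf": "<placeholder RDF>",
        "expires_at": datetime(2015, 2, 28, 5, 27, 3, tzinfo=timezone.utc),
    },
}


def expired_uris(cache, now):
    """Return the URIs whose cached RDF has passed its expiry, oldest first."""
    stale = [
        (record["expires_at"], uri)
        for uri, record in cache.items()
        if record["expires_at"] <= now
    ]
    return [uri for _, uri in sorted(stale)]


# A cron job could hand this list to a low-priority background queue.
print(expired_uris(cache, datetime(2015, 2, 20, tzinfo=timezone.utc)))
```

With a real database, the same filter would be a range query on an indexed expiry field.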

dazza-codes commented 9 years ago

See also

dazza-codes commented 9 years ago

For example, a HEAD request on this LOC authority, http://id.loc.gov/authorities/names/n79044798, returns these response headers:

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Cache-Control: public, max-age=43200
X-PrefLabel: Byrnes, Christopher I., 1949-
X-URI: http://id.loc.gov/authorities/names/n79044798
Server: Apache
Accept-Ranges: bytes
Date: Thu, 19 Feb 2015 04:31:12 GMT
X-Varnish: 1233935832
Age: 0
Via: 1.1 varnish
Connection: keep-alive

The useful data for caching here is Cache-Control: public, max-age=43200 and the Date: Thu, 19 Feb 2015 04:31:12 GMT. In this case, the data may be cached for 12 hours after the Date; beyond that time, the cached data may be stale.
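
As a minimal sketch of that calculation, the expiry time can be derived by adding max-age to the Date header. The `cache_expiry` function name is made up for illustration, and the sketch ignores the Age header (0 in this response) and any other Cache-Control directives.

```python
import re
from datetime import timedelta
from email.utils import parsedate_to_datetime


def cache_expiry(headers):
    """Return the datetime after which the response is stale, or None.

    Simplification: only `max-age` and `Date` are considered.
    """
    match = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
    if match is None or "Date" not in headers:
        return None
    fetched_at = parsedate_to_datetime(headers["Date"])
    return fetched_at + timedelta(seconds=int(match.group(1)))


headers = {
    "Cache-Control": "public, max-age=43200",
    "Date": "Thu, 19 Feb 2015 04:31:12 GMT",
}
print(cache_expiry(headers))
```

A value like this is what could be stored alongside the cached RDF as its expiry.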

In theory, the max-age should correspond to the frequency with which LOC updates this data on their servers. In practice, they don't want to manage these details, so the policy is to have the data cached only for 12 hours so that whenever it is updated, the caches will be refreshed within a day.

It may be preferable for LOC to set the Expires header with a date that corresponds to their next scheduled update for the data; see https://devcenter.heroku.com/articles/increasing-application-performance-with-http-cache-headers#expires

Another option: if LOC sets the Last-Modified header, a conditional request can be issued using an If-Modified-Since request header that has the date the data was last cached. If the response code is 304, there is no new data available (so use the cached data); see https://devcenter.heroku.com/articles/increasing-application-performance-with-http-cache-headers#time-based

Another option: LOC could set the ETag (or Entity Tag) header, which works in a similar way to the Last-Modified header except its value is a digest of the resource's contents (for instance, an MD5 hash). Then a conditional request can use the If-None-Match request header with the ETag value of the cached data. If the response code is 304, there is no new data available (so use the cached data); see https://devcenter.heroku.com/articles/increasing-application-performance-with-http-cache-headers#content-based
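
The two conditional-request options might be sketched like this. The function names and the ETag value are hypothetical; the resulting dict would be passed to whatever HTTP client is in use.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build request headers that ask the server for a 304 if nothing changed.

    `etag` and `last_modified` are the values saved when the data was cached.
    """
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def use_cached_copy(status_code):
    """A 304 (Not Modified) response means the cached data is still current."""
    return status_code == 304


print(conditional_headers(
    etag="example-etag-value",  # placeholder, not a real LOC ETag
    last_modified="Thu, 19 Feb 2015 04:31:12 GMT",
))
```

If the server honours neither header, it simply returns 200 with a full body, so the client falls back to replacing the cached copy.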

Note on varnish headers: All requests in varnish are assigned an XID number; the X-Varnish header tells you what it is and, if a cache hit was involved, also the XID of the transaction that put the object in the cache. The Age header tells how long a particular object has been in varnish's cache.

dazza-codes commented 9 years ago

There is additional administration metadata in the RDF response; e.g. consider the RDF body in a GET request for http://id.loc.gov/authorities/names/n79044798.rdf

However, the GET request incurs more work on the server to construct the body, on the network to deliver the packets, and on the client to parse the content. In addition, in this case, none of the information in the body is designed to support caching.

SKOS metadata:

<skos:changeNote xmlns:skos="http://www.w3.org/2004/02/skos/core#">
    <cs:ChangeSet xmlns:cs="http://purl.org/vocab/changeset/schema#">
        <cs:subjectOfChange rdf:resource="http://id.loc.gov/authorities/names/n79044798"/>
        <cs:creatorName rdf:resource="http://id.loc.gov/vocabulary/organizations/dlc"/>
        <cs:createdDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">1979-05-22T00:00:00</cs:createdDate>
        <cs:changeReason rdf:datatype="http://www.w3.org/2001/XMLSchema#string">new</cs:changeReason>
    </cs:ChangeSet>
</skos:changeNote>
<skos:changeNote xmlns:skos="http://www.w3.org/2004/02/skos/core#">
    <cs:ChangeSet xmlns:cs="http://purl.org/vocab/changeset/schema#">
        <cs:subjectOfChange rdf:resource="http://id.loc.gov/authorities/names/n79044798"/>
        <cs:creatorName rdf:resource="http://id.loc.gov/vocabulary/organizations/dlc"/>
        <cs:createdDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-12-08T08:21:09</cs:createdDate>
        <cs:changeReason rdf:datatype="http://www.w3.org/2001/XMLSchema#string">revised</cs:changeReason>
    </cs:ChangeSet>
</skos:changeNote>

MADS-RDF metadata:

<madsrdf:adminMetadata>
    <ri:RecordInfo xmlns:ri="http://id.loc.gov/ontologies/RecordInfo#">
        <ri:recordChangeDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">1979-05-22T00:00:00</ri:recordChangeDate>
        <ri:recordStatus rdf:datatype="http://www.w3.org/2001/XMLSchema#string">new</ri:recordStatus>
        <ri:recordContentSource rdf:resource="http://id.loc.gov/vocabulary/organizations/dlc"/>
        <ri:languageOfCataloging rdf:resource="http://id.loc.gov/vocabulary/iso639-2/eng"/>
    </ri:RecordInfo>
</madsrdf:adminMetadata>
<madsrdf:adminMetadata>
    <ri:RecordInfo xmlns:ri="http://id.loc.gov/ontologies/RecordInfo#">
        <ri:recordChangeDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-12-08T08:21:09</ri:recordChangeDate>
        <ri:recordStatus rdf:datatype="http://www.w3.org/2001/XMLSchema#string">revised</ri:recordStatus>
        <ri:recordContentSource rdf:resource="http://id.loc.gov/vocabulary/organizations/dlc"/>
        <ri:languageOfCataloging rdf:resource="http://id.loc.gov/vocabulary/iso639-2/eng"/>
    </ri:RecordInfo>
</madsrdf:adminMetadata>
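
If this metadata were used as a stand-in for a Last-Modified value, the most recent change date could be extracted along these lines. This is a sketch only: the XML is a trimmed copy of the SKOS change notes above, wrapped in an `rdf:RDF` root so that it parses on its own, and `latest_change_date` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

NS = {"cs": "http://purl.org/vocab/changeset/schema#"}

# Trimmed copy of the change notes above, wrapped so it parses standalone.
RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:cs="http://purl.org/vocab/changeset/schema#">
  <skos:changeNote>
    <cs:ChangeSet>
      <cs:createdDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">1979-05-22T00:00:00</cs:createdDate>
      <cs:changeReason rdf:datatype="http://www.w3.org/2001/XMLSchema#string">new</cs:changeReason>
    </cs:ChangeSet>
  </skos:changeNote>
  <skos:changeNote>
    <cs:ChangeSet>
      <cs:createdDate rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-12-08T08:21:09</cs:createdDate>
      <cs:changeReason rdf:datatype="http://www.w3.org/2001/XMLSchema#string">revised</cs:changeReason>
    </cs:ChangeSet>
  </skos:changeNote>
</rdf:RDF>"""


def latest_change_date(rdf_xml):
    """Return the newest cs:createdDate in the document, or None if absent."""
    root = ET.fromstring(rdf_xml)
    dates = [
        datetime.fromisoformat(element.text)
        for element in root.iterfind(".//cs:createdDate", NS)
    ]
    return max(dates) if dates else None


print(latest_change_date(RDF_XML))
```

This still requires the full GET and parse, so it is a fallback rather than a replacement for proper cache headers.
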
dazza-codes commented 9 years ago

The headers on a VIAF resource appear to have no cache control, e.g. a HEAD or GET on http://viaf.org/viaf/108317368/rdf

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Location: rdf.xml
Content-Type: text/xml
Content-Length: 6780
Date: Thu, 19 Feb 2015 16:46:28 GMT

Also, there is no RDF metadata about the date of creation or revision for the RDF resource at VIAF.

dazza-codes commented 9 years ago

Notes on evaluating cache-control information.

First, this is relatively easy using Firefox with a plugin for HTTP resource testing, because it gives easy access to specify additional HTTP request types, like HEAD, and additional request headers. See [https://addons.mozilla.org/en-us/firefox/addon/http-resource-test/]

When this is installed in Firefox, it's available from the Tools > HTTP Resource Test menu. First, paste the example URI 'http://id.loc.gov/authorities/subjects/sh85000399' into the top left box (URI), then select the 'HEAD' request from the drop-down menu at the top right (Method). (Set no additional parameters in the 'Client Request' for the moment.) Then hit the submit button; the server response comes back in the 'Server Response' panels below, one for 'Headers' and one for the 'Body'.

Before reviewing the details of the response, let's be on the same page with regard to the meaning and purpose of a HEAD request. The HEAD request should retrieve only the response headers, without the body. The HEAD response for [http://id.loc.gov/authorities/subjects/sh85000399] does include a body; that MUST disappear. The spec is described here: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html

To quote that document (my italics and my emphasis in bold):

"The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request."

This is the HEAD header response from [http://id.loc.gov/authorities/subjects/sh85000399] (I'm not going to review the HEAD body response; it should not exist):

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Cache-Control: public, max-age=43200
Etag: 08918596b834994da64b993ff32d21c1
X-PrefLabel: Accordion music (Jazz)
X-URI: http://id.loc.gov/authorities/subjects/sh85000399
Server: Apache
Content-Length: 17973
Accept-Ranges: bytes
Date: Fri, 27 Feb 2015 17:27:03 GMT
X-Varnish: 814603380
Age: 0
Via: 1.1 varnish

I'll try to explain how I currently understand the useful cache-control values, what can be used, and what might be added. A quick caveat: my expertise in this area is limited, and an authoritative understanding should be gained from consulting the HTTP/1.1 standard (RFC 2616, hosted by the W3C); see mainly section 13 and especially section 14.9 at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html. I'm using this document because it's somewhat easier to navigate and read, but there are more recent updates to this spec (RFC 2616 was replaced by RFCs 7230-7235), although they might not yet be implemented in server/client code. Anyhow, I'm assuming we are working with HTTP/1.1 systems; specific sections are noted below.

So these related client cache control header requests have implications for how the server responds to new requests:

From the client perspective, an interesting problem is how and why to use one or another of the cache headers. If the concern is caching the data at the byte level, the MD5 or other hashes are most useful, so perhaps the Etag is great for that. If the concern is the information, regardless of serialization format and byte representation, perhaps the 'Last-Modified' header is more interesting.
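
The byte-level point can be illustrated with a small sketch. It assumes the ETag is an MD5-style digest of the response body, which LOC has not confirmed here. The two serializations below are hypothetical examples of the same statement; because the bytes differ, the digests differ, which would be consistent with the different ETag values that the HTML and RDF representations return elsewhere in this thread.

```python
import hashlib


def md5_digest(body: bytes) -> str:
    """Hex MD5 digest of a response body (an assumed ETag scheme)."""
    return hashlib.md5(body).hexdigest()


# Hypothetical serializations of one statement, in N-Triples and in Turtle.
ntriples = (
    b"<http://id.loc.gov/authorities/subjects/sh85000399> "
    b"<http://www.w3.org/2004/02/skos/core#prefLabel> "
    b'"Accordion music (Jazz)" .\n'
)
turtle = (
    b"@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    b"<http://id.loc.gov/authorities/subjects/sh85000399> "
    b'skos:prefLabel "Accordion music (Jazz)" .\n'
)

# Same information, different bytes, so the digests do not match.
print(md5_digest(ntriples) == md5_digest(turtle))
```

A client that caches one serialization can rely on the digest; a client that cares about the information across serializations cannot.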

From the server side, the load on the system might be easier if the system is able to calculate a hash value for each serialization once and deliver it from a db/cache store for every HEAD request (without calculating it on the fly every time). If the server system calculates the 'Content-MD5' or 'Etag' for every response, that's going to show up in slow server performance (esp. CPU thrashing to calculate hashes). A good server stack should have this issue resolved already with an optimized cache system. With regard to 'Last-Modified', this could be even easier on the server side, if the value is derived from a db/cache date value (which is likely updated infrequently for most authority information).

If a 'Last-Modified' value is available, it could be very useful to have documentation on exactly what it means and how it might change. That explanation might correspond to policy decisions and practices for updating content. From a systems perspective, it might also provide an opportunity to index the values and provide a periodic 'updated' API that is consistent with 'If-Modified-Since' requests. If that exists, it could be used to quickly identify only the records that have recently changed (or changed since a given date parameter). The idea here is that an 'updated' API might be more efficient than individual HEAD/GET requests on every resource (URI) to check all the 'If-Modified-Since' responses. (The context is some form of regular update that is neither an entire DB bulk update nor an individual resource update, but something like an update to a subset/collection of resources.)

A consideration in this regard is explicit control over the 'Expires' header for records that have changed (14.21 at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html). Similarly, the update issue also involves the 'If-Modified-Since' client header in a HEAD or GET request. The server can simply respond with a 304 (not modified) code; see 14.25 at http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html: "The If-Modified-Since request-header field is used with a method to make it conditional: if the requested variant has not been modified since the time specified in this field, an entity will not be returned from the server; instead, a 304 (not modified) response will be returned without any message-body. ... To get best results when sending an If-Modified-Since header field for cache validation, clients are advised to use the exact date string received in a previous Last-Modified header field whenever possible." So, this indicates the importance of the 'Last-Modified' date.

Bear in mind that we are on the verge of HTTP/2, so that will be interesting too! e.g. http://http2.github.io/faq/

dazza-codes commented 9 years ago

Notes on using curl, e.g.

$ curl -I http://id.loc.gov/authorities/subjects/sh85000399
HTTP/1.1 303 SEE OTHER
Location: http://id.loc.gov/authorities/subjects/sh85000399.html
Vary: Accept
X-URI: http://id.loc.gov/authorities/subjects/sh85000399
X-PrefLabel: Accordion music (Jazz)
Server: Apache
Content-Length: 0
Accept-Ranges: bytes
Date: Sun, 08 Mar 2015 15:40:45 GMT
X-Varnish: 817484216
Age: 0
Via: 1.1 varnish
Connection: keep-alive

curl will not automatically follow a redirect (303), unless the -L option is used. So it can be helpful to see the details of redirections. For this example, it redirects to an HTML page, but we want the RDF, e.g.

$ curl -I http://id.loc.gov/authorities/subjects/sh85000399.rdf
HTTP/1.1 200 OK
Content-type: application/rdf+xml
Cache-Control: public, max-age=43200
ETag: a906603072f5c988349b027364a6ef43
X-URI: http://id.loc.gov/authorities/subjects/sh85000399
Server: Apache
Content-Length: 5411
Accept-Ranges: bytes
Date: Sun, 08 Mar 2015 15:43:13 GMT
X-Varnish: 817485163
Age: 0
Via: 1.1 varnish
Connection: keep-alive
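
Captured output like this can be turned into cache metadata with a short script. The following is a sketch: `parse_head_output` is a made-up helper, and it assumes the text has the shape printed by `curl -I`, with a status line followed by header lines.

```python
from email.parser import Parser


def parse_head_output(raw):
    """Split `curl -I` style output into a status code and a header mapping."""
    status_line, _, header_block = raw.strip().partition("\n")
    status_code = int(status_line.split()[1])
    # The email parser handles "Name: value" lines; lookups ignore case.
    headers = Parser().parsestr(header_block, headersonly=True)
    return status_code, headers


raw = """HTTP/1.1 200 OK
Content-type: application/rdf+xml
Cache-Control: public, max-age=43200
ETag: a906603072f5c988349b027364a6ef43
Date: Sun, 08 Mar 2015 15:43:13 GMT
"""

status, headers = parse_head_output(raw)
print(status, headers["ETag"], headers["Cache-Control"])
```

For a 303 response, `headers["Location"]` gives the redirect target to request next.
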
dazza-codes commented 9 years ago

At the client side, this might be partially solved by using https://github.com/crohr/rest-client-components; however, the longer-term caching may require noting cache-control data in RDF provenance details of some kind.