ActiveTriples / linked-data-fragments

Basic linked data fragments endpoint.

Add configurable cache entry expiration #31

elrayle opened this issue 7 years ago (status: Open)

elrayle commented 7 years ago

I am exploring adding cache entry expiration. I would like to get feedback from those using linked-data-fragments for caching.

Approach:

TimeToLive configuration - There will be a global TimeToLive value that serves as a default. A TimeToLive interval can also be defined per host; the default is used when the current URI's host has no host-specific TimeToLive configured.

ExpirationDT for a URI - Each cached URI will have an extra triple added to identify the date-time on which the cached entry for the URI expires and becomes invalid. ExpirationDT = date_retrieved + TimeToLive(URI_host)
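
Both pieces above can be illustrated with a short Ruby sketch. The configuration hash, option names, and hosts below are hypothetical, not part of this gem:

    require 'uri'

    # Hypothetical configuration: a global default TTL plus per-host overrides.
    CACHE_CONFIG = {
      default_ttl: 30 * 24 * 60 * 60,             # 30 days, in seconds
      host_ttl: {
        'id.loc.gov'        => 90 * 24 * 60 * 60, # rarely changes: 90 days
        'vocab.example.org' =>  7 * 24 * 60 * 60  # changes often: 7 days
      }
    }

    # TimeToLive(URI_host): the host-specific TTL, else the global default.
    def time_to_live(uri)
      host = URI(uri).host
      CACHE_CONFIG[:host_ttl].fetch(host, CACHE_CONFIG[:default_ttl])
    end

    # ExpirationDT = date_retrieved + TimeToLive(URI_host)
    def expiration_dt(uri, date_retrieved = Time.now.utc)
      date_retrieved + time_to_live(uri)
    end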

Modifications to Retrieval Algorithm

    if ExpirationDT < now, attempt to get from source
        if success
            update URI's cached value
            reset ExpirationDT
            return updated value
        if source unavailable
            do NOT adjust ExpirationDT
            log host out of service
            return stale value from cache
    else
        return value from cache
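
A hedged Ruby sketch of the algorithm above, building on the earlier configuration sketch; the cache structure, fetch_from_source, and SourceUnavailableError are stand-ins, not this gem's API:

    require 'uri'

    SourceUnavailableError = Class.new(StandardError)

    # `cache` maps a URI string to { value:, expiration_dt: };
    # `fetch_from_source` stands in for the actual upstream request.
    def retrieve(uri, cache:)
      entry = cache[uri]

      # Fresh entry: serve straight from the cache.
      return entry[:value] if entry && Time.now.utc < entry[:expiration_dt]

      begin
        value = fetch_from_source(uri)
        cache[uri] = { value: value, expiration_dt: expiration_dt(uri) }
        value
      rescue SourceUnavailableError
        # Do NOT adjust ExpirationDT; log and serve the stale entry.
        warn "host out of service: #{URI(uri).host}"
        entry && entry[:value]
      end
    end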

Predicate for ExpirationDT - I have not found a predicate that matches the concept exactly. The closest I have at the moment is http://vivoweb.org/ontology/core#expirationDate. I am open to suggestions for an alternate predicate.

Other additions that could be part of this work.

Optional ForceRecache - The retrieve method could take a parameter that lets the caller request that the URI's cache entry be refreshed from the source. What would you want returned if the host is out of service?

LastModifiedDT - Add a new triple that holds the LastModifiedDT for the cache.

Thoughts on predicate choices: I am somewhat hesitant to use existing predicates that aren't cache specific. If the cached URI happens to use the same predicates, its values would get clobbered by the cache-added predicates. I'd like to see predicates such as cache_expiration_dt and cache_last_modified_dt, and possibly cache_create_dt. Other thoughts?

Please comment on this approach when you can; I plan to begin work as soon as I get feedback.

hackartisan commented 7 years ago

Can you give an example of the extra triple?

anarchivist commented 7 years ago

Can you talk more about the motivation for adding this as a triple? I'm concerned about the potential impact here. Is there an assumption that this triple would be included in the serialized representation?

acoburn commented 7 years ago

The way Marmotta handles this is as follows: triples from the resource are cached in one location, and metadata about those cached triples (e.g. when the triples were retrieved) is stored separately. This way, it is possible to configure a TTL globally (or per endpoint) without mixing the metadata with the triples from the resource. Marmotta also does not use RDF to store that metadata, nor is there any inherent need to do so. As an example, Marmotta's file-based cache looks like this:

    http://localhost:8080/fcrepo/rest/test
    1470333573969 # last retrieved: 2016-08-04 13:59:33.969
    1470419973969 # expires: 2016-08-05 13:59:33.969
    1 # 1 update
    0 # 0 triples

This way, you don't mix the resource triples and the metadata about those triples; nor do you run into namespace clashes between that metadata and the triples themselves.
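
For backends without Marmotta's built-in cache, the same separation could be approximated with two stores keyed by URI. A minimal Ruby sketch mirroring the file format above (the structures are illustrative only):

    # Resource triples and cache metadata live in separate stores, so the
    # cached response is never mixed with bookkeeping data.
    triple_store   = {} # uri => triples, exactly as retrieved
    metadata_store = {} # uri => cache bookkeeping

    uri = 'http://localhost:8080/fcrepo/rest/test'
    metadata_store[uri] = {
      last_retrieved: Time.at(1_470_333_573.969), # 2016-08-04 13:59:33.969
      expires:        Time.at(1_470_419_973.969), # 2016-08-05 13:59:33.969
      updates:        1,
      triples:        0
    }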

elrayle commented 7 years ago

@acoburn Thanks for that info. I am new to Marmotta and linked data fragments, so pointers are appreciated.

My only concern is that this gem also provides caching in Blazegraph. My approach would have to be compatible with Marmotta and Blazegraph (and other potential repositories).

elrayle commented 7 years ago

@hackmastera

    <subject_URI> <http://vivoweb.org/ontology/core#expirationDate> "2016-11-02T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

tpendragon commented 7 years ago

Considering the use case here is effectively a reverse proxy cache for an external RDFSource, I'm :-1: on triples attached to the subject URI for configuration of the caching system. I might be able to be convinced that a second URI (or maybe a named graph, though I'm iffy there too) included in the response could carry that.

The other thing is there's no way to update triples in this gem now, sort of on purpose. You'd need that functionality to add the expiration triple, yes? The use cases were always simple: cache external responses and provide information about that cache. I think there's a benefit to keeping it that way.

Global TTL seems like a good config option. The Marmotta backend's always had it, but surfacing it in this layer is a good thing for those backends that don't have caching built in.

elrayle commented 7 years ago

@anarchivist The motivation is that some authorities change the display string, and potentially other triple values, associated with a controlled vocabulary term. If you capture the triples associated with a URI once and never update them, you will end up using stale cache values. A configurable TimeToLive value lets you invalidate subject URIs in the cache, forcing a refresh from the original source. Making TimeToLive configurable per host allows a more flexible approach to cache refresh: an authority that rarely modifies its data can have a longer TimeToLive setting than one that modifies data frequently.

tpendragon commented 7 years ago

I will say, I have a feeling that most users of this don't want hard expirations; they want something like periodic updates from upstream. If the remote source goes down, you want your cache to keep working even if the TTL has expired.

tpendragon commented 7 years ago

(They might not even want AUTOMATIC periodic updates - I've heard concerns about data drift in remote sources before, but maybe that's a second product which mints a sameAs URI for temporal locks)

elrayle commented 7 years ago

When a subject expires, the proposal is to attempt to get it from the source and, if that is unsuccessful (i.e. the server is down), to fall back to the cached value.

tpendragon commented 7 years ago

@elrayle Yeah, I think I could agree if the workflow was more like "if TTL was past, queue up a refresh in the background and serve up a response QUICKLY anyways, with a header saying it's stale", then have some method to block the response while waiting for a cache update.
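
A hedged sketch of that stale-while-revalidate flow; RefreshCacheJob and the X-Cache-Stale header are assumptions, not existing features of this gem:

    # Serve the cached value immediately; if it has expired, queue a
    # background refresh and flag the response as stale.
    def retrieve_stale_while_revalidate(uri, cache:, headers:)
      entry = cache[uri]
      if entry && Time.now.utc >= entry[:expiration_dt]
        RefreshCacheJob.perform_later(uri)  # e.g. an ActiveJob subclass
        headers['X-Cache-Stale'] = 'true'   # tell the caller it is stale
      end
      entry && entry[:value]
    end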

elrayle commented 7 years ago

One piece I left out of the proposal that we were discussing locally is having something like a cron job that crawls the cache at night and attempts a refresh on expired subject_URIs.

BTW, I like the idea of using a named graph to hold expiration dates. That avoids potential conflict with the cached data.
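
With the rdf gem, the named-graph idea might look like the sketch below; the graph name and predicate are illustrative, not settled choices:

    require 'rdf'

    subject    = RDF::URI('http://id.example.org/term/123')
    predicate  = RDF::URI('http://example.org/cache#expirationDT')
    graph_name = RDF::URI('http://example.org/cache#metadata')

    repo = RDF::Repository.new
    # The expiration statement lives in its own named graph, so queries
    # against the resource's default graph never see cache metadata.
    repo << RDF::Statement.new(
      subject, predicate,
      RDF::Literal::DateTime.new('2016-11-02T00:00:00Z'),
      graph_name: graph_name
    )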

tpendragon commented 7 years ago

Basically my use cases around TTL are these:

  1. It needs to be easy for me to get a fast response from already-cached triples even if the remote source is down or slow, no matter what my TTL is.
  2. I need to easily be able to get the exact response I would have gotten from the source, with no modifications.
  3. I need a way to force a refresh of one URI and wait on it, and have some indication that the forced refresh succeeded. (Maybe this is just a header asking for a modified date past the last time it was refreshed? I dunno)

Anything which solves those three use cases I'm :+1: for.

acoburn commented 7 years ago

@elrayle You may want to take a look at the Marmotta LDCache interface for inspiration.

In particular, the get(URI, RefreshOpts) method accepts both a URI and a RefreshOpts value that determines exactly how to handle stale entries.
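
Translated to this gem, that could look something like the sketch below; the method, option names, and helper calls are hypothetical:

    # Hypothetical analogue of LDCache's get(URI, RefreshOpts): the caller
    # chooses how stale entries are handled.
    def get(uri, refresh: :if_expired)
      case refresh
      when :never      then cached_value(uri)            # serve whatever is cached
      when :if_expired then retrieve(uri, cache: @cache) # default TTL behavior
      when :force      then refresh_from_source!(uri)    # the ForceRecache idea
      end
    end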

elrayle commented 7 years ago

@tpendragon 1 and 3 make sense to me. Can you expand on 2? I think I know what you mean, but want to be sure.

elrayle commented 7 years ago

I would be fine with @tpendragon's suggestion for a modification to the retrieval algorithm...

    retrieve from cache
    if ExpirationDT < now, start background job to get from source and update ExpirationDT
    return value from cache

hackartisan commented 7 years ago

I agree with the direction of this discussion. The example triple confuses me because it appears to conflate a real-world object with its URI representation. If the subject URI is some name authority, you would essentially be asserting that that person has an expiration date. It seems like some form of reification would solve that problem, though I'm not sure exactly what that would need to look like.
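
One way to avoid that conflation, lighter than full rdf:Statement reification, is to hang the metadata off a separate cache-entry resource that points at the subject URI. A hedged Turtle-style sketch (all names illustrative):

    @prefix cache: <http://example.org/cache#> .
    @prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .

    # The expiration is asserted about the cache entry, not the person.
    _:entry cache:about        <subject_URI> ;
            cache:expirationDT "2016-11-02T00:00:00Z"^^xsd:dateTime .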

elrayle commented 7 years ago

@hackmastera I see your point. I think it would be easy to avoid the triple in Marmotta based on the feedback from @acoburn. Blazegraph and other repository implementations may be more challenging.

I would be less concerned with the conflation with a real-world-object if the predicate were better named, e.g. cache_expiration_dt.

Based on feedback, for triplestore implementations, I propose...

For Marmotta, I would use its internal caching mechanism.