Configure Cache Control

no-reply commented 7 years ago

From https://github.com/curationexperts/chf-sufia/issues/27#issuecomment-277555860:

Approaches for cache control:

Use a marmotta-like configuration as started in: https://github.com/ActiveTriples/linked-data-fragments/tree/cache_config/config

How can we support this kind of interface for other backends?

Does marmotta's approach meet our requirements?

Use HTTP Cache-Control headers.

We could probably support this in its entirety with: https://github.com/rtomayko/rack-cache

Is this too much client control?

Create a static caching strategy that clients don't have control over.

Will this paint us into a corner?

Is there a one-size fits all caching strategy we can define?

In any of these three cases, questions that need to be answered include:

What are the pre-warming needs?

Do search needs for an authority require a full pre-load of that authority?

Or can search be dependant on an external service?

What happens when the cache "invalidates"?

do we clear data from the backend,

or simply require re-fetch on the next retrieval if the remote is available?

What is an acceptable TTL?

Does the client ever need to manually invalidate the cache?

I think that's the general shape of the problem.

no-reply commented 7 years ago

A first hack at answering some of these:

A static caching strategy is probably the best option (for me, given current project constraints).
- The architecture should allow us to move in the direction of HTTP Cache-Control if more client control is desired.
- I'm worried about a marmotta-specific approach and its implications for the currently swappable backends. While i'm a big :+1: on leaning on Marmotta for it's caching backend, I'd prefer to see it work within the constraints of an interface that we think we can support for the general case.
Cache invalidation should just trigger a refetch--on next retrieval or in a background job.
TTL is probably very high. Weeks?
I don't think the client needs to invalidate; at least not in a minimal implementation.

I don't have any answers about pre-warming. cc: @hackmastera.

hackartisan commented 7 years ago

@no-reply your answers sound good to me.

+1 to backgrounding the refetch.
Client invalidation seems like it could be useful but not required upfront.
At CHF we have agreed we will rely on the external service for search; i.e. availability and retrieval time are less of a concern at cataloging than at display. I am a little nervous about the possibility of multi-day downtime, though, especially w/r/t LC. It would be good to spell out workarounds for getting cataloging done in this scenario: i.e., use a local authority as a temporary stop-gap until the service comes back up and the URIs can be filled in / cached?

no-reply commented 7 years ago

@hackmastera:

At CHF we have agreed we will rely on the external service for search; i.e. availability and retrieval time are less of a concern at cataloging than at display. I am a little nervous about the possibility of multi-day downtime, though, especially w/r/t LC.

:+1: The way I'm thinking of this is that the qa-ldf bridge will support a default search drawing only from items already in the cache. For smaller vocabularies, we would have the option to handle search entirely internally via pre-warming. For larger datasets we would lean on the external service, but have the option to provide search over the cached items during downtime.

I think for most users with a mature repository, having the capacity to search the cache would be enough to keep cataloging moving. Other work arounds could be discussed, but I'm thinking they are beyond project scope.

hackartisan commented 7 years ago

I think for most users with a mature repository, having the capacity to search the cache would be enough to keep cataloging moving. Other work arounds could be discussed, but I'm thinking they are beyond project scope.

I'm not sure about this assumption. For example, an archive moves from collection to collection over time and each collection will require a new set of vocabulary terms, even if they are within broadly related areas. Especially in the realm of personal names, but subjects as well.

But I guess if you assume that cataloging on an entirely new collection is relatively uncommon perhaps it holds up. maybe @catlu could weigh in.

no-reply commented 7 years ago

But I guess if you assume that cataloging on an entirely new collection is relatively uncommon perhaps it holds up.

Yeah, this is basically my assumption. Or at least that this is true enough to be of value. Upstream outages may be frustrating, and prevent folks from working on their highest priorities, but at least it wouldn't necessarily halt work altogether.

ActiveTriples / linked-data-fragments

Configure Cache Control #44