New .is.cached() function

KentonWhite / ProjectTemplate

A template utility for R projects that provides a skeletal project.

http://projecttemplate.net

GNU General Public License v3.0

623 stars 159 forks source link

New .is.cached() function #185

Closed connectedblue closed 7 years ago

connectedblue commented 7 years ago

This issue surfaced during a review of #183.

The core requirement was to have a function called .is.cached(var) which on first glance looks like a simple one-liner to check for the existence of the appropriate .RData file.

However, there are deeper issues to consider with an is.cached function. For example:

should the existence of both .RData and .hash be checked, and only declared TRUE is both are present?
If .is.cached detects one missing, should it delete the other and report FALSE to maintain consistency?
should .is.cached be called by the cache loading part of load.project?
what do other calling functions need?
Should .is.cached remain only an internal function, or are there cases where the user needs to test for it?

Caching an object has quite a few rules associated with it, and there are quite a few ways for the cache to be corrupt and not do it's job (if .hash files got deleted, or got out of step with the in memory version of the .RData value). We need to be sure that is.cached is doing a reliable job.

I suggest a community discussion of this before a PR is raised.

Hugovdberg commented 7 years ago

I think .is.cached should report whether the variable can be loaded from cache without loading the raw data, but it should not have any side effects. So it should check validity of the cache (do both files exist and is the hash up to date), but not fix any invalid cached data. It would be nice to have a clean.cache function (not to be confused with clear.cache so perhaps it should have another name), that removes all orphaned caches for variables no longer in data/ and all incomplete caches.

Regarding load.project: Commit 646c10e includes the list.data function that reports the variables in the data/ directory, whether it is cached, ignored, and which reader will be used. It partially replicates the logic in .load.data without actually loading the data, with the intention to refactor .load.data and .load.cache as discussed in the review of #153. This is quite easy as list.data returns a data.frame with all inputs to the reader functions plus some metadata to determine which files should be loaded. Therefore I think .is.cached would be called indirectly.

Many of our functions have a lot of (nested) for loops that can be written more clearly and efficiently using vectorisation. I think most if not all of the auxiliary functions should accept vectors of variable names. .is.cached would in my opinion definitely be a function should solve the problem of multiple variables internally. file.exists can be called with a character vector, so perhaps only the hash checking should be written in a loop.

connectedblue commented 7 years ago

There is a danger that this becomes a solution looking for a problem.

There is no rule that says cached items also have to be in data. I quite often use cache() calls during munge to create derived data sets which I want to load fast during load.project(). And if we do need a clean.cache function, who calls it? It shouldn't be the user. Is is called during load.project()?

I think there's a case for is.cached doing a clean up. If var.RData and var.hash both exist then clearly the answer is TRUE. If neither exist, then clearly the answer is FALSE. If only one exists, then what should be returned? It shouldn't be TRUE because, well, it's not true. And if FALSE then it may as well delete the orphan file because nothing else can be done with it.

But ... like I said, this is all a bit of navel gazing. I don't have a clear view of the problem that needs solving to know what the right solution is

Hugovdberg commented 7 years ago

I realise just now there can indeed be cached variables not directly tied to data files. That's a thing I'll fix in the list.data function. But still I think any function called .is.* should purely report a status and not have any side effects. It should indeed report FALSE if not both files exist, but I wouldn't expect it to delete anything, or at least not by default. I don't agree that clean.cache shouldn't be called by the user. Any function that potentially destroys data should be called explicitly. Any function which name implies it only reads things and reports a status should not do something else without the user telling it explicitly to do something else. Perhaps it shouldn't be a separate function, but a newer version of clear.cache with a new a new flag clear_outdated that tells it to clean up invalidated cache files. That's a function I'd expect to remove files, not .is.cached or list.data or load.project. But I agree with you that we shouldn't write functions just to write functions, but to solve problems and manage expectations.

The reason I want a function to check whether a file is cached is that it would make sense that variables that are ignored in the data/ directory are ignored in cache/ as well. Instead of doing all the checking twice it would be better to build a list of all variables and determine which should be loaded just once. The list.data function does exactly this, in a way that can be used both for debugging ignore patterns by the user as well as for loading the data by load.project.

connectedblue commented 7 years ago

The reason I wrote the extensions to cache() was that I was finding that I was writing my own cache handling functions in munge scripts in order to speed up load.project() when developing scripts. Afterwards the extensions for load.project(), clear.cache and so on built on this idea to allow the user to have speed when they need (by not recalculating over and over), but also the flexibility to start over and re-compute/re-load from scratch if the underlying data/rules have changed.

So, I'm all for tools which allow the user to develop analysis quicker using cached data. However, cache management itself is not something the user should ever have to care about - they should never need to know how it works or need to fix inconsistent cache data. So, if we want to ensure a cache is kept consistently, we need to do it as part of load.project() or cache because they are the two functions which read and write the cache (I know clear.cache also writes the cache, but we can't be sure the user will need to run that - that's more of an optional function).

Whichever internal functions handle the cache consistency checking, I don't really mind too much, as long as it's well documented for developers to understand, and it has good test cases associated with it.

KentonWhite commented 7 years ago

There are only two hard things in Computer Science: cache invalidation and naming things.

-- Phil Karlton

Dealing with a corrupt cache is an important issue. It may only happen occasionally, but when it does happen we want to recover gracefully and correctly.

I would take the Unix approach here. When we detect an invalid cache, give the user a list of options to chose from to recover. These can include:

attempting the repair the cache for example by adding a .RData file
removing the cached item entirely
ignoring the problem for now
others I'm missing

connectedblue commented 7 years ago

Yes, I remember seeing that quote somewhere else on another issue. I'm beginning to feel its resonance.

So, the debate started about a particular function and led to how to keep integrity of the cache.

I'll raise a PR to add cache validity checking to load.project, and I'm sure somewhere we'll also end up with an is.cached() function that can be used reliably for multiple purposes.

Hugovdberg commented 7 years ago

I think I'm with @KentonWhite on providing the user with options to recover the cache if invalid. Would it be possible for us to collaborate on rewriting the load.project? I have some ideas on how "my" list.data function could help making the loading process more efficient, and it doesn't make sense to rewrite this logic twice.

connectedblue commented 7 years ago

You go ahead @Hugovdberg. I actually would rather work on the custom templates feature, and I don't really want to get distracted.

Let me know if you have any questions about the cache model - hopefully the code is documented well enough.