KentonWhite / ProjectTemplate

A template utility for R projects that provides a skeletal project.
http://projecttemplate.net
GNU General Public License v3.0

Rebuild cache if the underlying data changed #276

Open Hugovdberg opened 5 years ago

Hugovdberg commented 5 years ago

Report an Issue / Request a Feature

Expected Behavior

When a file in data/ is changed, the file should be reloaded and the cache rebuilt, even if the resulting variable already exists in the cache.

Current Behavior

Currently caching of the data is only done after the variable is loaded into memory, and cached variables are not reloaded if the original file was changed.

Possible Solution

Update the cache function to also accept a file argument, similar to the depends argument. If the digest of the file has changed, reload the file and rebuild the cache. This could be done inside the reader as follows (using the 1.0 reader signature):

csv.reader <- function(file.name, variable.name, ...) {
    cache(variable.name,
          CODE = {
              read.csv(file.name, ...)
          },
          file = file.name
    )
}

This way, assigning the variable in the global environment is left to cache, the CODE argument is evaluated inside the cache function as usual, and the cached variable is only updated if the dependency in the file argument has changed.
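To make the idea concrete, here is a minimal sketch of the mechanism in base R. It is not ProjectTemplate's cache(): the helper name cache_with_file, the in-memory cache_store environment (standing in for the cache/ directory), and the use of tools::md5sum instead of the digest package are all assumptions made for illustration.

```r
# Hypothetical helper, not part of ProjectTemplate. An environment stands in
# for the cache/ directory and keeps each value next to the file hash it came from.
cache_store <- new.env()

cache_with_file <- function(variable, CODE, file,
                            store = cache_store, envir = globalenv()) {
  # tools::md5sum ships with base R and hashes the file contents
  current_hash <- unname(tools::md5sum(file))
  entry <- store[[variable]]
  rebuild <- is.null(entry) || !identical(entry$hash, current_hash)
  if (rebuild) {
    # CODE is a lazily evaluated argument; it only runs when forced here
    store[[variable]] <- list(hash = current_hash, value = force(CODE))
  }
  # Assigning the variable is left to the cache helper, as proposed above
  assign(variable, store[[variable]]$value, envir = envir)
  invisible(rebuild)
}

# Usage with a throwaway CSV file
csv_file <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), csv_file, row.names = FALSE)

first  <- cache_with_file("demo_data", read.csv(csv_file), file = csv_file)  # file is read
second <- cache_with_file("demo_data", read.csv(csv_file), file = csv_file)  # cached value reused

write.csv(data.frame(x = 1:5), csv_file, row.names = FALSE)
third  <- cache_with_file("demo_data", read.csv(csv_file), file = csv_file)  # hash changed, file reread
```

Because R evaluates arguments lazily, read.csv() is only executed when the stored hash is missing or differs, which is the behaviour the proposal relies on.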

How do you guys feel about this?

KentonWhite commented 5 years ago

I like the idea of the cache being able to tell whether the file has changed. This should make the workflow easier. The only edge case I can see is researchers working with unstable data who use the cache to capture the particular state they are working with now.

Hugovdberg commented 5 years ago

That workflow could actually be improved by this change: if you cache the files once and then set data_loading = FALSE, cache_loading = TRUE in the config or in your call to [re]load.project(), the files are loaded from the cache. You could even exclude certain volatile files with data_ignore. I think we should treat researchers who have volatile data in data/ that should not always be reloaded as the exception, and improve the workflow for the majority of people. Of course, we should make sure cache_loading = FALSE, data_loading = TRUE also still works as expected.

bugsysiegals commented 5 years ago

To tell whether a file has changed, can we just compare the modified date of the file with the creation date of the cache file? I believe I have seen another "Reproducible Research" project that used a makefile in this way to process only specific files.

bugsysiegals commented 5 years ago

Rather than implementing this in the cache function, wouldn't it be better to implement it directly in the loading function to automate the process? Perhaps a yes/no question could be asked so the user can choose not to load the new file...

KentonWhite commented 5 years ago

Comparing created and modified timestamps is risky. Sometimes modified timestamps are updated by the operating system even though nothing has changed in the file.
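A small base R sketch of that risk, using a temporary file and Sys.setFileTime() to simulate a tool that touches the file without changing its bytes. The file name and contents are made up for illustration; tools::md5sum is used here as a stand-in for whatever digest the cache would compute.

```r
f <- tempfile(fileext = ".csv")
writeLines(c("x", "1", "2"), f)

mtime_before <- file.mtime(f)
hash_before  <- unname(tools::md5sum(f))

# Simulate a copy or sync tool touching the file: same bytes, newer timestamp
Sys.setFileTime(f, mtime_before + 3600)

# A timestamp comparison reports a change, a content hash does not
mtime_changed <- file.mtime(f) != mtime_before
hash_changed  <- unname(tools::md5sum(f)) != hash_before
```

A timestamp-based check would rebuild the cache here even though the data is identical, while a digest-based check would leave it alone.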

Asking the user each time a cache file is about to be updated is also error prone. With many files, the question becomes a nuisance and the user mindlessly hits "y".

Currently, you can pass a list of variable names to clear.cache to rebuild a particular cache.

bugsysiegals commented 5 years ago

Excellent points, thanks for the clarity.

From an automation standpoint, one would simply call clear.cache() prior to load.project() for a full reload?

Perhaps someday another function could be added, or a parameter could be passed to load.project, that compares files. It's not critical, but it would allow a person to automate a workflow end to end and produce results as quickly as possible without needing to reload very large unchanged datasets.
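One possible shape for such a comparison, sketched in base R: hash every file in a directory, keep the hashes as a manifest, and later report only the files whose hash differs. The function names hash_dir and changed_files are hypothetical and nothing like them is claimed to exist in ProjectTemplate.

```r
# Hypothetical helpers for illustration only.
# Hash every file in a directory, named by file name.
hash_dir <- function(dir) {
  files <- list.files(dir, full.names = TRUE)
  hashes <- tools::md5sum(files)
  names(hashes) <- basename(files)
  hashes
}

# Return the names of files that are new or whose content hash changed
# compared with a previously stored manifest.
changed_files <- function(dir, manifest) {
  current <- hash_dir(dir)
  previous <- manifest[names(current)]
  names(current)[is.na(previous) | previous != current]
}

# Usage with a throwaway directory
data_dir <- tempfile("data")
dir.create(data_dir)
writeLines("a,b", file.path(data_dir, "one.csv"))
writeLines("c,d", file.path(data_dir, "two.csv"))

manifest <- hash_dir(data_dir)                        # snapshot after the first load
writeLines("c,d,e", file.path(data_dir, "two.csv"))   # only this file changes
changed <- changed_files(data_dir, manifest)
```

In an automated run, the variables derived from the changed files would be the only ones that need their cache cleared, so large unchanged datasets could stay cached.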

KentonWhite commented 5 years ago

Yes, call clear.cache() before load.project(). What I do is call clear.cache with the datasets I expect will be updated. I'll often make a call to a database. It's difficult for ProjectTemplate to tell whether the database has changed, so I'll call clear.cache() with the name of the dataset read from the database. In an automated workflow, the database is refreshed and everything else stays the same.

bugsysiegals commented 5 years ago

Yes it would really only benefit those who are pulling in files. I'll also be trying to connect to DB's where possible but of course will have to rely on some files. At the end of the day, a few extra minutes to load data isn't going to matter unless I'm sitting there watching it load and getting impatient! :)

Hugovdberg commented 5 years ago

load.project also has a reset argument which clears the cache when set to TRUE.

I agree with @KentonWhite that simply using modification date is tricky. Also, I think, using cache in the reader could make the cache and data loading simpler in load.project.
