KentonWhite / ProjectTemplate

A template utility for R projects that provides a skeletal project.
http://projecttemplate.net
GNU General Public License v3.0

Feature Request: Selective cache/data loading routine #209

Closed rsangole closed 6 years ago

rsangole commented 7 years ago

I have a project where I'm dealing with a few large datasets in memory (30 million+ rows). Since reading the raw files and processing them are computationally expensive, I'm using the cache feature to save some intermediate data frames and results. The challenge is that I have 3-4 datasets which are ~500 MB to 5 GB each.

Depending on where I am in my workflow, I want load.project() to load only one of the five .Rdata files from /cache, to save both time and memory.

I have a similar problem when I have large datasets in the /data folder but only need a select few to auto-load.

If we can come up with a way to be selective about which datasets load.project() reads, it'll help when using this package on projects with massive data.
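
One possible workaround for the cache side of this request is to load individual cache files by hand with base R. The following is a minimal sketch, assuming automatic cache loading has been switched off in the project configuration and that the cache holds one `<name>.RData` file per object. `load_selected_cache()` is a hypothetical helper, not a ProjectTemplate function, and the demo objects are made up.

```r
# Hypothetical helper, not part of ProjectTemplate: load only the named
# objects from a directory that holds one <name>.RData file per object.
load_selected_cache <- function(names, cache_dir = "cache", envir = globalenv()) {
  loaded <- character(0)
  for (name in names) {
    path <- file.path(cache_dir, paste0(name, ".RData"))
    if (!file.exists(path)) {
      warning("No cache file found for '", name, "'")
      next
    }
    # load() restores the saved objects into `envir` and returns their names
    loaded <- c(loaded, load(path, envir = envir))
  }
  invisible(loaded)
}

# Self-contained demonstration in a temporary directory (made-up objects)
demo_cache <- file.path(tempdir(), "cache_demo")
dir.create(demo_cache, showWarnings = FALSE)
small_df <- data.frame(x = 1:3)
large_df <- data.frame(x = 1:1000)
save(small_df, file = file.path(demo_cache, "small_df.RData"))
save(large_df, file = file.path(demo_cache, "large_df.RData"))

# Load only small_df into a fresh environment; large_df stays on disk
demo_env <- new.env()
loaded <- load_selected_cache("small_df", cache_dir = demo_cache, envir = demo_env)
```

In a real project the call would look something like `load_selected_cache("my_object")` after `load.project()`, where `my_object` stands in for whichever cached object the current step needs.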

Hugovdberg commented 7 years ago

The data_ignore option in global.dcf was designed for that purpose. It can also be given in the optional list argument to load.project() or reload.project().
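
To make the config format concrete, here is a small sketch of what such an entry might look like and how base R parses it. It assumes, based on my reading of the Mastering ProjectTemplate page, that the option takes a comma-separated list of file names in which regular expressions are wrapped in slashes. The file names are hypothetical, and the splitting step is only an illustration, not ProjectTemplate's own parsing code.

```r
# Write a minimal config file to a temporary location.
# The entries (raw_events.csv, archive/) are hypothetical examples.
config_path <- file.path(tempdir(), "global.dcf")
writeLines("data_ignore: raw_events.csv, /\\.tsv$/, archive/", config_path)

# DCF is a "key: value" format that base R reads with read.dcf()
config <- read.dcf(config_path)

# Split the comma-separated value into individual entries (illustration only)
ignore_entries <- trimws(strsplit(config[1, "data_ignore"], ",")[[1]])
print(ignore_entries)
```

The list form mentioned above would then be a call such as `load.project(list(data_ignore = "raw_events.csv"))`, which mirrors the call style used later in this thread.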

(Sent by email in reply to "Rahul S", 8 Sep 2017, 09:09.)


rsangole commented 7 years ago

@Hugovdberg is this documented somewhere? I don't recall reading about data_ignore anywhere... I'm sure I've missed something.

KentonWhite commented 7 years ago

It should be documented on the ProjectTemplate website. I still need to push the latest (v0.8) changes to the website.

rsangole commented 7 years ago

Ah, okay, let me know once it's up @KentonWhite

Hugovdberg commented 7 years ago

For now you can also look for data_ignore in mastering.markdown in the website directory of this project. It's perhaps a little hard to read, but you can already get started.

rsangole commented 6 years ago

@KentonWhite @Hugovdberg following up on this old request: if v0.8 has been pushed to the website, shall I close this request out?

KentonWhite commented 6 years ago

Don't think v0.8 has been pushed yet. Will check and push if required.

KentonWhite commented 6 years ago

v0.8 is pushed to the website.

alsmnn commented 5 years ago

I'm sorry for hijacking this issue, but I think my question is relevant to it. Is it somehow possible to ignore specific files in cache/? If I understand the Mastering ProjectTemplate documentation correctly, the names of files in cache/ are matched to the files in data/, so that these files are ignored when data_ignore is set to a specific pattern. What happens when there is no corresponding file in data/, because the cached object differs from the files in data/ and therefore has another name?

Example: ESCA_se is an entity of cancer samples from The Cancer Genome Atlas, and there are 33 of them. The download and normalization process is quite resource-intensive and takes a lot of wall-clock time, so I cache the normalized objects, but I don't need every cached object in every analysis. I would therefore like to determine which cached files are loaded.

I have the files ESCA_se.RData and ESCA_se.hash in my cache/ directory, and there is no corresponding file in data/. I don't need ESCA_se for every analysis, and it is quite big. So I want to make sure that it isn't loaded, using reload.project(list(data_ignore = "ESCA*")), but that doesn't work.

I hope you can help me out.

Cheers, @AljoLe

KentonWhite commented 5 years ago

@AljoLe That use case makes a lot of sense. The expectation is that data_ignore = ... should not load data, even if it is cached. Would you mind opening a bug report for this issue please?
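
The expected behaviour could be sketched roughly as follows: cache files whose object name matches an ignore pattern are dropped before anything is loaded, whether or not a matching file exists in data/. This is a hypothetical illustration and not ProjectTemplate's implementation. `filter_cache_files()` is an invented helper, treating the pattern as a shell-style glob is an assumption made for the sketch, and every file name other than ESCA_se.RData is made up.

```r
# Hypothetical illustration only; this is not ProjectTemplate's implementation.
# Drops cache files whose object name matches any of the given glob patterns.
filter_cache_files <- function(cache_files, ignore_globs) {
  if (length(ignore_globs) == 0) return(cache_files)
  # Compare on the object name, i.e. the file name without its extension
  object_names <- tools::file_path_sans_ext(basename(cache_files))
  ignored <- rep(FALSE, length(cache_files))
  for (glob in ignore_globs) {
    # glob2rx() converts a shell-style wildcard into a regular expression
    ignored <- ignored | grepl(glob2rx(glob), object_names)
  }
  cache_files[!ignored]
}

# ESCA_se.RData comes from the report above; the other names are made up
cache_files <- c("ESCA_se.RData", "BRCA_se.RData", "clinical.RData")
kept <- filter_cache_files(cache_files, ignore_globs = "ESCA*")
print(kept)
```

Matching on the cached object name rather than on a file in data/ is the design choice that would cover the case reported here, where the cached object has no counterpart in data/.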

alsmnn commented 5 years ago

@KentonWhite Yes, I will open a bug report for this.