Closed rsangole closed 6 years ago
The `data_ignore` option in `global.dcf` was designed for that purpose; it can also be given in the optional list argument to `(re)load.project`.
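A minimal sketch of passing the option as a list, assuming a ProjectTemplate version (v0.8 or later) that supports `data_ignore`. The file name `big_survey.csv` and the directory `archive/` are hypothetical placeholders; the same value could presumably be set permanently in `config/global.dcf` as `data_ignore: big_survey.csv, archive/`.

```r
# Hypothetical entries: one file and one sub-directory of data/ to skip.
# Replace them with names from your own data/ folder.
overrides <- list(data_ignore = "big_survey.csv, archive/")

# Only attempt the load when ProjectTemplate is installed and the working
# directory looks like a project root.
if (requireNamespace("ProjectTemplate", quietly = TRUE) &&
    file.exists(file.path("config", "global.dcf"))) {
  ProjectTemplate::load.project(overrides)
}
```

Passing the override at call time keeps `global.dcf` unchanged, which may be convenient when different analysis stages need different subsets of the data.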
On 8 Sep 2017 at 09:09, "Rahul S" notifications@github.com wrote:
I have a project where I'm dealing with a few large datasets in memory (30 million+ rows). Since reading in the raw files and the processing operations are computationally expensive, I'm using the cache feature to save some intermediate dataframes and results. The challenge is that I have 3-4 datasets which are ~500MB to 5GB each.
Depending on where I am in my workflow, I only want to load one of the five .Rdata files from /cache to save time as well as memory, when I call load.project().
I have similar problems when I have large datasets in the /data folder but only need a select few to auto-load.
If we can come up with a way to be selective on which datasets load.project() reads, it'll help use this package on projects with massive data.
@Hugovdberg is this documented somewhere? I don't recall reading about `data_ignore` anywhere... I'm sure I've missed something.
It should be documented on the ProjectTemplate website. I still need to push the latest (v0.8) changes to the website.
Ah, okay, let me know once it's up @KentonWhite
For now you can also look for `data_ignore` in `mastering.markdown` in the website directory of this project. It's perhaps a little hard to read, but you can already get started.
@KentonWhite @Hugovdberg following up on this old request - if the v0.8 is pushed to the website, shall I close this request out?
Don't think v0.8 has been pushed yet. Will check and push if required.
v0.8 is pushed to the website.
I'm sorry for hijacking this issue, but I think my question is relevant to this.
Is it somehow possible to ignore special files in `cache/`?
If I understand the Mastering ProjectTemplate documentation correctly, the names of files in `cache/` are matched to the files in `data/` in order to ignore them when `data_ignore` is set to a specific pattern. What happens when there is no corresponding file in `data/` because the cached object differs from the files in `data/` and therefore has a different name?
Example: `ESCA_se` holds the samples of one cancer entity from The Cancer Genome Atlas, and there are 33 such entities. The download and normalization process is quite resource-intensive and takes a lot of wall-clock time, so I cache the normalized objects, but I don't need every cached object in every analysis. I would therefore like to choose which cached files are loaded.
I have the files `ESCA_se.RData` and `ESCA_se.hash` in my `cache/` directory and there is no corresponding file in `data/`. I don't need `ESCA_se` for every analysis and it is quite big.
So I want to make sure that it isn't loaded, using `reload.project(list(data_ignore = "ESCA*"))`, but that doesn't work.
I hope you can help me out.
Cheers, @AljoLe
@AljoLe That use case makes a lot of sense. The expectation is that `data_ignore = ...` should not load data, even if it is cached. Would you mind opening a bug report for this issue please?
@KentonWhite Yes, I will open a bug report for this.
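Until `data_ignore` also covers cached objects, one possible workaround is to load selected cache files by hand with base R. The sketch below is not part of ProjectTemplate: `load_cached` is a hypothetical helper, and it assumes each cached object is stored as `<name>.RData` in `cache/`, as with the `ESCA_se.RData` file described above.

```r
# Hypothetical helper (not a ProjectTemplate function): load only the named
# objects from a cache directory, assuming one <name>.RData file per object.
load_cached <- function(names, cache_dir = "cache", envir = globalenv()) {
  for (name in names) {
    path <- file.path(cache_dir, paste0(name, ".RData"))
    if (!file.exists(path)) {
      stop("No cache file found at ", path)
    }
    # load() restores the saved objects into the chosen environment
    load(path, envir = envir)
  }
  invisible(names)
}
```

In a project, this might be combined with `load.project(list(cache_loading = FALSE))` to switch off automatic cache loading, followed by `load_cached("ESCA_se")` only in the analyses that need it. With cache loading disabled, files in `data/` may be read again from source, so this probably suits cases like the one above, where the cached objects have no counterpart in `data/`.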