KentonWhite / ProjectTemplate

A template utility for R projects that provides a skeletal project.
http://projecttemplate.net
GNU General Public License v3.0
623 stars 159 forks source link

Feature Req - selective munging #221

Closed alsmnn closed 6 years ago

alsmnn commented 6 years ago

Report an Issue / Request a Feature

I'm submitting a (Check one with "x") : [ ] bug report [x] feature request


Issue Severity Classification -

(Check one with "x") : [ ] 1 [ ] 2 [x] 3

  1. Low - Minimal impact. Issue is an annoyance, but I can get work done. This indicates the problem causes little impact on operations or that a reasonable circumvention to the problem has been implemented. Usability quirks, requires minor documentation updates, or could be enhance with some minor changes to the function.

Dear Creators, thanks for the awesome package ProjectTemplate. We are currently using PT as a collection of analyses for our Next Generation Sequencing Projects. Sometimes we don´t have all data sets to run all of our collected analyses, e.g. sometimes we don´t have time series data to run a maSigPro analysis. We have corresponding munging files in munge/ for each script in src/ (e.g., munge/01.Analysis01 for src/Analysis01.R), to make these analyses as modular as possible. Some analyses take several minutes to be processed and sometimes we just want to run a subset of all possible analyses we have in our PT. So here comes my question: Is it possible to ignore some munge files in a reload.project(munging_ignore = c("02.*", "03.*", "04.*")) manner? When this is possible, we could just open our Report.Rmd file and switch analyses on or off, with the typical RMarkdown behaviour of evaluating and including code. I hope you get the idea behind it.

Thanks for all of your work!

Expected Behavior

I would like to determine which munging files should be executed and which not.

Current Behavior

all files in munge/ will be executed

Steps to Reproduce Behavior

How can we reproduce the behavior?

  1. load.project()
Possible Solution (optional)

ignore munging script with load.project(list(munging_ignore=c("01.Script1.R", "02.Script2.R"))) corresponding to the data_ignore feature, which is possible in ProjectTemplate v 0.8

Thanks and best regards from Hamburg, Germany

Hugovdberg commented 6 years ago

There is the possibility of adding project specific configuration. You could add some options in lib/globals.R and add a test in each mungescript if it should be run. This allows you to create your own options with a meaningful naming convention that makes sense to your project instead of ignoring a few files. Do you think that would be sufficient for you?

We added the ignore_data option because it's hard to encode in a datafile whether it should be loaded, but since your munge scripts are fully capable R scripts it's easy to add a check whether it should be run or not.


Example lib/globals.R:

add.config(
    run_maSigPro = FALSE
)

And in your munge script:

if (config$run_maSigPro) {
    # Munge script 01
}
Hugovdberg commented 6 years ago

At this point it isn't possible to add custom options to the override.config argument of load.project, but perhaps we could remove the check so you can actually override your settings from lib/globals.R.

alsmnn commented 6 years ago

Thank you for the hint with lib/globals.R! It works, but it is more a workaround, than the functionality I intended. It would still import every data set, which is in data/ (Which is intended and another thing), but I am looking for a more dynamic approach in the project. We have lots of different kind of data and normally I´d run a script to import the data from raw-data/, make sure everything is imported in the right way (header, comma-seperated or semicolon-seperated, etc.) and write it to the data/ directory with the intended naming. With this in mind, it would be awesome if I could just run the import, munging and src scripts independently for every analysis. This would make shure, that just the data is loaded which is needed (some data sets are small, some are quite big). Sometimes a report should just contain one analysis, sometimes up to 8 analyses with all different kinds of plots (more than 20). So it is cumbersome to run all analyses, if only a subset is needed. I know how to change the lib/globals file, in order to get just a subset working, but my colleagues don´t (Yes, they should learn R too!). All of this is possible with RMakrdown and the scripts, except the munging. I could also run everything in different scripts without ProjectTemplate, but than I would loose some functionalities, which are really nice (caching, auto- installing/loading packages, etc.). Yes I can change it via lib/globals.R, but this won´t happen on the fly and isn´t really dynamic, or is that in some kind possbile? I´m sorry, if I am overlooking something, because I´m pretty new to R, RMarkdown, and so on.

Thanks for your help, I really appreciate it!

KentonWhite commented 6 years ago

I think it should be possible to expose functions that do the data loading, data munging and data src scripts separately. These functions are already exposed as helper function for load.project().

@AljoLe would adding something like load.data(), load.munge(), and load.src() help?

alsmnn commented 6 years ago

Hello @KentonWhite, thanks for the fast reply and yes, I think this approach would be exactly what I was looking for.

Hugovdberg commented 6 years ago

When we expose the separate steps for loading the project there still isn't an option to specify which munge scripts should be run. The data and munge steps can be run separately using the options data_loading and munging, but for specifying the munge scripts there is no option yet. The scripts in src aren't run automatically anyway. I think it would be more convenient to remove the check in config.R, or at least keep the extra configuration options. That way you can override the options from lib/globals.R through the same interface as the other configuration options. Expanding my example I posted previously, you could then do this:

Example lib/globals.R:

add.config(
    run_maSigPro = TRUE
)

And in your munge script:

if (config$run_maSigPro) {
    # Munge script 01
}

And to skip the maSigPro you could call in src/eda.R:

load.project(list(run_maSigPro = FALSE))

Would that make life easier for your colleagues? They don't need to change the config in lib/globals.R, but can use meaningful options instead of selecting which munge scripts to run or not to run. This can be especially tricky if you need to exclude combinations of scripts.

I would prefer this option over adding munge_ignore because lib/globals.R has a more general application than only the munge scripts, and I agree that overriding the custom configuration isn't quite user friendly at the moment.

KentonWhite commented 6 years ago

Or if we did expose a load.data() and load.munge()function, could we have them read lib/globals.R? I agree that a load.src() doesn't fit (sorry I wrote that!). src folder is meant for one-off scripts.

Hugovdberg commented 6 years ago

Could you write some example code how to run certain munge scripts selectively in your proposed solution? Because to me it still seems that exposing the load.munge function alone doesn't solve the problem which munge scripts are actually run. There is indeed an extra complication that exposing the functions separately must perform parts of the load sequence, at least loading the configuration. Or load.data and load.munge should just use the config from the .TargetEnv, which should be loaded with load.config first. You could then change the configuration before calling load.data and load.munge, but that seems a lot more complicated to me than to allow overriding the config in load.project. As I understand it now you propose to be able to load projects like this:

library(ProjectTemplate)
load.config()
config$run_maSigPro <- FALSE
load.data()
load.munge() # Or perhaps call it munge.data()

Is that how you envision your solution?

KentonWhite commented 6 years ago

I wasn't thinking of loading individual files separately, just breaking out the steps. So you could run load.data() and it would be equivalent to running load.project() with munging set to false in the configuration file. Similarly, running munge.data() (like that name!) would be equivalent to running load.project() with data.loading and cache.loading set to false. Both functions would respect the global.dcf settings.

Loading individual data files or running individual munge scripts goes against the philosophy of ProjectTemplate. Each Project is an atomic unit. ProjectTemplate simply facilitates loading that atomic unit. If parts of a Project should be loaded separately, then they should be separate Projects.

Does this view make sense?

Hugovdberg commented 6 years ago

I can see a use case for selectively applying munge scripts from a custom template if not all data is available at the start of the project. Perhaps the first statistical analysis determines what physical analyses to perform next. But I agree that ProjectTemplate is meant to be run as a unit, so that's why I suggested to make it easier to override lib/globals.R from load.project. That way end users can build in their own logic to determine if an analysis should be performed.

Your proposal is to just create two shorthands like this:

load.data <- function(override.config = list()) {
    override.config$munging = FALSE
    load.project(override.config)
}
munge.data <- function(override.config = list()) {
    override.config$data_loading = FALSE
    override.config$cache_loading = FALSE
    load.project(override.config)
}

I see no harm in adding these shorthands. But I also don't see how it would actually help @AljoLe, but that doesn't mean we shouldn't add these as well.

alsmnn commented 6 years ago

Hello everybody and thanks for the nice discussion. I hoped that we could run analyses separately, because we determine, as @Hugovdberg said, which analyses to run based on the first inspection. The option to switch an analysis on or off with load.project(list(run_maSigPro = FALSE)) is a nice way to have everything in one place and to decide which script should be used when it is possible to overrite the config like that. Thank you very much! I think exposing functions like load.data and munge.data as @KentonWhite mentioned, wouldn´t help much, because it won´t me allow to selectively load data or munge files. Thank you for your tremendous participation on problems of your normal users! Best regards

KentonWhite commented 6 years ago

Thanks. I do understand better now the problem. We'll need to think about how to expose individual munge scripts. I can see the utility.

Let me play devil's advocate for a moment - not to belittle the idea but to better understand it. Why can't this custom processing be handled in a src script? If after loading the data you determine that analysis B should be run instead of analysis A, why doesn't source('src/analysis_B.R') solve the problem?

Hugovdberg commented 6 years ago

I will try to fix load.project to allow to override custom configuration as well later this week. I think we can change the signature of load.project at the same time to take a ... argument instead of a list, so you can customize the settings as load.project(run_maSigPro = FALSE).

alsmnn commented 6 years ago

@KentonWhite source('src/analysis_B.R') would do the trick, when all of the munging would be there too, but that is not the case and, I hope I get that right, the ProjectTemplate philosophy wants to have munging and src in different places. We created a munging script for each src script, so it can be as modular as possible. Sometimes we don´t have all data, or just want to run certain analyses, but we want our template allways to be complete (with all scripts for all possible analyses). Otherwise we will have version problems (some analyses are v2 some are in v3 all in v4, but again a subset in v5, etc.) and I want to avoid that at all costs.

@Hugovdberg I think this would be the easiest attempt to solve the problem and a pretty neat one, too.

Thanks and best regards,

Hugovdberg commented 6 years ago

Hmm, I missed the remark by @KentonWhite, I think we don't actually need to implement a change specifically to make a selection of the munge scripts. By allowing easy override of custom configuration users can implement this themselves however they see fit. Since the munge scripts are sources as normal R code, the scripts can determine from the configuration whether they should be run or not. We can provide the extensibility without sacrificing convention over configuration. We run all munge scripts, but if a script decides to terminate immediately based on the custom configuration that's fine I guess.

Hugovdberg commented 6 years ago

@AljoLe I changed load.project to be able to override configuration from globals.R, you can test the code by checking out https://github.com/Hugovdberg/ProjectTemplate/tree/override_globals_R (to install it directly use devtools::install_github('Hugovdberg/ProjectTemplate@override_globals_R')). Please let me know if this works for you by replying to the pull request #222.

KentonWhite commented 6 years ago

PR has bee merged. Closing the issue now. Please reopen if there are new issues. Thanks!

alsmnn commented 6 years ago

Hi @Hugovdberg and @KentonWhite, I´m sorry for the late response, but we had other priorities in the last weeks, but I appreciate your work. I downloaded the latest ProjectTemplate and tried to get the selective munging to work. I added some entries in lib/globals.R add.config( QualityControl = FALSE, HeatmapsClustering = FALSE, Deconvolution = FALSE, IPA = FALSE, DESeq = FALSE, HotColdScore = FALSE, TCGA_download = FALSE, Survival = FALSE, Biclustering = FALSE ) These are the several munging scripts for our analyses. The munging scripts start with:

if(config$Survival) {} corresponding to each entry in lib/globals.R

How can I turn them on/off with load.project() or reload.project() ? Thanks for your help and your great work!

Hugovdberg commented 6 years ago

if you want to turn any of these options on dynamically you should change the call to add.config to include the option apply.override = TRUE, so in your example:

add.config(
    QualityControl = FALSE, 
    HeatmapsClustering = FALSE, 
    Deconvolution = FALSE, 
    IPA = FALSE, 
    DESeq = FALSE, 
    HotColdScore = FALSE, 
    TCGA_download = FALSE, 
    Survival = FALSE, 
    Biclustering = FALSE,
    apply.override = TRUE # This turns on the override mechanism
)

Then if you want to run for example the Survival munge scripts you can simply call load.project(Survival = TRUE).

alsmnn commented 6 years ago

Hi @Hugovdberg, thank you for the fast response. It works fine now. The only thing I have to note, is that reloading the project with just one munging script turned on gives the following output to console:

Munging data
Running preprocessing script: 01.QualityControl.R
Running preprocessing script: 02.heatmaps_clustering.R
Running preprocessing script: 03.deconvolution.R
Running preprocessing script: 04.ipa.R
Running preprocessing script: 05.DESeq.R
Running preprocessing script: 06.TCGA_download.R
Running preprocessing script: 07.Hot_Cold_Score.R
Running preprocessing script: 08.Survival.R
Running preprocessing script: 09.Biclustering.R
'select()' returned 1:many mapping between keys and columns
Disp = 0.49403 , BCV = 0.7029
Creating cache entry from global environment: Edge_S
Running preprocessing script: 06.TCGA_download.R
Running preprocessing script: 07.Hot_Cold_Score.R
Running preprocessing script: 08.Survival.R
Running preprocessing script: 09.Biclustering.R

In this case only "DESeq" was turned on, but the script says, that ProjectTemplate is running all munging scripts, although they have the following statement in the beginning of the script:

if(config$Survival){

}

Is it possible to state only the scripts which are actually ran or to give the output, that the other scripts evaluated to FALSE?

Best regards, Aljoscha

Hugovdberg commented 6 years ago

Technically all munge scripts are run, there is no way for load.project to detect whether it would be skipped. You could make it implement this yourself by adding an else clause to the script:

if (config$Survival) {
    # Run script
} else {
    message('\tSkipping survival analysis. Enable by passing "survival = TRUE" to load.project')
}

You could even format the message to make it stand out a little more clearly using the crayon package:

message(crayon::yellow(crayon::bold('Skip')))

library(crayon) # Or add it to globals.dcf
if (config$Survival) {
    # Run script
    message(green(bold('\tRunning survival analysis')))
} else {
    message(bold(yellow('\tSkipping survival analysis.'), 'Enable by passing "survival = TRUE" to load.project'))
}