fbartusch / snakemake_tutorial

Analysis Variable #4

vsoch opened this issue 6 years ago

vsoch commented 6 years ago

hey @fbartusch, another question for you! I've created some cloud builders that can be launched to run Snakemake on Google Cloud (compute), and I tested the Valgrind (memory) analysis across about 16 different instance types. Since the workflow is tiny (so far), the memory doesn't seem to make a difference. What I think I'd want to do (which would be useful for HPC) is to vary some variable set by the scientist and then assess how the results are influenced. Is Snakemake a bad contender for that? If so, what other things could we vary that would be useful or interesting?

vsoch commented 6 years ago

Here is more detail on what I've done so far (I'm parsing the results from this now) https://github.com/sci-f/snakemake.scif/tree/add/races/results/cloud

fbartusch commented 6 years ago

Hey @vsoch, you can use Snakemake to vary variables in the workflow. Actually, I don't think Snakemake is any worse than other software for that purpose. Maybe you have read this page already? Regarding the example workflow, I think the best steps for trying different variables are the bwa_map and bcftools_call steps. Options for bwa mem are listed here, and I think several of them could have a big influence on the result.

The same goes for the options of bcftools call.

You could add some of these variables into the snakemake workflow and create config files with different variable settings. Then you can specify which variables to use when running snakemake with the --configfile FILE option.
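For illustration, here is a minimal sketch of what that could look like (the config keys min_seed_len and mismatch_penalty are hypothetical names for this example, and the file paths just follow the standard Snakemake tutorial layout; bwa mem's -k/-B flags and Snakemake's params/config mechanics are real):

```python
# Snakefile (sketch): expose two bwa mem parameters through the config.
rule bwa_map:
    input:
        ref="data/genome.fa",
        reads="data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    params:
        k=config["min_seed_len"],      # bwa mem -k: minimum seed length
        b=config["mismatch_penalty"]   # bwa mem -B: mismatch penalty
    shell:
        "bwa mem -k {params.k} -B {params.b} {input.ref} {input.reads} "
        "| samtools view -Sb - > {output}"
```

Each parameter set would then live in its own YAML file, e.g. a hypothetical config_strict.yaml containing `min_seed_len: 25` and `mismatch_penalty: 6`, and you would run `snakemake --configfile config_strict.yaml` once per setting.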

Since the data in this repo is just for testing purposes, I don't know if you'll see big changes in the result if you try other variable settings.

vsoch commented 6 years ago

Okay, so reading the docs I think we want to take the approach you describe: expose the parameters in the workflow, write one config file per parameter setting, and run Snakemake once per config file.

Then I assume we would want to look at the all.vcf file? Or are we still interested in memory and time? Given that we find some difference in a result or runtime metric, is our evaluation then that "the fastest" or "least memory required" is really associated with "best"? In other words, if we were running this grid of metrics for a researcher, what kind of advice would we give them after doing it?

Since the data in this repo is just for testing purposes, I don't know if you'll see big changes in the result if you try other variable settings.

Do you mean to say that you don't think doing the variation will have much influence? I think Snakemake definitely fits the bill for running the kind of comparison we want to do; the much harder part (for me at least) is deciding in advance, if there is some variation (in what?), how we evaluate its goodness.

vsoch commented 6 years ago

The other interesting approach (when talking about variables) would be to show how a single library or piece of software changes over time (calling the same function), or doesn't.

vsoch commented 6 years ago

There are also easy ways to do this with continuous integration, e.g., using a build grid in Travis (see this example: https://github.com/pydicom/pydicom/blob/master/.travis.yml), but there it's harder to have control over the results.
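As a sketch of what such a grid could look like (the SEED_LEN variable and its values are hypothetical, just mirroring the bwa example above; Travis does run one job per entry under env:, and Snakemake's --config flag does accept KEY=VALUE overrides):

```yaml
# .travis.yml (sketch): each entry under `env:` becomes its own build job.
env:
  - SEED_LEN=15
  - SEED_LEN=19
  - SEED_LEN=25
script:
  # Hypothetical: override the config key with this job's parameter setting.
  - snakemake --config min_seed_len=$SEED_LEN
```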

vsoch commented 6 years ago

Ah, and here is an example for Travis-izing CircleCI! https://github.com/michaelcontento/circleci-matrix

fbartusch commented 6 years ago

Given that we find some difference in a result or runtime metric, is our evaluation then that "the fastest" or "least memory required" is really associated with "best"?

No. You want to get meaningful results for your scientific problem; runtime and memory consumption are secondary. The choice of parameters is very situation-dependent and up to the researcher. Time and memory consumption become interesting when you compare two algorithms with comparable input parameters.

Do you mean to say that you don't think doing the variation will have much influence?

I think it will influence the number of variants found. I just don't know how to interpret the changes, since I'm not an expert in this domain. I tried the vcf-stats tool; it creates simple statistics for the .vcf file, like 'indel_count' and 'snp_count'. The parameters I mentioned above will influence the specificity, and thus the number of variants will change.
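If you wanted to fold that comparison into the workflow itself, a rule along these lines could collect the statistics for each parameter setting (a sketch: the run_id wildcard and the directory layout are assumptions for this example; vcf-stats is the VCFtools script mentioned above):

```python
# Sketch: run vcf-stats on the calls produced by each parameter setting.
# The {run_id} wildcard and the calls/<run_id>/ layout are hypothetical.
rule vcf_stats:
    input:
        "calls/{run_id}/all.vcf"
    output:
        "stats/{run_id}.stats"
    shell:
        "vcf-stats {input} > {output}"
```

Diffing the resulting stats files would then show how 'snp_count' and 'indel_count' move as the parameters change.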

The other interesting approach (when talking about variables) would be to show how a single library or piece of software changes over time (calling the same function), or doesn't.

That is really an interesting idea. I don't know if there are good studies about that for popular software.

There are also easy ways to do this with continuous integration

I have never used continuous integration, but I'll keep the CircleCI thing in mind. It looks very convenient.