Closed nsheff closed 3 years ago
I can confirm this works really well. I just set up a functional demo.
Here's a working demo. It's really easy actually.
What are your thoughts on how we would handle prealignments?
Currently, it's a space-delimited list of genome names. Due to that, there is also a fair bit of logic in the pipeline built on the assembly identifiers, in addition to paths to indices.
Would we be asking for two variables: 1) a list of prealignment names and 2) a variable that is a space-delimited path of indices? To muddy that a bit more, are there separate ones for bowtie2 versus bwa based indices? If we would ask for both (names and index path), that would have the added complexity of matching prealignment genome identifier to its corresponding index (order of entry based? dangerous).
We could rework the use of the assembly identifiers in the code, but then we'd have to rethink how we report things which normally would be "clean" because they are simply the genome identifier, e.g. "Map to rCRSd", and without an identifier, may be something like "Map to /your/path/to/bowtie2_index/identifier/". Thoughts?
good point.I think the names could be inferred from the space-delimited list of paths to indices, right?
But maybe this means it's time to switch to a config-based approach to this, where instead of taking everything on the command-line, you'd take a config file, that would look like:
read1:
- x1.fq
- y1.fq
read2:
- x2.fq
- y2.fq
prealignments:
rCRSd: /path/to/index
genome_alignment:
hg38: /path/to/index
Two questions:
pepatac.yaml
config file that specifies settings for the tools? are they combined?@jpsmith5 I think I have a working solution using the new looper.write_custom_template
plugin, in conjuction with the refgenconf.looper_refgenie_populate
plugin, which I demonstrated with an example here: https://github.com/refgenie/refgenie_looper_demo
Basically you'd write a custom config file and pass that; the config file can do the lookups required.
Let me know what you think and if you want to pursue this.
Does the config file replace all command line arguments then? And therefore the template is covering all the bases? e.g.
output_parent: {{ looper.results_subdir }}
cores: {{ compute.cores }}
mem: {{ compute.mem }}
sample_name: {{ sample.sample_name }}
input: {{ sample.read1 }}
input2: {{ sample.read2 }}
single_or_paired: {{ sample.read_type }}
genome: {{ sample.genome }}
chrom_sizes: {{ refgenie[sample.genome]["fasta"]["chrom_sizes"] }}
TSS_name: {{ refgenie[sample.genome]["refgene_anno"]["refgene_tss"] }}
blacklist: {{ refgenie[sample.genome]["blacklist"] }}
anno_name: {{ refgenie[sample.genome]["feat_annotation"] }}
{% if sample.trimmer is defined %} trimmer: {{ sample.trimmer }} {% endif %}
{% if sample.aligner is defined %} aligner: {{ sample.aligner }} {% endif %}
{% if sample.aligner == "bowtie2" %}
{% if sample.genome_index is defined %}
genome_index: {{ sample.genome_index }}
{% else %}
genome_index: {{ refgenie[sample.genome]["bowtie2_index"]["dir"] }}
refgenie:
prealignments:{% for p in sample.prealignments %}
{{ p }}: {{ refgenie[p]["bowtie2_index"]["dir"] }}{% endfor %}
{% endif %}
{% else %}
{% if sample.genome_index is defined %}
genome_index: {{ sample.genome_index }}
{% else %}
genome_index: {{ refgenie[sample.genome]["bwa_index"]["dir"] }}
refgenie:
prealignments:{% for p in sample.prealignments %}
{{ p }}: {{ refgenie[p]["bwa_index"]["dir"] }}{% endfor %}
{% endif %}
{% endif %}
{% if sample.deduplicator is defined %} deduplicator: {{ sample.deduplicator }} {% endif %}
{% if sample.peak_caller is defined %} peak_caller: {{ sample.peak_caller }} {% endif %}
{% if sample.peak_type is defined %} peak_type: {{ sample.peak_type }} {% endif %}
{% if sample.extend is defined %} extend: {{ sample.extend }} {% endif %}
{% if sample.genome_size is defined %} genome_size: {{ sample.genome_size }} {% endif %}
{% if sample.frip_ref_peaks is defined %} frip-ref-peaks: {{ sample.frip_ref_peaks }} {% endif %}
{% if sample.motif is defined %} motif: {% if sample.motif %} true {% else %} false {% endif %} {% endif %}
{% if sample.sob is defined %} sob: {% if sample.sob %} true {% else %} false {% endif %} {% endif %}
{% if sample.sob is defined %} search_file: {{ refgenie.tallymer_index.search_file }} {% endif %}
{% if sample.no_scale is defined %} no_scale: {% if sample.no_scale %} true {% else %} false {% endif %} {% endif %}
{% if sample.prioritize is defined %} prioritize: {% if sample.prioritize %} true {% else %} false {% endif %} {% endif %}
{% if sample.keep is defined %} keep: {% if sample.keep %} true {% else %} false {% endif %} {% endif %}
{% if sample.no_fifo is defined %} noFIFO: {% if sample.noFIFO %} true {% else %} false {% endif %} {% endif %}
{% if sample.lite is defined %} lite: {% if sample.lite %} true {% else %} false {% endif %} {% endif %}
{% if sample.skipqc is defined %} skipqc: {% if sample.skipqc %} true {% else %} false {% endif %} {% endif %}
What about the parameters that are currently in pepatac.yaml ? Do you want to merge those into this somehow? How?
Yeah there are default parameters for each possible alignment tool and peak calling tool. There are also parameter settings for motif calling and samtools quality filtering. Alternatively, would you still pass the pipeline config file through the looper pypiper argument group -C
argument, and have a separate sample parameters/config file argument for the sample based arguments?
the pipeline config file could be an option in the sample config file
Yeah, that could work. So, in this scenario, the pipeline config file lives in its current location, but it is pointed to in the sample configuration file? And then in that way, you could create separate custom pipeline configuration files if you wanted to change settings for particular samples? That's a positive too.
your config file above is basically the same as your command template in the pipeline interface:
So you could in fact do all this in the pipeline interface anyway, and forget about the config thing, if you want.
I'll test it and see if this works. Previously I wasn't sure if I had the functionality to perform a for loop in the command template, but it sounds like that should be fine?
the command template is just a jinja template.
the command template is just a jinja template. 👍
PEPATAC originally took paths to assets. In fact, I think it still can take paths to some assets direclty. But then we adapted it to accept just a refgenie genome, instead, and then it finds the paths it needs. This is convenient because you just pass
--genome X
, and then it will use refgenie to find the paths of interest. But, one disadvantage is that it relies on regenie. Does PEPATAC allow to pass paths directly, like to bowtie2 indexes, if you don't use refgenie? There's no CLI argument for bowtie2 index, so I assume not. That is not ideal.A new solution
Another solution is now possible using the new
refgenie populate
and looper plugin (documented here: http://refgenie.databio.org/en/latest/populate/). We can go back to the other way of doing it where PEPATAC is unaware of refgenie, and just takes file paths for each asset on the CLI. looper can handle the refgenie-awareness using the plugin.Advantages
Disadvantages
--genome
one, and handled by refgenie.How to do it
I think adding this to the pipeline interface would do the trick:
Then you'd update it like
command_template: --bowtie2_index {pipeline.var_templates.bowtie2_index}
.