Refgenie integration - Githubissues

nsheff commented 3 years ago

PEPATAC originally took paths to assets. In fact, I think it still can take paths to some assets direclty. But then we adapted it to accept just a refgenie genome, instead, and then it finds the paths it needs. This is convenient because you just pass --genome X, and then it will use refgenie to find the paths of interest. But, one disadvantage is that it relies on regenie. Does PEPATAC allow to pass paths directly, like to bowtie2 indexes, if you don't use refgenie? There's no CLI argument for bowtie2 index, so I assume not. That is not ideal.

A new solution

Another solution is now possible using the new refgenie populate and looper plugin (documented here: http://refgenie.databio.org/en/latest/populate/). We can go back to the other way of doing it where PEPATAC is unaware of refgenie, and just takes file paths for each asset on the CLI. looper can handle the refgenie-awareness using the plugin.

Advantages

The pipeline/looper is aware of all files necessary at the beginning. This is required in CWL, since you have to specify all this so they can be staged. This may become necessary in the future.
The pipeline is possible to use without refgenie
It simplifies the pipeline itself, because it doesn't need to rely on refgenie or retrieve assets, etc; it just gets file paths or strings or whatever.
It simplifies dependencies because now it's not that every pipeline depends on refgenie; refgenie dependency is isolated to the looper stage.

Disadvantages

if you're running the pipeline without looper, it's harder to use since you have to specify all these assets now individually.
it inflates the CLI interface with additional arguments, previously were satisfied by the --genome one, and handled by refgenie.

How to do it

I think adding this to the pipeline interface would do the trick:

var_templates:
  refgenie_config: "$REFGENIE"
  bowtie2_index: "refgenie://{sample.genome}/bowtie2_index"
  other_asset: "refgenie://{sample.genome}/asset_name"
  ...
pre_submit:
  python_functions:
  - refgenconf.looper_refgenie_populate

Then you'd update it like command_template: --bowtie2_index {pipeline.var_templates.bowtie2_index}.

nsheff commented 3 years ago

I can confirm this works really well. I just set up a functional demo.

nsheff commented 3 years ago

Here's a working demo. It's really easy actually.

https://github.com/refgenie/refgenie_looper_demo

jpsmith5 commented 3 years ago

What are your thoughts on how we would handle prealignments?

Currently, it's a space-delimited list of genome names. Due to that, there is also a fair bit of logic in the pipeline built on the assembly identifiers, in addition to paths to indices.

Would we be asking for two variables: 1) a list of prealignment names and 2) a variable that is a space-delimited path of indices? To muddy that a bit more, are there separate ones for bowtie2 versus bwa based indices? If we would ask for both (names and index path), that would have the added complexity of matching prealignment genome identifier to its corresponding index (order of entry based? dangerous).

We could rework the use of the assembly identifiers in the code, but then we'd have to rethink how we report things which normally would be "clean" because they are simply the genome identifier, e.g. "Map to rCRSd", and without an identifier, may be something like "Map to /your/path/to/bowtie2_index/identifier/". Thoughts?

nsheff commented 3 years ago

good point.I think the names could be inferred from the space-delimited list of paths to indices, right?

But maybe this means it's time to switch to a config-based approach to this, where instead of taking everything on the command-line, you'd take a config file, that would look like:

read1: 
 - x1.fq
 - y1.fq
read2:
- x2.fq
- y2.fq
prealignments:
  rCRSd: /path/to/index
genome_alignment:
  hg38: /path/to/index

Two questions:

how would you balance what's in the config file vs. on the command line? Would everything go into the config file and the CLI gets simplified to just 1 argument: the config file?
how does this play with the pepatac.yaml config file that specifies settings for the tools? are they combined?

nsheff commented 3 years ago

@jpsmith5 I think I have a working solution using the new looper.write_custom_template plugin, in conjuction with the refgenconf.looper_refgenie_populate plugin, which I demonstrated with an example here: https://github.com/refgenie/refgenie_looper_demo

Basically you'd write a custom config file and pass that; the config file can do the lookups required.

Let me know what you think and if you want to pursue this.

jpsmith5 commented 3 years ago

Does the config file replace all command line arguments then? And therefore the template is covering all the bases? e.g.

output_parent: {{ looper.results_subdir }}
cores: {{ compute.cores }}
mem: {{ compute.mem }}
sample_name: {{ sample.sample_name }}
input: {{ sample.read1 }}
input2: {{ sample.read2 }}
single_or_paired: {{ sample.read_type }}
genome: {{ sample.genome }}
chrom_sizes: {{ refgenie[sample.genome]["fasta"]["chrom_sizes"] }}
TSS_name: {{ refgenie[sample.genome]["refgene_anno"]["refgene_tss"] }}
blacklist: {{ refgenie[sample.genome]["blacklist"] }}
anno_name: {{ refgenie[sample.genome]["feat_annotation"] }}
{% if sample.trimmer is defined %} trimmer: {{ sample.trimmer }} {% endif %}
{% if sample.aligner is defined %} aligner: {{ sample.aligner }} {% endif %}
{% if sample.aligner == "bowtie2" %}
  {% if sample.genome_index is defined %}
    genome_index: {{ sample.genome_index }} 
  {% else %}
    genome_index: {{ refgenie[sample.genome]["bowtie2_index"]["dir"] }}
    refgenie: 
      prealignments:{% for p in sample.prealignments %}
        {{ p }}: {{ refgenie[p]["bowtie2_index"]["dir"] }}{% endfor %}
  {% endif %}
{% else %}
  {% if sample.genome_index is defined %}
    genome_index: {{ sample.genome_index }} 
  {% else %}
    genome_index: {{ refgenie[sample.genome]["bwa_index"]["dir"] }}
    refgenie: 
      prealignments:{% for p in sample.prealignments %}
        {{ p }}: {{ refgenie[p]["bwa_index"]["dir"] }}{% endfor %}
  {% endif %} 
{% endif %}
{% if sample.deduplicator is defined %} deduplicator: {{ sample.deduplicator }} {% endif %}
{% if sample.peak_caller is defined %} peak_caller: {{ sample.peak_caller }} {% endif %}
{% if sample.peak_type is defined %} peak_type: {{ sample.peak_type }} {% endif %}
{% if sample.extend is defined %} extend: {{ sample.extend }} {% endif %}
{% if sample.genome_size is defined %} genome_size: {{ sample.genome_size }} {% endif %}
{% if sample.frip_ref_peaks is defined %} frip-ref-peaks: {{ sample.frip_ref_peaks }} {% endif %}
{% if sample.motif is defined %} motif: {% if sample.motif %} true {% else %} false {% endif %} {% endif %}
{% if sample.sob is defined %} sob: {% if sample.sob %} true {% else %} false {% endif %} {% endif %}
{% if sample.sob is defined %} search_file: {{ refgenie.tallymer_index.search_file }} {% endif %}
{% if sample.no_scale is defined %} no_scale: {% if sample.no_scale %} true {% else %} false {% endif %} {% endif %}
{% if sample.prioritize is defined %} prioritize: {% if sample.prioritize %} true {% else %} false {% endif %} {% endif %}
{% if sample.keep is defined %} keep: {% if sample.keep %} true {% else %} false {% endif %} {% endif %}
{% if sample.no_fifo is defined %} noFIFO: {% if sample.noFIFO %} true {% else %} false {% endif %} {% endif %}
{% if sample.lite is defined %} lite: {% if sample.lite %} true {% else %} false {% endif %} {% endif %}
{% if sample.skipqc is defined %} skipqc: {% if sample.skipqc %} true {% else %} false {% endif %} {% endif %}

nsheff commented 3 years ago

What about the parameters that are currently in pepatac.yaml ? Do you want to merge those into this somehow? How?

jpsmith5 commented 3 years ago

Yeah there are default parameters for each possible alignment tool and peak calling tool. There are also parameter settings for motif calling and samtools quality filtering. Alternatively, would you still pass the pipeline config file through the looper pypiper argument group -C argument, and have a separate sample parameters/config file argument for the sample based arguments?

nsheff commented 3 years ago

the pipeline config file could be an option in the sample config file

jpsmith5 commented 3 years ago

Yeah, that could work. So, in this scenario, the pipeline config file lives in its current location, but it is pointed to in the sample configuration file? And then in that way, you could create separate custom pipeline configuration files if you wanted to change settings for particular samples? That's a positive too.

nsheff commented 3 years ago

your config file above is basically the same as your command template in the pipeline interface:

https://github.com/databio/pepatac/blob/124df2f70f70a26759c0bbf9386835d05a99496f/sample_pipeline_interface.yaml#L6-L35

So you could in fact do all this in the pipeline interface anyway, and forget about the config thing, if you want.

jpsmith5 commented 3 years ago

I'll test it and see if this works. Previously I wasn't sure if I had the functionality to perform a for loop in the command template, but it sounds like that should be fine?

nsheff commented 3 years ago

the command template is just a jinja template.

jpsmith5 commented 3 years ago

the command template is just a jinja template. 👍

databio / pepatac

Refgenie integration #187

A new solution

Advantages

Disadvantages

How to do it