Add option to run disambiguate or separate reads from two organisms (i.e. host vs. virus or host vs. parasite)

skchronicles commented 1 year ago

Given two reference genomes, it would be awesome if we could examine the % composition of each organism and split the reads for each respective organism. This would allow a user to take those split reads and run them in any of our other pipelines, depending on the project goal.

Options:

[x] Add disambiguate to pipeline
- Notes: takes aligned reads as input, would need to add different aligners for different inputs (DNA vs. RNA)

~~[ ] Run fastq_screen on both genomes, create bowtie2 indices and fastq_screen config file on the fly, split reads on tag~~

~~Notes: running bowtie2 on RNA is not ideal~~

skchronicles commented 1 year ago

Let's add the disambiguate option. This means we will need to add an alignment step: use bwa-mem if DNA, or STAR if RNA. We will need to add a new option to get the non-host genomic fasta file, --disambiguate-secondary-genome, and a second conditionally required option to specify whether the source is DNA or RNA (type: enum), --disambiguate-source.
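A rough sketch of the aligner choice, just to make the DNA vs. RNA branching concrete. The `ALIGNERS` mapping and the `select_aligner` helper are hypothetical names for illustration, not existing pipeline code:

```python
# Hypothetical mapping from the --disambiguate-source value to the aligner
# named in this issue (bwa-mem for DNA, STAR for RNA).
ALIGNERS = {"DNA": "bwa-mem", "RNA": "STAR"}


def select_aligner(source: str) -> str:
    """Return the aligner name for a given source type (DNA or RNA)."""
    key = source.strip().upper()
    if key not in ALIGNERS:
        valid = ", ".join(sorted(ALIGNERS))
        raise ValueError(f"Unknown source '{source}', expected one of: {valid}")
    return ALIGNERS[key]
```

Accepting lowercase input here is a convenience assumption; the real option could just as well be strict about case.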

If any --disambiguate-* option is provided, they all need to be provided: i.e. mutually inclusive group = --disambiguate-source, --disambiguate-host-genome, and --disambiguate-secondary-genome. This needs to be enforced.
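One possible way to enforce the mutually inclusive group with argparse, which has no built-in support for it, is to check the parsed values after the fact. This is only a sketch: the `parse_args` wrapper, the `weave-sketch` program name, and the `DNA`/`RNA` choice spellings are assumptions, while the three option names come from this issue.

```python
import argparse

# Destination names argparse derives from the three --disambiguate-* options.
DISAMBIGUATE_OPTS = (
    "disambiguate_source",
    "disambiguate_host_genome",
    "disambiguate_secondary_genome",
)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="weave-sketch")
    parser.add_argument("--disambiguate-source", choices=["DNA", "RNA"], default=None)
    parser.add_argument("--disambiguate-host-genome", default=None)
    parser.add_argument("--disambiguate-secondary-genome", default=None)
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # All-or-nothing check: if any option was given, every option must be given.
    provided = [opt for opt in DISAMBIGUATE_OPTS if getattr(args, opt) is not None]
    if provided and len(provided) != len(DISAMBIGUATE_OPTS):
        missing = [
            "--" + opt.replace("_", "-")
            for opt in DISAMBIGUATE_OPTS
            if opt not in provided
        ]
        # parser.error prints the usage message and exits with status 2.
        parser.error("missing required disambiguate option(s): " + ", ".join(missing))
    return args
```

Running with only `--disambiguate-source DNA` would exit with a usage error listing the two missing options, while providing none of them leaves disambiguation off.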

See below for a description of the --disambiguate-host-genome option.

skchronicles commented 11 months ago

Lastly, add an option to capture the host/primary reference genome: --disambiguate-host-genome. This option should take either an alias from an enum of pre-built genomes + indices (i.e. human (hg19 | hg38), mouse (mm10 | mm39), rhesus (rheMac8), bat, tick, etc.) or a genomic fasta file (inferred by checking that it is a file and ends with one of the following extensions: '.fa' or '.fasta'). If a genomic fasta file is provided, an index will be built on the fly.
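The alias-or-fasta inference could look something like the sketch below. `resolve_host_genome` and `PREBUILT_ALIASES` are hypothetical names, and the alias set only includes the builds spelled out above (the bat and tick builds are not named in this issue, so they are left out):

```python
import os

# Assumed alias set: only the builds explicitly named in this issue.
PREBUILT_ALIASES = {"hg19", "hg38", "mm10", "mm39", "rheMac8"}
FASTA_EXTENSIONS = (".fa", ".fasta")


def resolve_host_genome(value, aliases=PREBUILT_ALIASES):
    """Classify a --disambiguate-host-genome value.

    Returns ("alias", name) for a pre-built genome, or ("fasta", abs_path)
    for an existing file ending in .fa or .fasta (index built on the fly).
    """
    if value in aliases:
        return ("alias", value)
    if os.path.isfile(value) and value.endswith(FASTA_EXTENSIONS):
        return ("fasta", os.path.abspath(value))
    raise ValueError(
        f"'{value}' is neither a known alias ({', '.join(sorted(aliases))}) "
        f"nor an existing file ending in {' or '.join(FASTA_EXTENSIONS)}"
    )
```

Checking the alias first means a local file that happens to be named like an alias would never be picked up; that ordering is a design choice worth deciding on explicitly.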

~~Please note:~~

STAR also needs an annotation to accurately align against the transcriptome. With that being said, if --disambiguate-source is set to RNA, then we will need a way to capture the annotation for the host/primary and secondary genomes. As such, we will need two more options to capture this information:

~~--disambiguate-host-gtf: Annotation in GTF format of the host organism, can be used to override an alias's default~~

~~--disambiguate-secondary-gtf: Annotation in GTF format of the secondary/other organism~~

Updated: Ignore adding a way to capture the annotation above (strike-through text above); we shouldn't need to do that for our purposes.

OpenOmics / weave

Add option to run disambiguate or separate reads from two organisms (i.e. host vs. virus or host vs. parasite) #25