Closed skchronicles closed 9 months ago
Let's add the disambiguate option. This means we will need to add an alignment step: use bwa-mem if DNA, or STAR if RNA. We will need to add a new option to get the non-host genomic fasta file, --disambiguate-secondary-genome, and a second conditionally required option to specify whether the source is DNA or RNA (type: enum), --disambiguate-source.
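A minimal sketch of how the source type could pick the aligner. The names `Source`, `ALIGNER_FOR_SOURCE`, and `select_aligner` are hypothetical and do not come from the pipeline; only the DNA → bwa-mem and RNA → STAR mapping is from this issue.

```python
from enum import Enum


class Source(Enum):
    """Allowed values for --disambiguate-source (hypothetical enum)."""
    DNA = "DNA"
    RNA = "RNA"


# Mapping described in this issue: bwa-mem for DNA, STAR for RNA.
ALIGNER_FOR_SOURCE = {
    Source.DNA: "bwa-mem",
    Source.RNA: "STAR",
}


def select_aligner(source):
    """Return the aligner name for a source string.

    Raises ValueError if the string is not a valid Source value.
    """
    return ALIGNER_FOR_SOURCE[Source(source)]
```

Using an enum here means an unsupported value fails early, before any alignment job is scheduled.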
If any --disambiguate-* option is provided, they all need to be provided, i.e. the mutually inclusive group is --disambiguate-source, --disambiguate-host-genome, and --disambiguate-secondary-genome. This needs to be enforced.
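One way the all-or-none rule could be enforced, assuming the pipeline parses its options with Python's argparse (an assumption, not confirmed here). argparse has no built-in mutually inclusive group, so this sketch checks after parsing. `build_parser` and `check_disambiguate_group` are hypothetical names.

```python
import argparse

# argparse attribute names for the three options listed in this issue.
DISAMBIGUATE_OPTIONS = (
    "disambiguate_source",
    "disambiguate_host_genome",
    "disambiguate_secondary_genome",
)


def build_parser():
    """Build a parser with only the disambiguate options (illustrative)."""
    parser = argparse.ArgumentParser(prog="pipeline")
    parser.add_argument("--disambiguate-source", choices=["DNA", "RNA"], default=None)
    parser.add_argument("--disambiguate-host-genome", default=None)
    parser.add_argument("--disambiguate-secondary-genome", default=None)
    return parser


def check_disambiguate_group(parser, args):
    """Exit with a usage error if only some of the options were given."""
    provided = [o for o in DISAMBIGUATE_OPTIONS if getattr(args, o) is not None]
    if provided and len(provided) != len(DISAMBIGUATE_OPTIONS):
        missing = [
            "--" + o.replace("_", "-")
            for o in DISAMBIGUATE_OPTIONS
            if o not in provided
        ]
        # parser.error prints the usage message and exits with status 2.
        parser.error("missing required disambiguate option(s): " + ", ".join(missing))
    return args
```

Typical usage would be `args = check_disambiguate_group(parser, parser.parse_args())`, so a partial set of options fails before any work starts.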
See below for a description of the --disambiguate-host-genome option.
Lastly, add an option to capture the host/primary reference genome: --disambiguate-host-genome. This option should take either an alias from an enum of pre-built genomes + indices (e.g. human (hg19 | hg38), mouse (mm10 | mm39), rhesus (rheMac8), bat, tick, etc.) or a genomic fasta file (infer by checking that it is a file and ends with one of the following extensions: '.fa' or '.fasta'). If a genomic fasta file is provided, an index will be built on the fly.
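The alias-or-fasta inference described above might look roughly like this. `PREBUILT_GENOMES` and `classify_host_genome` are hypothetical names. The alias set holds only the builds named explicitly in this issue; bat, tick, and the rest are left out because their build names are not given.

```python
import os

# Builds named in this issue; the real list of pre-built genomes may differ.
PREBUILT_GENOMES = {"hg19", "hg38", "mm10", "mm39", "rheMac8"}
FASTA_EXTENSIONS = (".fa", ".fasta")


def classify_host_genome(value):
    """Classify a --disambiguate-host-genome value.

    Returns "alias" for a known pre-built genome, or "fasta" for an
    existing file ending in .fa or .fasta (an index would then be
    built on the fly). Raises ValueError for anything else.
    """
    if value in PREBUILT_GENOMES:
        return "alias"
    if os.path.isfile(value) and value.endswith(FASTA_EXTENSIONS):
        return "fasta"
    raise ValueError(
        f"{value!r} is neither a known genome alias nor a .fa/.fasta file"
    )
```

The alias check runs first, so a stray local file that happens to be named like an alias cannot shadow a pre-built genome.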
Please note:
STAR also needs an annotation to accurately align against the transcriptome. With that being said, if --disambiguate-source is set to RNA, then we will need a way to capture the annotation for the host/primary and secondary genomes. As such, we will need two more options to capture this information:
--disambiguate-host-gtf: Annotation in GTF format of the host organism; can be used to override an alias's default
--disambiguate-secondary-gtf: Annotation in GTF format of the secondary/other organism
Updated: Ignore adding a way to capture the annotation above (strike-through text above); we shouldn't need to do that for our purposes.
Given two reference genomes, it would be awesome if we could examine the % composition of each organism and split the reads for each respective organism. This would allow a user to take those split reads and run them in any of our other pipelines depending on the project goal.
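As a rough illustration of the % composition idea, assume each read has already been given a label such as "host", "secondary", or "ambiguous" by an upstream step. Those labels and the `percent_composition` helper are hypothetical; neither the disambiguate tool's actual output nor fastq_screen's is modeled here.

```python
from collections import Counter


def percent_composition(assignments):
    """Return the percentage of reads carrying each label.

    `assignments` is an iterable of per-read labels. An empty input
    yields an empty dict rather than dividing by zero.
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}
```

The same labels could drive the read splitting: route each read to an output file named after its label.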
Options:
[x] Add disambiguate to pipeline
[ ] Run fastq_screen on both genomes, create bowtie2 indices and fastq_screen config file on the fly, split reads on tag