OpenOmics / weave

An awesome BCL demultiplexing and FastQ quality-control pipeline
https://openomics.github.io/weave/
MIT License

Add option to run disambiguate or separate reads from two organisms (i.e. host vs. virus or host vs. parasite) #25

Closed: skchronicles closed 9 months ago

skchronicles commented 1 year ago

Given two reference genomes, it would be awesome if we could examine the % composition of each organism and split the reads by organism. This would allow a user to take those split reads and run them through any of our other pipelines, depending on the project goal.

Options:

skchronicles commented 1 year ago

Let's add the disambiguate option. This means we will need to add an alignment step: bwa-mem if the source is DNA, or STAR if it is RNA. We will need a new option to capture the non-host genomic FASTA file, --disambiguate-secondary-genome, and a second, conditionally required option to specify whether the source is DNA or RNA (type: enum), --disambiguate-source.

If any --disambiguate-* option is provided, they all need to be provided: i.e. mutually inclusive group = --disambiguate-source, --disambiguate-host-genome, and --disambiguate-secondary-genome. This needs to be enforced.
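A minimal sketch of how the all-or-nothing rule could be enforced with Python's argparse. The option names come from this thread; the `weave-sketch` program name and the helper functions are hypothetical, and this is not weave's actual CLI code.

```python
import argparse

# Destination names argparse derives from the three --disambiguate-* flags.
DISAMBIGUATE_OPTS = (
    "disambiguate_source",
    "disambiguate_host_genome",
    "disambiguate_secondary_genome",
)


def build_parser():
    # "weave-sketch" is a placeholder program name for this example only.
    parser = argparse.ArgumentParser(prog="weave-sketch")
    parser.add_argument("--disambiguate-source", choices=["DNA", "RNA"], default=None)
    parser.add_argument("--disambiguate-host-genome", default=None)
    parser.add_argument("--disambiguate-secondary-genome", default=None)
    return parser


def missing_disambiguate_opts(args):
    """Return the options still missing when at least one of the group was given."""
    given = [opt for opt in DISAMBIGUATE_OPTS if getattr(args, opt) is not None]
    if not given:
        return []  # disambiguation not requested, nothing to enforce
    return [opt for opt in DISAMBIGUATE_OPTS if getattr(args, opt) is None]


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    missing = missing_disambiguate_opts(args)
    if missing:
        flags = ", ".join("--" + name.replace("_", "-") for name in missing)
        # parser.error() prints the usage message and exits with status 2.
        parser.error(
            "--disambiguate-* options must be provided together; missing: " + flags
        )
    return args
```

argparse has a built-in mutually exclusive group but nothing for a mutually inclusive one, so the check runs after parsing.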

See below for a description of the --disambiguate-host-genome option.

skchronicles commented 11 months ago

Lastly, add an option to capture the host/primary reference genome via: --disambiguate-host-genome. This option should take either an alias from an enum of pre-built genomes + indices (e.g. human (hg19 | hg38), mouse (mm10 | mm39), rhesus (rheMac8), bat, tick, etc.) OR a genomic FASTA file (infer by checking that it is a file and ends with one of the following extensions: '.fa' or '.fasta'). If a genomic FASTA file is provided, an index will be built on the fly.
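The alias-or-file inference described above could look roughly like this. The alias set only lists the builds named in this comment (the bat and tick aliases are not specified here, so they are left out), and `resolve_host_genome` is a hypothetical helper name, not an existing weave function.

```python
import os

# Assumption: only the builds explicitly named in this thread; the real
# enum of pre-built genomes is expected to be longer.
PREBUILT_GENOMES = {"hg19", "hg38", "mm10", "mm39", "rheMac8"}
FASTA_EXTENSIONS = (".fa", ".fasta")


def resolve_host_genome(value):
    """Classify a --disambiguate-host-genome value.

    Returns ("alias", name) for a pre-built genome, or ("fasta", abs_path)
    for an existing file ending in .fa/.fasta (its index would then be
    built on the fly). Raises ValueError for anything else.
    """
    if value in PREBUILT_GENOMES:
        return ("alias", value)
    if os.path.isfile(value) and value.endswith(FASTA_EXTENSIONS):
        return ("fasta", os.path.abspath(value))
    raise ValueError(
        f"{value!r} is neither a known genome alias nor an existing "
        f"FASTA file ending in {' or '.join(FASTA_EXTENSIONS)}"
    )
```

Checking the alias first means a stray local file named like a genome alias cannot shadow a pre-built reference.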

Please note:

STAR also needs an annotation to accurately align against the transcriptome. With that being said, if --disambiguate-source is set to RNA, then we will need a way to capture the annotation for the host/primary and secondary genomes. As such, we will need two more options to capture this information:

Updated: Ignore adding a way to capture the annotation above (strike-through text above); we shouldn't need to do that for our purposes.