broadinstitute / viral-ngs

Viral genomics analysis pipelines

create bam merger for unaligned reads #313

Open dpark01 opened 8 years ago

dpark01 commented 8 years ago

Create a script that merges BAM files of unaligned reads while also offering the option to rewrite commonly changed metadata, such as the sample name and library IDs. It would most often be invoked on things like depleted read BAM files, to merge data from multiple Illumina lanes. The command line arguments we expose should map cleanly onto an intuitive DNAnexus applet, which is what we are ultimately aiming for.

tomkinsc commented 8 years ago

We already have a wrapper around Picard's MergeSamFiles, so is this a matter of dumping the header, altering it, and calling samtools' reheader? Or do we need to change the read records as well?
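For reference, a minimal sketch of the "alter the header" step in that workflow. In the SAM format the sample name (`SM`) and library (`LB`) live only on `@RG` header lines, and reads point at a read group by its ID, so renaming samples or libraries without changing read group IDs can be done on the header text alone. The function name `rewrite_rg_fields` and the shape of the `replacements` mapping are hypothetical, chosen for illustration; nothing here is existing viral-ngs code.

```python
def rewrite_rg_fields(header_text, replacements):
    """Rewrite tag values on @RG lines of a SAM header.

    header_text:  header as text, e.g. the output of `samtools view -H in.bam`
    replacements: hypothetical mapping of tag -> {old_value: new_value},
                  e.g. {"SM": {"old_sample": "new_sample"}}
    Lines other than @RG, and values with no replacement, pass through unchanged.
    """
    out_lines = []
    for line in header_text.splitlines():
        if line.startswith("@RG"):
            fields = line.split("\t")
            new_fields = [fields[0]]
            for field in fields[1:]:
                # Each field is TAG:VALUE; split on the first colon only,
                # since values (such as dates) may contain colons themselves.
                tag, _, value = field.partition(":")
                value = replacements.get(tag, {}).get(value, value)
                new_fields.append(tag + ":" + value)
            line = "\t".join(new_fields)
        out_lines.append(line)
    return "\n".join(out_lines) + "\n"
```

The rewritten text could then be written to a file and applied with `samtools reheader new_header.sam in.bam > out.bam`. If read group IDs themselves had to change, the `RG` tags on the read records would need updating too, which this sketch does not attempt.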

dpark01 commented 8 years ago

So I'm thinking of some combination of read_utils.merge_bams and read_utils.reheader_bams. Doing it is not the hard part; figuring out the right kind of UI to expose is trickier. Note that the current reheader_bams command takes its complicated parameter list as an input flat file. I think I'd prefer to expose the same information as ordered lists of argparse parameters instead.
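One possible shape for that argparse interface, as a rough sketch only. The option names `--out-bam`, `--sample-name`, and `--library`, and the helper names, are hypothetical and not part of the existing read_utils commands. Each rename is given as a repeatable OLD NEW pair instead of a row in a flat file.

```python
import argparse


def build_parser():
    """Build a hypothetical CLI for merging BAMs and renaming header metadata."""
    parser = argparse.ArgumentParser(
        description="Merge unaligned BAMs and optionally rename samples/libraries.")
    parser.add_argument("in_bams", nargs="+", help="input BAM files to merge")
    parser.add_argument("--out-bam", required=True, help="merged output BAM")
    # Repeatable options; each occurrence contributes one [OLD, NEW] pair,
    # kept in the order given on the command line.
    parser.add_argument("--sample-name", nargs=2, action="append", default=[],
                        metavar=("OLD", "NEW"), help="rename a sample (SM)")
    parser.add_argument("--library", nargs=2, action="append", default=[],
                        metavar=("OLD", "NEW"), help="rename a library (LB)")
    return parser


def parse_renames(argv):
    """Parse argv and collect the rename pairs into {tag: {old: new}}."""
    args = build_parser().parse_args(argv)
    replacements = {
        "SM": dict(args.sample_name),
        "LB": dict(args.library),
    }
    return args, replacements
```

An invocation might look like `--out-bam merged.bam lane1.bam lane2.bam --sample-name s1 sampleA --library lib1 libA`. Flat repeated options like these should also translate fairly directly into array-style applet inputs, though that mapping is an assumption.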

Currently, in the Snakemake-based pipeline, this all happens automatically, but it requires the user to specify all of the barcode files and such way up front, before any computation even begins. If they make any changes or corrections, they have to start all over from the beginning (demux).

I want to move towards giving the user the ability to merge and reheader everything pre-assembly but post-depletion, so that the automated demux-deplete-metagenomics steps can pretty much always be used as-is and don't have to be re-executed if someone decides to change a sample naming convention or correct which runs had which library IDs. My thought is that users would run this new tool just prior to assembly, and that it would enable proper iSNV pipelines downstream of that.

Then we could potentially refactor how the flowcell/barcode files are used by snakemake rules, but at the very least, I'd like to expose a DNAnexus applet for it. I see that as the only missing piece between our current demux-deplete and the downstream assembly.