Open tomkinsc opened 8 years ago
Though it's likely this requires aligned BAMs in order to work.
It does require aligned BAMs. It also requires Rscript for generating the (required) HISTOGRAM_FILE. Do we want to require R as a dependency of viral-ngs?
BWA outputs a running summary of insert sizes during alignment which can be used as a poor man's sanity check.
If aligned BAMs are the input, collecting these stats is trivially easy; it's possible that samtools even has a quick way to dump that info. The R stuff would matter more if we want to visualize it a certain way, but since we already require matplotlib, it might be preferable to do the plotting ourselves.
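For what it's worth, here is a rough sketch of how the stats could be collected without R, assuming we read SAM text records (e.g. piped from `samtools view`) and use the TLEN column as the insert size. The function names are made up for illustration and are not existing viral-ngs code; only proper pairs are counted, and each pair is counted once via the mate with positive TLEN.

```python
import statistics
from collections import Counter

# SAM FLAG bits (from the SAM specification)
PROPER_PAIR = 0x2
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def insert_size_histogram(sam_lines):
    """Count TLEN values from SAM text records, one count per proper pair.

    Hypothetical helper: `sam_lines` is any iterable of SAM-formatted lines.
    """
    hist = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, tlen = int(fields[1]), int(fields[8])
        if not flag & PROPER_PAIR or flag & (SECONDARY | SUPPLEMENTARY):
            continue
        if tlen > 0:  # the leftmost mate carries the positive TLEN
            hist[tlen] += 1
    return hist


def summarize(hist):
    """Return basic summary stats for a histogram built above."""
    sizes = sorted(hist.elements())
    if not sizes:
        return {"pairs": 0, "median": None, "mean": None}
    return {
        "pairs": len(sizes),
        "median": statistics.median(sizes),
        "mean": statistics.mean(sizes),
    }
```

The resulting counter could be passed straight to matplotlib (for instance as the x and height arguments of a bar plot) if we want a picture in the report.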
All that aside, if aligned BAMs are the input, I'm not actually sure where to incorporate such a step into the viral-ngs pipeline. The most useful place to see this is up front on the original set of raw reads, but that's not really possible in a world where we assume viruses are so diverse that short read aligners to a common reference genome aren't going to work as a first pass.
@haydenm is playing around with some mate-assembly tools (i.e., see if the mates align to each other) to detect the presence of negative insert sizes in a more unbiased fashion (does not require alignment to a reference genome). My suggestion is to (perhaps optionally) add a filter-out-negative-insert-size step to our deplete/rmdup cleaning stage.
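To make the mate-overlap idea concrete, here is a minimal sketch. It is not any particular mate-assembly tool's algorithm, just a naive suffix/prefix comparison, and it assumes "negative insert size" means a negative inner distance (the mates overlap). The function names and thresholds are made up for illustration. It does not handle the read-through case where the fragment is shorter than a read.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def inner_distance_from_overlap(read1, read2, min_overlap=10, max_mismatch_frac=0.1):
    """Return a negative inner distance if the mates overlap, else None.

    Hypothetical helper: compares the 3' end of read1 against the 5' end of
    the reverse-complemented read2, trying the longest overlap first.
    """
    mate = revcomp(read2)
    longest = min(len(read1), len(mate))
    for ov in range(longest, min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(read1[-ov:], mate[:ov]))
        if mismatches <= max_mismatch_frac * ov:
            return -ov  # mates share `ov` bases, so the inner distance is -ov
    return None
```

A filter step could then drop (or flag) any pair for which this returns a value rather than None, with no reference genome involved.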
Similar to how we write a spikein report, we should consider adding a report on insert sizes via Picard's CollectInsertSizeMetrics.