broadinstitute / viral-ngs

Viral genomics analysis pipelines

Write report of insert sizes #473

Open tomkinsc opened 8 years ago

tomkinsc commented 8 years ago

Similar to how we write a spikein report, we should consider adding a report on insert sizes via Picard's CollectInsertSizeMetrics.

dpark01 commented 8 years ago

Though this likely requires aligned BAMs in order to work.

tomkinsc commented 8 years ago

It does require aligned BAMs. It also requires Rscript to generate the (required) HISTOGRAM_FILE. Do we want to require R as a dependency of viral-ngs?
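For reference, a minimal sketch of what the Picard call might look like when wrapped in Python. The function names, file names, and the `picard.jar` path are hypothetical placeholders, not existing viral-ngs code; the `INPUT=`, `OUTPUT=`, and `HISTOGRAM_FILE=` arguments use Picard's legacy `KEY=value` syntax.

```python
import subprocess


def picard_insert_size_cmd(aligned_bam, metrics_txt, histogram_pdf,
                           picard_jar="picard.jar"):
    """Build the argument list for Picard CollectInsertSizeMetrics.

    picard_jar is a hypothetical path; point it at a real Picard install.
    """
    return [
        "java", "-jar", picard_jar,
        "CollectInsertSizeMetrics",
        "INPUT=" + aligned_bam,             # must be an aligned BAM
        "OUTPUT=" + metrics_txt,            # tab-delimited metrics table
        "HISTOGRAM_FILE=" + histogram_pdf,  # required; rendered via Rscript
    ]


def run_picard_insert_size(aligned_bam, metrics_txt, histogram_pdf,
                           picard_jar="picard.jar"):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    subprocess.check_call(
        picard_insert_size_cmd(aligned_bam, metrics_txt, histogram_pdf,
                               picard_jar))
```

Keeping the command construction separate from the call makes it easy to log or unit-test the arguments without Java or R installed.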

yesimon commented 8 years ago

BWA outputs a running summary of insert sizes during alignment which can be used as a poor man's sanity check.
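A rough sketch of scraping that summary, assuming the aligner is `bwa mem` and that its stderr contains `[M::mem_pestat]` lines in the usual layout (an "analyzing insert size distribution for orientation ..." line followed by a "mean and std.dev: (x, y)" line). Older `bwa sampe` logs are formatted differently and would need another parser. The function name is made up for illustration.

```python
import re

# Orientation header, e.g. "... for orientation FR..."
_ORIENT_RE = re.compile(
    r"\[M::mem_pestat\] analyzing insert size distribution "
    r"for orientation (\w+)")
# Anchored right after the tag so the "low and high boundaries for
# computing mean and std.dev" line is not matched by mistake.
_MEAN_RE = re.compile(
    r"\[M::mem_pestat\] mean and std\.dev: \(([-\d.]+), ([-\d.]+)\)")


def parse_bwa_mem_insert_sizes(stderr_lines):
    """Return one (orientation, mean, stddev) tuple per reported estimate."""
    estimates = []
    orientation = None
    for line in stderr_lines:
        match = _ORIENT_RE.search(line)
        if match:
            orientation = match.group(1)
            continue
        match = _MEAN_RE.search(line)
        if match and orientation is not None:
            estimates.append(
                (orientation, float(match.group(1)), float(match.group(2))))
    return estimates
```

BWA re-estimates the distribution for each batch of reads, so a long run yields many tuples; comparing them across batches is itself a cheap consistency check.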

dpark01 commented 8 years ago

If aligned BAMs are the input, collecting these stats is trivially easy; it's possible that samtools even has a quick way to dump that info. R would only matter if we wanted to visualize it a certain way, but since we already require matplotlib, it might be preferable to do the plotting ourselves.
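To illustrate how little is needed without R, here is a sketch that reads SAM-format text lines (for example, the output of `samtools view` on an aligned BAM) and summarizes the TLEN column. The function names and the `max_size` cutoff are assumptions for illustration, not anything in viral-ngs. As far as I know, `samtools stats` also reports insert-size figures directly.

```python
import statistics


def insert_sizes_from_sam(sam_lines, max_size=10000):
    """Collect one insert size per properly paired fragment.

    max_size is an arbitrary cutoff (an assumption) to drop outliers.
    """
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):        # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:            # not a complete alignment record
            continue
        flag = int(fields[1])           # column 2: FLAG
        tlen = int(fields[8])           # column 9: TLEN
        # Require a proper pair (0x2); skip secondary (0x100) and
        # supplementary (0x800) alignments.
        if not (flag & 0x2) or (flag & 0x900):
            continue
        # The leftmost mate carries the positive TLEN, so keeping only
        # positive values counts each fragment once.
        if 0 < tlen <= max_size:
            sizes.append(tlen)
    return sizes


def summarize_insert_sizes(sizes):
    """Return basic summary statistics for a list of insert sizes."""
    if not sizes:
        return {"n": 0}
    return {
        "n": len(sizes),
        "median": statistics.median(sizes),
        "mean": statistics.mean(sizes),
        "stdev": statistics.pstdev(sizes),
    }
```

The returned list can go straight into matplotlib's `hist` for the plot, which would keep R out of the dependency list.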

All that aside, if aligned BAMs are the input, I'm not actually sure where to incorporate such a step into the viral-ngs pipeline. The most useful place to see this is up front, on the original set of raw reads, but that's not really possible in a world where we assume viruses are so diverse that aligning short reads to a common reference genome isn't going to work as a first pass.

@haydenm is playing around with some mate-assembly tools (i.e., seeing whether the mates align to each other) to detect the presence of negative insert sizes in a more unbiased fashion (one that does not require alignment to a reference genome). My suggestion is to (perhaps optionally) add a filter-out-negative-insert-size step to our deplete/rmdup cleaning stage.