broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.69k stars 588 forks source link

CollectReadCounts weighted average across multiple nearby bins (feature request) #5432

Closed fpbarthel closed 5 years ago

fpbarthel commented 5 years ago

Hi GATK team,

I'm interested in calling copy number from exome sequencing using larger (eg. 10kb) bins. Rather than treating each bin separate, I would be interesting in averaging read counts across multiple nearby bins within a specified window. Eg. average read counts from 10 adjacent exons (each separate bins) of a gene.

Would this be possible?

Thanks! Floris

samuelklee commented 5 years ago

Hi @fpbarthel,

You have a few options, but I think they would all require a bit of manual processing on your part. For example, you could use CollectReadCounts with your initial list of bins, but then you'd have to do the averaging step manually. Hopefully, it should not be too trouble to put together your own script to do this; however, since it is somewhat of a custom operation, it's unlikely we'll support it as a feature.

In any case, note that the rest of the downstream tools in both our germline and somatic pipelines are only set up to work with integer counts. I don't know your reasons for wanting to average counts over multiple bins, but since 1) averages contain less information than the full resolution bins, 2) the BAM I/O to perform count collection is expensive (so we don't want to discard information during the CollectReadCounts step that we might need later), and 3) we want to build generative models of the counts, our philosophy is to always retain the integer counts.

fpbarthel commented 5 years ago

@samuelklee thanks for your response. The reason I propose this is that I've noticed that with WGS increasing the --bin-length parameter (to eg 10kb vs the recommended 1kb) to PreprocessIntervals dramatically increases the stability and decreases the noise for lower quality input (eg. FFPE) or lower coverage (eg. lowpass whole genomes). However, for exome sequencing there doesn't seem to be such a feature. So if your input data is of lower quality (eg. FFPE) there is no way to average across larger bins. I suppose what you are suggesting will work and will give it a try, but I am wondering if there are any plans to improve the CNV calling for lower quality WXS samples?

samuelklee commented 5 years ago

@fpbarthel In that case, I would not average the bins---I'd simply take the total integer count across them, which you can then use with the rest of the pipeline and should decrease the noise as you describe. You can still use CollectReadCounts, but it will be up to you whether you want to 1) collect counts on manually merged bins, or 2) collect counts on the initial bins and manually sum the counts for the merged bins. Another option might be to use PreprocessIntervals to create 10kb bins across the genome and then manually intersect that with your list of exons (i.e., keeping only the bins that overlap with your exons) before collecting counts.

There are multiple ways one might define such merged bins, which makes it hard to come up with a single tool to cover every possibility. Fortunately, it should be relatively easy for users to put together their own custom script for merging bins. For these reasons, I think it makes sense to not focus too much CNV-team development effort on adding features like this---our philosophy is to leave such pre/postprocessing steps to the user, and focus on CNV-specific algorithms such as denoising, segmentation, etc. Hopefully that makes sense!

fpbarthel commented 5 years ago

Another option might be to use PreprocessIntervals to create 10kb bins across the genome and then manually intersect that with your list of exons (i.e., keeping only the bins that overlap with your exons) before collecting counts.

Thanks @samuelklee, that sounds like a pretty solid idea, to sum up read counts from targets in 10kb bins. Would it be necessary to apply some sort of weighting? Ie. I imagine that some 10kb bins may have only a single 100bp exon overlapping (ie in gene deserts) whereas another 10kb bin could have a dozen 500bp exons, and thus much higher read counts purely because a larger overlap with target regions from the capture experiment.

samuelklee commented 5 years ago

@fpbarthel The denoising methods in both the somatic and germline pipelines correct for bin-specific biases. So no weighting of the raw counts should be necessary.