etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org

Flat reference #307

Closed davisem closed 6 years ago

davisem commented 6 years ago

I used the flat reference option to try to validate some samples for which a normal was not present. It causes a genome-wide copy-loss effect across all samples.

I confirmed that all the values in the reference are indeed zero (including the antitargets), and all the raw segments have very negative (-25) log2 values. Is this the expected behavior? (Screenshot attached.)

etal commented 6 years ago

The -25 typically appears when you have a large number of genomic bins with no sequencing read coverage, e.g. if the wrong target BED file was used, or there was a QC issue in sequencing.

Try the --drop-low-coverage option in the segment command to drop the zero-coverage bins, then check whether that helps and how many bins were removed by that step. If it mostly helps, and you can see a discernible copy number profile in a scatter plot but the log2 ratios and segment means are still shifted overall, you can re-center the .cnr and/or .cns files with the command call -m none --center median or call -m none --center-at [the log2 value that looks right].
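To get a rough feel for how many bins sit at the extreme values, and for what median re-centering does to the rest, here is a minimal Python sketch. It assumes the .cnr file is tab-separated with a header row that contains a log2 column. The function names and the -15 cutoff are illustrative assumptions, not CNVkit's documented threshold, and this is not CNVkit's own implementation.

```python
import csv
import statistics


def read_log2(path):
    """Read the log2 column from a tab-separated file with a header row."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [float(row["log2"]) for row in reader]


def count_low_bins(log2_values, cutoff=-15.0):
    """Count bins whose log2 value falls below the cutoff (assumed value)."""
    return sum(1 for value in log2_values if value < cutoff)


def recenter_by_median(log2_values, cutoff=-15.0):
    """Drop bins below the cutoff, then shift the rest so their median is 0."""
    kept = [value for value in log2_values if value >= cutoff]
    if not kept:
        return []
    center = statistics.median(kept)
    return [value - center for value in kept]
```

Calling read_log2 on a .cnr file and passing the result to count_low_bins gives the number of bins a hard filter like this would remove. If that is a large fraction of all bins, the target BED file or the coverage step is the first thing to check.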

Was this WGS, exome, or target panel?

davisem commented 6 years ago

Thanks for the good tips 👍. The samples were exome, and there are no QC issues in the data. The data looks great when run with the normal references or a pooled reference. I said we didn't have the normals, but that's not entirely correct. We have normals for these samples, but here I was running them with the flat reference in hopes of validating the use of your tool for when no normals are present. Thoughts on this? You can inbox me if you want since this isn't a coding issue.

etal commented 6 years ago

OK, we can use e-mail to share files if necessary but I'm happy to continue the discussion here so this Q&A shows up in search results.

The flat reference generally works pretty well for target panels since the targets tend to be "easy" genes to sequence and the off-target bins are large enough to smooth out most irregularities. For exomes, with a flat reference you're losing most of the ability to filter out the unreliable targets, so even if those are a minority of the captured exons they can have a very visible effect on the output. The alternative is to rely on hard / ad-hoc filters, so in addition to the above:

Note that you generally can use the same pool of normals for all cases sequenced with the same protocol and panel, including unpaired tumor samples. Following up on the last point, if an unpaired tumor sample is sequenced with a process that's a little different and you're unsure whether the original pooled reference applies, you can create a custom reference using info from both the pooled and flat references:

davisem commented 6 years ago

This sounds great. You gave me a few very good ideas to follow up on. I think I'm going to put some effort into coming up with a good pooled reference. Indeed this method works very well with paired samples.

Problem is, you rarely get a "normal" specimen to go with an FFPE sample, because the pathologists just don't collect it. But I'm wondering if I can combine hundreds of FFPE samples into a pool and use that for the reference. Or I may be able to get some other data about FFPE samples that turned out to be copy-normal by some other method. Thanks for your help!

etal commented 6 years ago

Sure thing. At UCSF we use a pool of blood-draw normals as the reference for mostly FFPE tumor samples, and it does perform better than a flat reference, as does a "flattened" version of the pooled reference where log2 values were reset but the rest is kept.
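The "flattened" pooled reference mentioned above could be sketched in Python roughly like this. It assumes the pooled reference (.cnn) is a tab-separated table with a header row that includes a log2 column, and that "reset" means setting every log2 value to 0, as in a flat reference. The function name and default value are hypothetical, and this is not necessarily how the UCSF reference was built.

```python
import csv


def flatten_reference(pooled_path, out_path, neutral_log2=0.0):
    """Copy a tab-separated reference table, resetting only the log2 column.

    All other columns are written back unchanged.
    Returns the number of rows (bins) written.
    """
    with open(pooled_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        if reader.fieldnames is None or "log2" not in reader.fieldnames:
            raise ValueError("expected a header row with a 'log2' column")
        writer = csv.DictWriter(
            dst,
            fieldnames=reader.fieldnames,
            delimiter="\t",
            lineterminator="\n",
        )
        writer.writeheader()
        n_rows = 0
        for row in reader:
            # Assumption: a flat reference uses log2 = 0 for every bin.
            row["log2"] = str(neutral_log2)
            writer.writerow(row)
            n_rows += 1
    return n_rows
```

The point of keeping the other columns is that any per-bin information stored in the pooled reference stays available, while the expected log2 value itself no longer comes from the normals.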

davisem commented 6 years ago

Nice, I will try that as well. I've tried using the matched normals from blood draws (not FFPE), but it was very noisy. I will revisit this with the flattening and pooling. 👍