Flat reference - Githubissues

davisem commented 6 years ago

I used the flat reference option to try to validate some samples for which no normal was available. It causes a genome-wide copy-loss effect across all samples.

I confirmed that all the values in the reference are indeed zero (including the antitargets), and all the raw segments have very negative (-25) log2 values. Is this the expected behavior?

etal commented 6 years ago

The -25 typically appears when you have a large number of genomic bins with no sequencing read coverage, e.g. if the wrong target BED file was used, or there was a QC issue in sequencing.

Try the --drop-low-coverage option in the segment command to drop the zero-coverage bins, then check whether that helps and how many bins were removed by that step. If that mostly helps, and you can see a discernible copy number profile in a scatter plot but the log2 ratios and segment means are still shifted overall, then you can re-center the .cnr and/or .cns files with the command call -m none --center median or call -m none --center-at [the log2 value that looks right].
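To make the two steps above concrete, here is a minimal sketch of the idea in plain Python: discard bins whose log2 value is implausibly low, then shift the remaining values so their median sits at zero. The function name drop_and_center and the -15 cutoff are assumptions chosen for illustration, and CNVkit's own implementation of --drop-low-coverage and --center median may differ in its details.

```python
from statistics import median


def drop_and_center(log2_values, min_log2=-15.0):
    """Drop near-empty bins, then shift the rest so their median is 0.

    min_log2 is a hypothetical cutoff for this sketch, not a documented
    CNVkit default.
    """
    kept = [v for v in log2_values if v >= min_log2]
    if not kept:
        return []
    shift = median(kept)
    return [v - shift for v in kept]


# Toy log2 ratios: two zero-coverage bins near -25, the rest shifted downward.
ratios = [-25.0, -0.5, -0.4, -0.3, -25.0, -0.4, 0.6]
centered = drop_and_center(ratios)
print(len(ratios) - len(centered), "bins dropped")
print(centered)
```

Counting the dropped bins, as the print statement does, gives a quick sense of whether zero-coverage bins are a minority or a sign of a wrong target file.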

Was this WGS, exome, or target panel?

davisem commented 6 years ago

Thanks for the good tips 👍. The samples were exome, and there are no QC issues in the data. The data looks great when run with the normal references, or a pooled reference. I said we didn't have the normals, but that's not entirely correct. We have normals for these samples, but here I was running them with the flat reference in hopes of validating the use of your tool for when no normals are present. Thoughts on this? You can inbox me if you want since this isn't a coding issue.

etal commented 6 years ago

OK, we can use e-mail to share files if necessary but I'm happy to continue the discussion here so this Q&A shows up in search results.

The flat reference generally works pretty well for target panels since the targets are generally "easy" genes to sequence and the off-target bins are large enough to smooth out most irregularities. For exomes, with a flat reference you're losing most of the ability to filter out the unreliable targets, so even if those are a minority of the captured exons they can have a very visible effect on the output. The alternative is to rely on hard / ad-hoc filters, so in addition to the above:

- Check your bin sizes versus the results of the autobin command, and also try running the targetcoverage.cnn and antitargetcoverage.cnn files through metrics to see if they have a similar level of noise -- if not, adjust bin sizes accordingly to ensure antitargets pull their own weight.
- In segment, if you have the resources, try a few different values of --drop-outliers and compare the results with a known-good segmentation, e.g. from array CGH. The tradeoff is between sensitivity to small alterations and vulnerability to outliers.
- You can use access -x to exclude more off-target regions that have unreliable coverage -- this can be based on any prior CNVkit results that you have.
- Apply your own hard filters to the sample .cnn and/or .cnr files before segmentation, maybe using a minimum value in the depth column for on-target bins (data[(data['gene'] != 'Antitarget') & (data['depth'] >= 5)]). This is similar to what --drop-low-coverage does during segmentation, and prevents detecting homozygous deletions in germline samples, but if you're doing germline testing you really ought to have a large pool of normals to begin with.
- Check out the new HMM segmentation methods in the development version of CNVkit -- the defaults are not well tuned, but the overall approach should be more resistant to noise and extreme values.
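The hard depth filter suggested in the list above could be sketched with pandas roughly as follows. This is one reading of the suggestion, not CNVkit code: the function name filter_low_depth_targets, the toy gene names, and the choice to keep every off-target bin while thresholding only the on-target ones are assumptions made for this example. The 'Antitarget' label and the depth cutoff of 5 come from the expression quoted in the thread.

```python
import pandas as pd


def filter_low_depth_targets(data, min_depth=5.0, antitarget_label="Antitarget"):
    """Keep all antitarget bins, and on-target bins with depth >= min_depth."""
    is_antitarget = data["gene"] == antitarget_label
    keep = is_antitarget | (data["depth"] >= min_depth)
    return data[keep].reset_index(drop=True)


# Toy table with the columns this sketch relies on (values are made up).
bins = pd.DataFrame({
    "chromosome": ["chr1", "chr1", "chr1", "chr1"],
    "start": [100, 500, 1000, 5000],
    "end": [300, 700, 1200, 9000],
    "gene": ["GENE_A", "GENE_A", "GENE_B", "Antitarget"],
    "depth": [42.0, 1.5, 5.0, 0.8],
    "log2": [0.1, -4.2, -0.2, 0.0],
})

filtered = filter_low_depth_targets(bins)
print(filtered)
```

On a real sample, the table would come from a tab-separated .cnn or .cnr file, for example via pd.read_csv(path, sep="\t"), and be written back with to_csv(path, sep="\t", index=False) before segmentation.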

Note that you generally can use the same pool of normals for all cases sequenced with the same protocol and panel, including unpaired tumor samples. Following up on the last point, if an unpaired tumor sample is sequenced with a process that's a little different and you're unsure whether the original pooled reference applies, you can create a custom reference using info from both the pooled and flat references:

- Delete bins in the flat reference where the pooled reference shows spread above a threshold (1.0 by default) or log2 outside a range (+/- 5 by default).
- Use the weights column from the pooled reference, while keeping the log2 values from the flat reference (all 0 except for Y=-1.0 by default). You can do both of these at once by just editing the pooled reference to reset the log2 column to the default "flat" values.
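The two steps above amount to a small table edit on the pooled reference. The sketch below shows one way to do it with pandas, under some assumptions: the reference is a tab-separated table with chromosome, log2 and spread columns, the Y chromosome is named either "Y" or "chrY", and the function name flatten_reference is invented here. The thresholds (spread 1.0, log2 within +/- 5, Y set to -1.0) are the defaults quoted in the thread.

```python
import pandas as pd


def flatten_reference(pooled, max_spread=1.0, max_abs_log2=5.0, y_log2=-1.0):
    """Drop unreliable bins using the pooled values, then reset log2 to 'flat'."""
    # Judge reliability on the pooled log2 and spread BEFORE overwriting log2.
    reliable = (pooled["spread"] <= max_spread) & (pooled["log2"].abs() <= max_abs_log2)
    flat = pooled[reliable].copy()
    # Flat values: 0 everywhere except the Y chromosome.
    on_y = flat["chromosome"].isin(["Y", "chrY"])
    flat["log2"] = 0.0
    flat.loc[on_y, "log2"] = y_log2
    return flat.reset_index(drop=True)


# Toy pooled reference (values are made up).
pooled = pd.DataFrame({
    "chromosome": ["chr1", "chr1", "chr2", "chrX", "chrY"],
    "start": [100, 500, 100, 100, 100],
    "end": [300, 700, 300, 300, 300],
    "gene": ["GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "log2": [0.2, -6.0, 0.4, 0.1, -0.9],
    "spread": [0.1, 0.3, 1.6, 0.2, 0.2],
})

flat = flatten_reference(pooled)
print(flat)
```

All other columns, including spread, are carried over unchanged, so the per-bin reliability information from the pool is still available downstream while the expected log2 values are the flat ones.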

davisem commented 6 years ago

This sounds great. You gave me a few very good ideas to follow up on. I think I'm going to put some effort into coming up with a good pooled sample. Indeed this method works very well with paired samples.

Problem is, you almost never get a "normal" specimen to go with an FFPE sample, because the pathologists just don't collect it. But I'm wondering if I can combine hundreds of FFPE samples into a pool and use that for the reference. Or I may be able to get some other data about FFPE samples that turned out to be copy-normal by some other method. Thanks for your help!

etal commented 6 years ago

Sure thing. At UCSF we use a pool of blood-draw normals as the reference for mostly FFPE tumor samples, and it does perform better than a flat reference, as does a "flattened" version of the pooled reference where log2 values were reset but the rest is kept.

davisem commented 6 years ago

Nice, I will try that as well. I've tried using the matched normals from blood draws (not FFPE), but it was very noisy. I will revisit this with the flattening and pooling. 👍

etal / cnvkit

Flat reference #307