Closed: davisem closed this issue 6 years ago
A log2 value of -25 typically appears when you have a large number of genomic bins with no sequencing read coverage, e.g. if the wrong target BED file was used or there was a QC issue in sequencing.
Try the `--drop-low-coverage` option in the `segment` command to drop the zero-coverage bins, and see if that helps and how many bins were removed by that step. If that mostly helps, and you can see a discernible copy number profile in a scatter plot but the log2 ratios and segment means are still shifted overall, then you can re-center the .cnr and/or .cns files with the command `call -m none --center median` or `call -m none --center-at [the log2 value that looks right]`.
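For illustration, the re-centering step amounts to a median shift of the log2 ratios. A minimal pandas sketch on a made-up table (real .cnr files are tab-separated with more columns, and `call -m none --center median` does this for you):

```python
import pandas as pd

# Toy stand-in for a .cnr table -- values are made up for illustration,
# showing log2 ratios that are shifted below 0 genome-wide.
cnr = pd.DataFrame({
    "chromosome": ["chr1", "chr1", "chr2", "chr2"],
    "gene": ["A", "B", "C", "D"],
    "log2": [-0.9, -1.1, -0.8, -1.2],
})

# Same idea as `call -m none --center median`: subtract the median log2
# ratio so the bulk of the genome sits at 0.
cnr["log2"] -= cnr["log2"].median()
```

`--center-at` corresponds to subtracting a chosen constant instead of the median.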
Was this WGS, exome, or target panel?
Thanks for the good tips 👍. The samples were exome, and there are no QC issues in the data. The data looks great when run with the normal references, or a pooled reference. I said we didn't have the normals, but that's not entirely correct. We have normals for these samples, but here I was running them with the flat reference in hopes of validating the use of your tool for when no normals are present. Thoughts on this? You can inbox me if you want since this isn't a coding issue.
OK, we can use e-mail to share files if necessary but I'm happy to continue the discussion here so this Q&A shows up in search results.
The flat reference generally works pretty well for target panels, since the targets are generally "easy" genes to sequence and the off-target bins are large enough to smooth out most irregularities. For exomes, with a flat reference you're losing most of the ability to filter out the unreliable targets, so even if those are a minority of the captured exons they can have a very visible effect on the output. The alternative is to rely on hard / ad-hoc filters, so in addition to the above:

- Use the `autobin` command, and also try running the targetcoverage.cnn and antitargetcoverage.cnn files through `metrics` to see if they have a similar level of noise -- if not, adjust bin sizes accordingly to ensure antitargets pull their own weight.
- In `segment`, if you have the resources, try a few different values of `--drop-outliers` and compare the results with a known-good segmentation, e.g. from array CGH. The tradeoff is between sensitivity to small alterations and vulnerability to outliers.
- Use `access -x` to exclude more off-target regions that have unreliable coverage -- this can be based on any prior CNVkit results that you have.
- Filter on the `depth` column for on-target bins (`data[(data['gene'] != 'Antitarget') & (data['depth'] >= 5)]`). This is similar to what `--drop-low-coverage` does during segmentation, and prevents detecting homozygous deletions in germline samples -- but if you're doing germline testing you really ought to have a large pool of normals to begin with.

Note that you generally can use the same pool of normals for all cases sequenced with the same protocol and panel, including unpaired tumor samples. Following up on the last point, if an unpaired tumor sample is sequenced with a process that's a little different and you're unsure whether the original pooled reference applies, you can create a custom reference using info from both the pooled and flat references:

- Drop bins with `spread` above a threshold (1.0 by default) or `log2` outside a range (+/- 5 by default).
- Take the `weights` column from the pooled reference, while keeping the `log2` values from the flat reference (all 0 except for Y=-1.0 by default).

You can do both of these at once by just editing the pooled reference to reset the `log2` column to the default "flat" values.

This sounds great. You gave me a few very good ideas to follow up on. I think I'm going to put some effort into coming up with a good pooled sample. Indeed this method works very well with paired samples.
Problem is, you usually never get a "normal" specimen to go with an FFPE, because the pathologists just don't collect it. But I'm wondering if I can combine 100s of FFPE samples into a pool and use that for the reference. Or I may be able to get some other data about FFPEs that turned out to be copy-normal by some other method. Thanks for your help!
Sure thing. At UCSF we use a pool of blood-draw normals as the reference for mostly FFPE tumor samples, and it does perform better than a flat reference, as does a "flattened" version of the pooled reference where log2 values were reset but the rest is kept.
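The "flattened" pooled reference described here could be sketched with a few lines of pandas -- this is not CNVkit's own code, and the toy table and male-reference chrY default are assumptions to adapt:

```python
import pandas as pd

# Toy stand-in for a pooled reference .cnn table (real files are tab-separated
# and carry more columns, e.g. depth, gc, spread).
ref = pd.DataFrame({
    "chromosome": ["chr1", "chr2", "chrX", "chrY"],
    "gene": ["A", "B", "C", "D"],
    "log2": [0.05, -0.12, -0.98, -1.40],
    "spread": [0.2, 0.9, 0.3, 0.6],
})

# Keep the bin set and per-bin variability info, but reset log2 to the
# flat-reference defaults: 0 everywhere, -1.0 on chrY (male reference).
ref["log2"] = 0.0
ref.loc[ref["chromosome"] == "chrY", "log2"] = -1.0
```

Writing the edited table back out as a tab-separated file yields a reference that is "flat" in log2 but still reflects the pooled samples' bin selection and variability.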
Nice, I will try that as well. I've tried using the matched normals from blood draws (not FFPE), but the result was very noisy. I will revisit this with the flattening and pooling. 👍
I used the flat reference option to try to validate some samples for which a normal was not present. It causes an apparent genome-wide copy loss effect across all samples.
I confirmed that all the values in the reference are indeed zero (including the antitargets), and all the raw segments have very negative (-25) log2 values. Is this the expected behavior?