etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org

Depth based confidence intervals around copy number estimates #28

Closed mjafin closed 9 years ago

mjafin commented 9 years ago

Hi @etal, I wanted to pick your brain on how to implement some kind of "confidence intervals" around the copy number estimates. Ideally there could be an interval around the point estimate that would reflect the depth of sequencing and the smoothness of the copy number signal around the interval (e.g. MAD, http://en.wikipedia.org/wiki/Median_absolute_deviation, with a provision for depth).

For example, a log2 copy number of 1 should surely be more meaningful in 20x coverage data than in 1x coverage data, right?
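
A minimal sketch of the kind of interval I mean, assuming NumPy; the function name and the square-root-of-depth scaling are my own invention here, not anything in CNVkit:

import numpy as np

def depth_scaled_mad_interval(log2_vals, depths, z=1.96):
    # Robust center and spread of the bin-level log2 values.
    center = np.median(log2_vals)
    mad = 1.4826 * np.median(np.abs(log2_vals - center))  # ~stdev for normal data
    # Crude "provision for depth": shrink the spread by the square root
    # of the mean depth, by analogy with a standard error.
    half_width = z * mad / np.sqrt(np.mean(depths))
    return center - half_width, center + half_width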

I'm more than happy to implement something and help but wanted to hear your thoughts first.

etal commented 9 years ago

Hi Miika, here are a few thoughts on what can be done with the current code base and the cnvlib API:

Modelling the reliability or confidence interval of each segment based on the underlying read depths, as you described, is really more a feature of the segmentation algorithm. For example, the recently published CODEX introduces a segmentation method based on raw read counts and their expected distribution, not entirely unlike cn.MOPS.

CNVkit has a modular approach to segmentation: just implement a wrapper for a method in cnvlib/segmentation/__init__.py and call it with the segment command. A wrapper for, or reimplementation of, CODEX's segmentation procedure would be nice.
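
To illustrate the dispatch pattern, here's a toy sketch; the registry, the decorator, and the "naive" method are all hypothetical, not the actual cnvlib code:

SEGMENTATION_METHODS = {}

def register(name):
    # Decorator that adds a segmentation wrapper to the dispatch table,
    # so a segment-style command can look it up by method name.
    def deco(func):
        SEGMENTATION_METHODS[name] = func
        return func
    return deco

@register("naive")
def segment_naive(log2_values, threshold=0.3):
    # Toy wrapper: start a new segment wherever adjacent bins jump.
    segments, start = [], 0
    for i in range(1, len(log2_values)):
        if abs(log2_values[i] - log2_values[i - 1]) > threshold:
            segments.append((start, i))
            start = i
    segments.append((start, len(log2_values)))
    return segments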

After the next point release I'm going to change the coverage command to not median-center the (target|antitarget)coverage.cnn files before writing, so that the actual read counts are still available.

I'm also working on replacing the internal representation of genomic intervals to use pandas instead of NumPy structured arrays, so that .cnr and .cns files can more easily store arbitrary additional columns -- more like Bioconductor's GenomicRanges package. (If you know of an existing Python package that provides pandas-based GenomicRanges, please let me know!) This would provide a place to store the segment-level metadata other segmentation algorithms emit, if any.

If you'd like to work on something more exploratory I'd be happy help with that, too -- just e-mail me.

mjafin commented 9 years ago

@etal Thanks for the very comprehensive reply, much appreciated. I'll take a look at the weight column to begin with. Bootstrapping might give us more accurate intervals, but I'd hate to have to go to resampling if it implies much longer computation time.

I hadn't seen the CODEX paper before; I'll give it a read. I'm not an expert in Python numerical packages, but I'll keep an eye out for any that may reproduce GenomicRanges-like functionality.

mjafin commented 9 years ago

Just a quick update: I've been downsampling a publicly available cell line (pair) with a known amplification to observe what happens to the estimated copy number. When going from an average coverage of 5x to 0.5x, the log2 copy number stays roughly the same (nice). I did a very crude estimate of the lower bound of a 95% confidence interval (if you can call it that) based on the 2.5th percentile of all the log2 values in the .cnr file for the region. Given that the individual intervals are roughly the same width for all the values, I didn't do any weighting of the values.

At 5x the lower bound is roughly 0.8 log2 units below the copy number, and at 0.5x it's roughly 2 log2 units below. It's nice to see the numbers support the expectation that lower coverage means more variability.

mjafin commented 9 years ago

@etal I've been toying around with interval estimation at https://github.com/etal/cnvkit/blob/master/cnvlib/segmentation/__init__.py#L148

The following code would estimate a prediction/confidence interval from the distribution of the bin means:

# For each segment, store the 2.5%/97.5% quantiles of its bin-level
# log2 values (fit$data$y) as a new "quants" column in the PSCBS output.
fit$output$quants <- rep(NA, length(fit$output$mean))
for (ind in seq_along(fit$output$quants)) {
    if (!is.na(fit$segRows$startRow[ind])) {
        seg.vals <- fit$data$y[fit$segRows$startRow[ind]:fit$segRows$endRow[ind]]
        fit$output$quants[ind] <- paste(quantile(seg.vals, c(0.025, 0.975)), collapse = ";")
    }
}

However, I'm not sure how best to handle this at https://github.com/etal/cnvkit/blob/master/cnvlib/segmentation/__init__.py#L47 to get the interval printed out in the .cns file. Do you have any recommendations or suggestions? I've tried a few things to inject the interval into the output, but I get cryptic errors from numpy.

etal commented 9 years ago

It looks like you're formatting the confidence interval as a string, so you could put it in the "gene" column of the .cns file, which seems to be what you were aiming for given the line you selected.

The R code you pasted could be replicated in Python using numpy.percentile and an otherwise similar loop. If you feel comfortable trying that, I'd be happy to take a pull request; otherwise let me know and I'll give it a shot.
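
Something like the following, as a rough sketch; the function and argument names are illustrative, and it assumes the bin-level values and per-segment row indices have already been extracted from the fit:

import numpy as np

def segment_quantile_strings(y, start_rows, end_rows, alpha=0.05):
    # For each segment, format the 2.5%/97.5% percentiles of its
    # bin-level log2 values as "lo;hi", mirroring the R loop above.
    quants = []
    for start, end in zip(start_rows, end_rows):
        if start is None:  # stand-in for R's NA startRow
            quants.append(None)
            continue
        lo, hi = np.percentile(y[start:end + 1],  # inclusive, as in R
                               [100 * alpha / 2, 100 * (1 - alpha / 2)])
        quants.append("%g;%g" % (lo, hi))
    return quants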

Incidentally, or maybe more importantly, is the 2.5-97.5-percentile range in the observed data really the 95% confidence interval as we usually think of it?

mjafin commented 9 years ago

Thanks @etal, I'll look into piggybacking the interval somehow in the gene column and will prepare a pull request.

I was thinking of using the Python percentile function too, but not all the information required is returned from the R function, so rather than bloating the table R returns, I'd say it's easier to carry out the calculations in R.

I've been thinking about the philosophical side of the percentile range too and what it means. On one hand, I think what I'm proposing is a sort of resampling based confidence interval for a (one) bin, as it's based on observing the distribution of the (log2) bin means from the same segment. The bin means are in a way repeated measurements of the same thing (within segment). On the other hand, it's also a prediction interval of the segment total mean.

etal commented 9 years ago

Which information is missing from the table returned by the R function? Is it the mapping of probes to segments? If so, it would make sense to add that. I can do so.

I just used a script to discover:

  1. The bootstrap approach is not as slow as I thought; it's faster than segmentation itself on exome samples. So this could be done on the fly at the end of segmentation.
  2. The segment mean emitted by CBS ignores the bin weights! To fix this I'll need to recalculate each segment mean from its underlying bins anyway, so I might as well do the bootstrap there while I have them (a sketch of what I mean is below).
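
A rough sketch of the weighted recalculation plus bootstrap, assuming the bin-level log2 values and weights for a single segment are in hand (the helper name is mine, not CNVkit's):

import numpy as np

def weighted_mean_and_bootstrap_ci(log2_vals, weights, n_boot=1000, alpha=0.05):
    log2_vals = np.asarray(log2_vals, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Weighted segment mean, unlike the unweighted value CBS emits.
    mean = np.average(log2_vals, weights=weights)
    # Bootstrap: resample bins with replacement, reweighting each resample.
    rng = np.random.default_rng(0)
    n = len(log2_vals)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        boot[b] = np.average(log2_vals[idx], weights=weights[idx])
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return mean, (lo, hi)
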
mjafin commented 9 years ago

I noticed yesterday too that the mean reported by CBS is literally just the unweighted mean of the values, ignoring the weights and the x-axis (I wasn't sure whether that was intentional)!

I had a go using your script; here's an example output table:

chromosome  start     end       gene            log2      probes
chr17       133       22239009  CI=-5.88_-5.74  -6.26166  28189
chr17       25280863  37640219  CI=-3.51_-3.32  -3.00418  19698
chr17       37640219  38137566  CI=5.93_6.06    5.96198   1152
chr17       38137566  81191601  CI=-3.26_-3.16  -3.01774  63151

Here are the results of my approach (ignoring weights and trusting the CBS means):

chromosome  start     end       gene                      log2    probes
chr17       133       22239009  G;CI=-3.298384_4.077177   0.373   28189
chr17       25280863  37640219  G;CI=-3.0434715_4.139435  0.5993  19698
chr17       37640219  38137566  G;CI=3.944334_8.784433    5.9891  1152
chr17       38137566  81191601  G;CI=-3.00873_4.322255    0.6752  63151

There are a few issues that would be good to discuss:

  1. The means are wildly different between the weighted and unweighted estimates! Funnily enough, for the most part (3 out of 4) the estimated intervals do not contain the non-resampled mean values (though resampling doesn't guarantee that they will). When you do your bootstrapping, it might be necessary to extend the weighting to the bootstrap means too, which might bring the intervals closer to the means.
  2. Looking into the R CBS structure versus the Python construction, the data vector lengths in R are 28188, 19697, 1151 and 63150, whereas in Python they are 41444, 24795, 1152 and 78806. This might explain why the means are so different, besides the weighting. How does the Python function pull the data points for the segments? It seems to grab more than R does.
  3. The interval you propose is for the mean of means (like a nested model) and is very, very narrow (it should get narrower and narrower for longer segments). The approach I proposed estimates the confidence interval for the bin mean. Since this is already post-segmentation, the bins should be IID. I know this is a bit hand-wavy.
etal commented 9 years ago
  1. I just passed this issue upstream to Henrik. I agree that if segment means are weighted, then resampled means should be too. A less statistically efficient alternative would be to use the median instead of the weighted average for both the reported and resampled segment values. And I suppose that option should be exposed on the command line, too.
  2. For now, at least, it might be best to do the recalculation in R, using the fit dataframes, instead of Python. CNVkit's by_segment method is not fancy; I think the segmentation algorithm is silently masking out some bins.
  3. Meaning, each bin is already the average of per-base read depths, so the segment mean is the mean of means? I think the bootstrap calculation is supposed to be the same as the original segment mean calculation, but with resampled data points each time, to estimate the "confidence interval for the segment mean" as it's usually understood. The CI size should get smaller with thousands of data points. The 2.5-97.5 percentile range is the prediction interval, which is also useful but means something a bit different (a toy illustration of the contrast is below).
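
To make the distinction concrete (my own illustration, not CNVkit code): the bootstrap CI of the segment mean tightens as the number of bins grows, while the 2.5-97.5 percentile range of the bins themselves stays roughly constant:

import numpy as np

rng = np.random.default_rng(0)
for n_bins in (100, 10000):
    bins = rng.normal(loc=0.5, scale=0.4, size=n_bins)  # simulated bin log2 values
    # Bootstrap CI for the segment mean: narrows with more bins.
    boot = [rng.choice(bins, size=n_bins).mean() for _ in range(1000)]
    ci = np.percentile(boot, [2.5, 97.5])
    # Percentile range of the bins: a prediction interval, roughly constant.
    pi = np.percentile(bins, [2.5, 97.5])
    print(n_bins, ci, pi)
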
mjafin commented 9 years ago
  1. Wonderful, thanks!
  2. Yes, it seems so. Let's wait to hear back whether masking out some bins is a bug or a feature.
  3. I guess I'm looking at the utility of the interval, whatever we call it. With the segment mean here being a mean of means (and thus very smooth), I'm not sure its confidence interval is very interesting. If we take any long stretch of the chromosome, no matter how close its segment value is to zero, it's bound to have a confidence interval that doesn't contain zero. The percentile range, if I'm not mistaken, is the prediction interval of the segment mean but also the confidence interval estimate of the bin means (which should be ~identically distributed within a segment). I think it better reflects depth of sequencing, as I've seen it grow and shrink appropriately when using different average depths for the same sample.
etal commented 9 years ago

I see now. I think it might be wise to add a segmetrics command that calculates stats similar to the metrics command, but per segment. The MAD, IQR, etc. might also be useful to report.
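
Per segment, the spread stats could look roughly like this (an illustrative sketch only, not the actual segmetrics implementation):

import numpy as np

def spread_stats(log2_vals):
    x = np.asarray(log2_vals, dtype=float)
    med = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    return {
        "stdev": x.std(),
        "mad": np.median(np.abs(x - med)),  # median absolute deviation
        "iqr": q3 - q1,                     # interquartile range
    }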

etal commented 9 years ago

About those missing probes: CNVkit's R script that runs PSCBS filters out probes with log2 value < -20. These are usually very-low-coverage targets near telomeres and centromeres, and are more likely to appear in long segments that cover a whole chromosome arm. I'll confirm manually. If that's it, then the same filter should be applied when recalculating segment means.
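
If that's the cause, the recalculation-side filter amounts to something like this (toy values, NumPy assumed; the -20 cutoff is the one from the R script):

import numpy as np

log2_vals = np.array([-25.0, -3.1, 0.2, -21.5, 0.4])  # toy bin-level values
kept = log2_vals[log2_vals >= -20]  # drop the near-zero-coverage bins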

mjafin commented 9 years ago

Nice debugging! I agree with you on segmetrics; it would be excellent. I'm happy to help implement, test and debug any new features.

HenrikBengtsson commented 9 years ago

Dropping by to say that the fact that the mean levels returned by PSCBS::segmentByCBS() are the non-weighted mean levels when weighted segmentation is used is indeed a bug. I've created PSCBS Issue #18 for this. Thanks for spotting it.

etal commented 9 years ago

I've added a segmetrics command with CI and PI options; care to take a look, @mjafin?

mjafin commented 9 years ago

Fantastic, thanks @etal, I'll test these in the next few days!

mjafin commented 9 years ago

There might be a small oversight somewhere, as my .cns file doesn't have segments in all chromosomes and produces the following error:

[klrl262@chara: /ngs/oncology/analysis/external/CCLE/HCC1954/downsampling_exercise/genereadV2_GTL16_EBC1/work/structural ]$ cnvkit.py segmetrics -s GTL16/cnvkit/raw/GTL16-ready.cns -o temp.txt --stdev --mad --iqr --bivar --ci --pi GTL16/cnvkit/raw/GTL16-ready.targetcoverage.cnn
Traceback (most recent call last):
  File "/home/klrl262/miniconda/envs/bcbio-test/bin/cnvkit.py", line 4, in <module>
    __import__('pkg_resources').run_script('CNVkit==0.5.2', 'cnvkit.py')
  File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/setuptools-14.3-py2.7.egg/pkg_resources/__init__.py", line 698, in run_script
  File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/setuptools-14.3-py2.7.egg/pkg_resources/__init__.py", line 1623, in run_script
  File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/CNVkit-0.5.2-py2.7.egg/EGG-INFO/scripts/cnvkit.py", line 9, in <module>

  File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 1333, in _cmd_segmetrics

  File "build/bdist.linux-x86_64/egg/cnvlib/cnarray.py", line 313, in by_segment
ValueError: Segments are missing chromosome 'chr16'

Should the positional arguments include both the cnn and cnr file or just one (and not the other)? Does order matter?

etal commented 9 years ago

The command line you used was right. Only one .cnr or .cnn file can be used at a time, and then the -s segments option is required.

I fixed a bug where it would crash on .cnn files because of the missing "weight" column. So the .cnr and .cnn files that were used to create the .cns file should all work now.

The missing chromosome bug is going to be fixed when the pandastyle branch lands; the pandas.DataFrame backend makes this easier to implement. For now, you could either manually add a "dummy" line to the .cns file, covering the full length of the missing chromosome and with neutral copy number (log2=0), or else drop the offending rows from the other .cnn/.cnr file. (Sorry.)
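
The dummy line could be appended like this (a sketch; the chr16 length here is hg19's and purely illustrative, and the column layout follows the tables above):

# Append a neutral segment (log2=0) spanning the missing chromosome.
with open("GTL16-ready.cns", "a") as handle:
    handle.write("chr16\t1\t90354753\tdummy\t0.0\t1\n")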

Did CNVkit generate this .cns file with missing chromosomes? Any idea how that happened, or do you think it's a bug?

mjafin commented 9 years ago

Thanks Eric! Does it make any difference which one to use, the .cnr or the .cnn?

The data is from a small targeted panel, and it looks like, even though chr16 has a bit of data, nothing was detected as having a copy number change, and thus (I believe) there are no entries in the .cns file for chr16 (or chr18, chr21 and chrY).

mjafin commented 9 years ago

Upon further testing (different data) I got this error too:

Traceback (most recent call last):
  File "/home/klrl262/miniconda/envs/bcbio-test/bin/cnvkit.py", line 4, in <module>
    __import__('pkg_resources').run_script('CNVkit==0.6.0.dev0', 'cnvkit.py')
  File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/setuptools-14.3-py2.7.egg/pkg_resources/__init__.py", line 698, in run_script
  File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/setuptools-14.3-py2.7.egg/pkg_resources/__init__.py", line 1623, in run_script
  File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/CNVkit-0.6.0.dev0-py2.7.egg/EGG-INFO/scripts/cnvkit.py", line 9, in <module>

  File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 1357, in _cmd_segmetrics

  File "build/bdist.linux-x86_64/egg/cnvlib/cnarray.py", line 109, in __setitem__
ValueError: cannot copy sequence with size 11 to array axis with dimension 12

Do you know what this could be about at all?

etal commented 9 years ago
  1. Using .cnn vs. .cnr asks and answers slightly different questions. The .cnr file has the bias-corrected, normalized bin-level log2 ratios that were used to infer the segments and segment means, so it's the appropriate input for the CI and PI stats; the CI and PI from a .cnn file may not contain the segment mean. The .cnn file has the raw per-target coverage depths, simply log2-scaled, so it could reasonably be used with MAD, IQR and the other "spread" metrics to measure how noisy the original sources are; the same metrics could be run on the .cnr file and compared, to see how much difference the bias corrections make. If your capture method is hybrid selection, you have separate target and antitarget .cnn files, whereas the .cnr file combines the two. If you did targeted amplicon capture, there are no antitargets, so the distinction between .cnn and .cnr is more subtle.
  2. Could you post or send me the .cnr file you used where some chromosomes were dropped from the .cns file? I think it's a bug in segmentation that I can fix, but I haven't been able to replicate it yet.
  3. Looks like a chromosome was dropped internally in the by_segment() method. Can you send me example files to replicate it?
mjafin commented 9 years ago

Thanks for the clarification again! I've emailed you (first.last@ucsf.edu?) the two cases; let me know if you don't get my email.

etal commented 9 years ago

Got it, thanks! I'll take a look and follow up here.

etal commented 9 years ago

I've updated the segment command to avoid dropping some of the low-coverage bins (d1ba5cae4d85716a0df078a9d06c473fefa9f0d4). This will make both bugs appear less often, I think, and also help identify complete deletions of genes.

I'm working on a proper fix to ensure segmentation emits better-behaved .cns files.

etal commented 9 years ago

With the change I just pushed, if you redo segmentation with the example .cnr files that previously triggered these bugs, then segmetrics should work on the updated .cns files. Does it work for you now?

mjafin commented 9 years ago

Hi Eric, I can confirm the analyses run great now, with no errors. Many thanks again!