Closed. mjafin closed this issue 9 years ago.
Hi @etal, I wanted to pick your brain on how to implement some kind of "confidence intervals" around the copy number estimates. Ideally there could be an interval around the point estimate that would reflect the depth of sequencing and the smoothness of the copy number signal around it (it could be based on e.g. MAD, http://en.wikipedia.org/wiki/Median_absolute_deviation, with a provision for depth).
For example, a log2 copy number of 1 surely should be more meaningful in 20x coverage data than in 1x coverage data, right?
I'm more than happy to implement something and help, but I wanted to hear your thoughts first.
Hi Miika, here are a couple of things that come to mind that can be done with the current code base and the cnvlib API:
The weight column in the .cnr file represents, for each bin, 1/variance of the log2 coverages in that bin across all of the given samples. (Coverage depths do not appear to be normally distributed; I think the log2 transform brings the distribution closer to normal.) This is used for segmentation, but the variance/weight isn't shown for the segments. However, with some math you may be able to summarize these values across each segment, along the lines sketched below.
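For illustration, a minimal sketch with plain pandas (not the cnvlib API; the file name and segment coordinates here are made up) of how the per-bin weights could be summarized over one segment:

import pandas as pd

# .cnr files are tab-separated, with chromosome, start, end, gene,
# log2 and weight columns.
cnr = pd.read_csv("sample.cnr", sep="\t")  # hypothetical file name

# Select the bins falling inside one segment (example coordinates).
seg = cnr[(cnr["chromosome"] == "chr17") &
          (cnr["start"] >= 37640219) &
          (cnr["end"] <= 38137566)]

# Weighted mean of the bin-level log2 values.
w = seg["weight"]
weighted_mean = (seg["log2"] * w).sum() / w.sum()

# If each weight is 1/variance, the variance of this weighted mean
# is 1 / sum(weights) -- one way to carry a weight over to the segment.
segment_variance = 1.0 / w.sum()
print(weighted_mean, segment_variance)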
The cnvlib.metrics module provides several estimators of scale, including MAD, interquartile range and a robust weighted variance.
Modelling the reliability or confidence interval of each segment based on the underlying read depths, as you described, is really more a feature of the segmentation algorithm. For example, the recently published CODEX introduces a segmentation method based on raw read counts and their expected distribution, not entirely unlike cn.MOPS.
CNVkit has a modular approach to segmentation: just implement a wrapper for a method in cnvlib/segmentation/__init__.py and call it with the segment command. A wrapper for, or reimplementation of, CODEX's segmentation procedure would be nice.
After the next point release I'm going to change the coverage command to not median-center the (target|antitarget)coverage.cnn files before writing, so that the actual read counts are still available.
I'm also working on replacing the internal representation of genomic intervals to use Pandas instead of Numpy structured arrays, so that .cnr and .cns files can more easily store arbitrary additional columns -- more like Bioconductor's GenomicRanges package. (If you know of an existing Python package that provides pandas-based GenomicRanges, please let me know!) This would provide a place to store the segment-level metadata other segmentation algorithms emit, if any.
If you'd like to work on something more exploratory I'd be happy to help with that, too -- just e-mail me.
@etal Thanks for the very comprehensive reply, much appreciated. I'll take a look at the weight column to begin with. Bootstrapping might give us more accurate intervals, but I'd hate to resort to resampling if it means much longer computation time.
I haven't seen the CODEX paper before; I'll give it a read. I'm not an expert in Python numerical packages, but I'll keep an eye out for any that reproduce GenomicRanges-like functionality.
Just a quick update: I've been downsampling a publicly available cell line (pair) with a known amplification to observe what happens to the estimated copy number. When going from an average coverage of 5x down to 0.5x, the log2 copy number stays roughly the same (nice). I did a very crude estimate of the lower bound of a 95% confidence interval (if you can call it that), based on the 2.5th percentile of all the log2 values in the .cnr file for the region. Given that the individual intervals are roughly the same width for all the values, I didn't do any weighting.
At 5x the lower bound is roughly 0.8 log2 units below the copy number, and at 0.5x it's roughly 2 log2 units below. It's nice to see the numbers support the expectation that lower coverage brings more variability.
@etal I've been toying around with interval estimation at https://github.com/etal/cnvkit/blob/master/cnvlib/segmentation/__init__.py#L148
The following code would estimate a prediction/confidence interval from the distribution of the bin means:
# Attach a 2.5%-97.5% quantile range, as a string, to each segment.
fit$output$quants <- rep(NA, length(fit$output$mean))
for (ind in seq_along(fit$output$quants)) {
  if (!is.na(fit$segRows$startRow[ind])) {
    # Bin-level log2 values belonging to this segment
    vals <- fit$data$y[fit$segRows$startRow[ind]:fit$segRows$endRow[ind]]
    fit$output$quants[ind] <- paste(quantile(vals, c(0.025, 0.975)),
                                    collapse = ";")
  }
}
However, I'm not sure how to best handle this at https://github.com/etal/cnvkit/blob/master/cnvlib/segmentation/__init__.py#L47 to get the interval printed out in the cns file. Do you have any recommendations/suggestions? I've tried a few things to inject the interval into the output but I get cryptic errors from numpy.
It looks like you're formatting the confidence interval as a string, so you could put it in the "gene" column of the .cns file, as it looks like you were aiming to do given the line you selected.
The R code you pasted could be replicated in Python using the function numpy.percentile and a similar-looking loop otherwise. If you feel comfortable trying that, I'd be happy to take a pull request; otherwise let me know and I'll give it a shot.
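For instance, a rough Python equivalent might look like this (the y array and the per-segment row index pairs are stand-ins for fit$data$y and fit$segRows, not actual cnvlib names):

import numpy as np

# y: 1-D array of bin-level log2 values (like fit$data$y in R).
# seg_rows: list of (start_row, end_row) index pairs per segment,
# or None where the R code had NA.
def segment_quantile_ranges(y, seg_rows):
    quants = []
    for rows in seg_rows:
        if rows is None:
            quants.append(None)
            continue
        start_row, end_row = rows
        vals = y[start_row:end_row + 1]  # R's ranges are inclusive
        lo, hi = np.percentile(vals, [2.5, 97.5])
        quants.append("%s;%s" % (lo, hi))
    return quants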
Incidentally, or maybe more importantly: is the 2.5-97.5 percentile range of the observed data really the 95% confidence interval as we usually think of it?
Thanks @etal, I'll look into piggybacking the interval onto the gene column somehow and will prepare a pull request.
I was thinking of using the Python percentile function too, but not all the information required is returned from the R function, so rather than bloating the table R returns, I'd say it's easier to carry out the calculations in R.
I've been thinking about the philosophical side of the percentile range too and what it means. On one hand, I think what I'm proposing is a sort of resampling based confidence interval for a (one) bin, as it's based on observing the distribution of the (log2) bin means from the same segment. The bin means are in a way repeated measurements of the same thing (within segment). On the other hand, it's also a prediction interval of the segment total mean.
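To make that distinction concrete, here is a small illustrative sketch on simulated data (not CNVkit code) contrasting the percentile range of the bin values, which behaves like a prediction interval for a single bin, with a bootstrap confidence interval for the segment mean:

import numpy as np

np.random.seed(0)
# Simulated bin-level log2 values for one segment
bins = np.random.normal(loc=0.6, scale=0.4, size=200)

# Percentile range of the raw bin values: roughly an interval for
# a single new bin drawn from this segment.
pi_lo, pi_hi = np.percentile(bins, [2.5, 97.5])

# Bootstrap CI of the segment mean: resample the bins with
# replacement and look at the spread of the resampled means.
boot_means = [np.mean(np.random.choice(bins, size=bins.size, replace=True))
              for _ in range(2000)]
ci_lo, ci_hi = np.percentile(boot_means, [2.5, 97.5])

print("bin-level range: %.3f to %.3f" % (pi_lo, pi_hi))
print("CI of the mean:  %.3f to %.3f" % (ci_lo, ci_hi))

The bootstrap interval shrinks as the number of bins grows, while the percentile range of the raw values does not; that is essentially the CI-versus-PI distinction.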
Which information is missing from the table returned by the R function? Is it the mapping of probes to segments? If so, it would make sense to add that. I can do so.
I just used a script to check this, and noticed yesterday too that the mean reported by CBS is literally just the unweighted mean of the values, ignoring the weights and the x-axis (I wasn't sure whether that was intentional)!
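A quick numpy illustration of the difference, with made-up numbers:

import numpy as np

log2_vals = np.array([0.1, 0.2, 5.9, 0.3])
weights = np.array([1.0, 1.0, 0.1, 1.0])   # e.g. 1/variance per bin

print(np.mean(log2_vals))                      # plain mean, ignoring weights
print(np.average(log2_vals, weights=weights))  # weighted mean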
I had a go using your script - here's an example output table:
chromosome  start     end       gene            log2      probes
chr17       133       22239009  CI=-5.88_-5.74  -6.26166  28189
chr17       25280863  37640219  CI=-3.51_-3.32  -3.00418  19698
chr17       37640219  38137566  CI=5.93_6.06     5.96198   1152
chr17       38137566  81191601  CI=-3.26_-3.16  -3.01774  63151
Here are the results of my approach (ignoring weights and trusting the CBS means):
chromosome  start     end       gene                      log2    probes
chr17       133       22239009  G;CI=-3.298384_4.077177   0.373   28189
chr17       25280863  37640219  G;CI=-3.0434715_4.139435  0.5993  19698
chr17       37640219  38137566  G;CI=3.944334_8.784433    5.9891   1152
chr17       38137566  81191601  G;CI=-3.00873_4.322255    0.6752  63151
There are a few issues that would be good to discuss:
The probe counts per segment differ between the two: in R they are 28188, 19697, 1151, 63150, whereas in Python they are 41444, 24795, 1152, 78806. This might explain why the means are also so different, besides the weighting. How does the Python function pull the data points for the segments? It seems to grab more than R.
I carried out the interval calculations in R, on the fit dataframes, instead of Python.
I see now. CNVkit's by_segment method is not fancy; I think the segmentation algorithm is silently masking out some bins. I think it might be wise to add a segmetrics command that calculates stats similar to the metrics command, but per-segment. The MAD, IQR, etc. might also be useful to be able to report; a sketch of the idea follows.
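As a rough illustration of such per-segment statistics with plain pandas (the column names follow the tables above; this helper is hypothetical, not the eventual segmetrics implementation):

import numpy as np
import pandas as pd

# cnr and segments are DataFrames with chromosome/start/end/log2 columns.
def segment_spread_stats(cnr, segments):
    rows = []
    for _, seg in segments.iterrows():
        # Bins falling inside this segment
        bins = cnr[(cnr["chromosome"] == seg["chromosome"]) &
                   (cnr["start"] >= seg["start"]) &
                   (cnr["end"] <= seg["end"])]
        vals = bins["log2"].values
        mad = np.median(np.abs(vals - np.median(vals)))
        q1, q3 = np.percentile(vals, [25, 75])
        rows.append({"chromosome": seg["chromosome"],
                     "start": seg["start"], "end": seg["end"],
                     "stdev": vals.std(), "mad": mad, "iqr": q3 - q1})
    return pd.DataFrame(rows)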
About those missing probes: CNVkit's R script that runs PSCBS filters out probes with log2 value < -20. These are usually very-low-coverage targets near telomeres and centromeres, and are more likely to appear in long segments that cover a whole chromosome arm. I'll confirm manually. If that's it, then the same filter should be applied when recalculating segment means.
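Applying that filter before recomputing a segment mean might look like this (illustrative pandas; the file name is made up and the coordinates come from the tables above):

import pandas as pd

cnr = pd.read_csv("sample.cnr", sep="\t")

# Drop the very-low-coverage bins that the PSCBS wrapper filters out.
kept = cnr[cnr["log2"] >= -20]

# Recompute the (unweighted) mean for one example segment.
seg = kept[(kept["chromosome"] == "chr17") &
           (kept["start"] >= 133) &
           (kept["end"] <= 22239009)]
print(seg["log2"].mean())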
Nice debugging! Agreed on segmetrics, that would be excellent. I'm happy to help with implementing, testing and debugging any new features.
Dropping by to say that the fact that the mean levels returned by PSCBS::segmentByCBS() are the non-weighted means when weighted segmentation is used is indeed a bug. I've created PSCBS Issue #18 for this. Thanks for spotting it.
I've added a segmetrics command with CI and PI options; care to take a look, @mjafin?
Fantastic, thanks @etal, I'll test these in the next few days!
There might be a small oversight somewhere, as my .cns file doesn't have segments in all chromosomes, and it produces the following error:
[klrl262@chara: /ngs/oncology/analysis/external/CCLE/HCC1954/downsampling_exercise/genereadV2_GTL16_EBC1/work/structural ]$ cnvkit.py segmetrics -s GTL16/cnvkit/raw/GTL16-ready.cns -o temp.txt --stdev --mad --iqr --bivar --ci --pi GTL16/cnvkit/raw/GTL16-ready.targetcoverage.cnn
Traceback (most recent call last):
File "/home/klrl262/miniconda/envs/bcbio-test/bin/cnvkit.py", line 4, in <module>
__import__('pkg_resources').run_script('CNVkit==0.5.2', 'cnvkit.py')
File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/setuptools-14.3-py2.7.egg/pkg_resources/__init__.py", line 698, in run_script
File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/setuptools-14.3-py2.7.egg/pkg_resources/__init__.py", line 1623, in run_script
File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/CNVkit-0.5.2-py2.7.egg/EGG-INFO/scripts/cnvkit.py", line 9, in <module>
File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 1333, in _cmd_segmetrics
File "build/bdist.linux-x86_64/egg/cnvlib/cnarray.py", line 313, in by_segment
ValueError: Segments are missing chromosome 'chr16'
Should the positional arguments include both the cnn and cnr file or just one (and not the other)? Does order matter?
The command line you used was right. Only one .cnr or .cnn file can be used at a time, and then the -s segments option is required.
I fixed a bug where it would crash on .cnn files because of the missing "weight" column. So the .cnr and .cnn files that were used to create the .cns file should all work now.
The missing chromosome bug is going to be fixed when the pandastyle branch lands; the pandas.DataFrame backend makes this easier to implement. For now, you could either manually add a "dummy" line to the .cns file, covering the full length of the missing chromosome with neutral copy number (log2=0), or else drop the offending rows from the other .cnn/.cnr file, as sketched below. (Sorry.)
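The dummy-line workaround could be scripted along these lines (an illustrative sketch: the input name comes from the command above, the output name is made up, and 90354753 is chr16's length in hg19, to be adjusted for your reference):

import pandas as pd

cns = pd.read_csv("GTL16-ready.cns", sep="\t")

# Append a neutral (log2 = 0) segment spanning the missing chromosome.
dummy = {"chromosome": "chr16", "start": 0, "end": 90354753,
         "gene": "-", "log2": 0.0, "probes": 0}
cns = cns.append(dummy, ignore_index=True)
cns.to_csv("GTL16-ready.fixed.cns", sep="\t", index=False)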
Did CNVkit generate this .cns file with missing chromosomes? Any idea how that happened, or do you think it's a bug?
Thanks Eric! Does it make any difference which one to use, the .cnr or the .cnn?
The data is from a small targeted panel, and it looks like even though chr16 has a bit of data, nothing was detected as having a copy number change, and thus (I believe) there are no entries in the .cns file for chr16 (or chr18, chr21 and chrY).
Upon further testing (different data) I got this error too:
Traceback (most recent call last):
File "/home/klrl262/miniconda/envs/bcbio-test/bin/cnvkit.py", line 4, in <module>
__import__('pkg_resources').run_script('CNVkit==0.6.0.dev0', 'cnvkit.py')
File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/setuptools-14.3-py2.7.egg/pkg_resources/__init__.py", line 698, in run_script
File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/setuptools-14.3-py2.7.egg/pkg_resources/__init__.py", line 1623, in run_script
File "/home/klrl262/miniconda/envs/bcbio-test/lib/python2.7/site-packages/CNVkit-0.6.0.dev0-py2.7.egg/EGG-INFO/scripts/cnvkit.py", line 9, in <module>
File "build/bdist.linux-x86_64/egg/cnvlib/commands.py", line 1357, in _cmd_segmetrics
File "build/bdist.linux-x86_64/egg/cnvlib/cnarray.py", line 109, in __setitem__
ValueError: cannot copy sequence with size 11 to array axis with dimension 12
Do you know what this could be about at all?
Thanks for the clarification again! I've emailed you (first.last@ucsf.edu?) the two cases, let me know if you don't get my email.
Got it, thanks! I'll take a look and follow up here.
I've updated the segment command to avoid dropping some of the low-coverage bins (d1ba5cae4d85716a0df078a9d06c473fefa9f0d4). This will make both bugs appear less often, I think, and also help identify complete deletions of genes.
I'm working on a proper fix to ensure segmentation emits better-behaved .cns files.
With the change I just pushed, if you redo segmentation with the example .cnr files that previously triggered these bugs, then segmetrics should work on the updated .cns files. Does it work for you now?
Hi Eric, I can confirm the analyses run great now, no errors. Many thanks again!