Closed aleksandrabliz closed 8 years ago
Thanks for letting me know, this is a little surprising. Could you post some of your problematic .cnn files somewhere so that I can use them for testing?
Is there anything unusual about your library preparation protocol, or could there be something wrong with one of your capture kits? The coverage of chrX should be similar to the autosomes in female samples and half in male samples, so usually a cutoff of anywhere between -.3 and -.7 works on data I've seen.
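For readers following along, the fixed-cutoff idea can be sketched roughly as below. This is a minimal illustration and not CNVkit's actual code: the function name `guess_female_by_cutoff`, its arguments, and the use of a plain median are all assumptions made for the example.

```python
from statistics import median


def guess_female_by_cutoff(autosome_log2, chrx_log2, cutoff=-0.5):
    """Return True if chrX looks diploid relative to the autosomes.

    Hypothetical helper for illustration only. Inputs are per-bin log2
    coverages; `cutoff` is the fixed threshold on relative chrX coverage.
    """
    rel_chrx = median(chrx_log2) - median(autosome_log2)
    return rel_chrx > cutoff


# Toy values: a male sample whose relative chrX coverage sits near -0.4
autosomes = [0.05, -0.02, 0.0, 0.03, -0.04]
chrx = [-0.45, -0.38, -0.40, -0.42, -0.36]
print(guess_female_by_cutoff(autosomes, chrx, cutoff=-0.5))  # True (miscalled as female)
print(guess_female_by_cutoff(autosomes, chrx, cutoff=-0.3))  # False (called male)
```

The toy sample shows why a single cutoff is fragile: the same data flips between calls depending on where in the -0.3 to -0.7 range the threshold is placed.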
As another solution, I could add an option to the `batch` command to take the sample genders as specified, e.g. read from a tabular file or using some extra command-line syntax. The other commands in CNVkit that work on single samples already let the sample's sex be specified with the `-g` option if known; please let me know if I've missed any in the current development version.
I posted some of the .cnn files here. You can also see the sex determination results from running `cnvkit.py gender` on the .cnn files [here](https://docs.google.com/spreadsheets/d/1SVD2Mnv085Fl4a7-yoEq4OCOw0mx8ekv_BheDx6WTlA/edit#gid=0). There are results for two libraries: TruSight (Illumina) and Genotek01 (our own). We don't have problems with the library preparation protocol and usually get stable results across different libraries, but incorrect sex detection still occurs sometimes. Both of the proposed solutions are good, but unfortunately we don't use the `cnvkit.py batch` command in our pipeline and usually don't have the samples' sex information, so they are not convenient for us. I think automatic threshold determination during reference construction would be better. If you have another idea, please let me know.
So: Based on the distributions of both chrX and chrY coverages, but especially chrY, sample mj1813 looks female and the rest look male:
I've updated the `gender` command to show both chrX and chrY log2 coverages relative to the autosomes:
sample gender rel.chr.X rel.chr.Y
jt9333.t.cnn Female -0.364 +0.353
jt9333.a.cnn Male -1.29 -0.918
jt9333.cnr Female -0.409 +0.249
jt9333.cns Male -0.613 +0.407
mj1813.t.cnn Female +0.177 -23
mj1813.a.cnn Male -0.976 -15.6
mj1813.cnr Female +0.00907 -22.1
mj1813.cns Female -0.0898 -7.77
od7234.t.cnn Female -0.342 +0.397
od7234.a.cnn Male -1.44 -0.723
od7234.cnr Female -0.371 +0.48
od7234.cns Male -0.52 +0.618
oj0720.t.cnn Female -0.455 -0.7
oj0720.a.cnn Male -0.978 -0.691
oj0720.cnr Female -0.253 +0.0918
oj0720.cns Female -0.39 +0.0997
xn6664.t.cnn Female -0.499 -0.776
xn6664.a.cnn Male -1.03 -0.79
xn6664.cnr Female -0.279 +5e-06
xn6664.cns Female -0.365 -0.0831
This is with a female reference, so chrX and chrY should be near -1.0 and 0.0, respectively, for male samples, and near 0.0 and well below -1.0 for female samples. I used `segment --drop-low-coverage`, and the .cns files would have called the chromosomal sex correctly if the cutoff were -0.3 instead of -0.5. The development version of CNVkit has improvements to centering, which may be what brought mj1813's chrX close to where it should be.
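The expected values quoted here follow from a simple log2 ratio of copy numbers. The sketch below works through that arithmetic. The helper `expected_log2` is hypothetical, and the assumption that chrY counts as one copy in the reference is inferred only from the statement that male samples should land near 0.0 on chrY.

```python
import math


def expected_log2(sample_copies, reference_copies):
    """Expected log2 ratio of a chromosome given copy numbers (illustrative)."""
    if sample_copies == 0:
        # No copies at all: real data just shows very negative values
        return -math.inf
    return math.log2(sample_copies / reference_copies)


# Assumed copy numbers: a female reference has chrX = 2; chrY is taken as
# 1 copy so that a male sample lands at 0.0, matching the values above.
REF_X, REF_Y = 2, 1
for sex, (x_copies, y_copies) in {"male": (1, 1), "female": (2, 0)}.items():
    print(sex, expected_log2(x_copies, REF_X), expected_log2(y_copies, REF_Y))
# male -1.0 0.0
# female 0.0 -inf
```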
These samples are pretty noisy. The antitargets might be improved with a larger bin size, e.g. 200000 instead of 100000. The target bin sizes are mostly below 200, so a little smaller than expected for exons; did you use the BED file for baits (better) or primary targets (worse)?
I'll try some different approaches to automatically detecting gender using these samples as the test dataset; median versus a fixed threshold does not seem to be accurate enough.
I've changed the test to use a couple of Mann-Whitney tests for difference in means, and choose the better-fitting chromosomal sex. I also changed the calculation of relative log2 ratios to match the method used in calculating segment means. The calls are better now:
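To illustrate the general idea of choosing the better-fitting hypothesis with a rank test, here is a rough standard-library sketch. It is not the code that went into CNVkit. The function names, the female-reference shift of -1.0 for male chrX, and the normal approximation without tie correction are all assumptions made for the example.

```python
def mann_whitney_z(xs, ys):
    """Z-score of the Mann-Whitney U statistic for xs versus ys.

    Uses the normal approximation with average ranks for ties and no
    tie correction of the variance (a simplification for illustration).
    """
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    n = len(pooled)
    rank_sum_x = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(xs), len(ys)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12.0) ** 0.5
    return (u - mean_u) / sd_u


def guess_chrx_sex(autosome_log2, chrx_log2, male_shift=-1.0):
    """Pick the hypothesis under which chrX best matches the autosomes."""
    # Female hypothesis: chrX should already look like the autosomes
    z_female = abs(mann_whitney_z(chrx_log2, autosome_log2))
    # Male hypothesis: undo the expected shift before comparing
    shifted = [v - male_shift for v in chrx_log2]
    z_male = abs(mann_whitney_z(shifted, autosome_log2))
    return "Male" if z_male < z_female else "Female"


# Toy data: autosomes centered on 0, chrX offset by -0.7 or -0.1
autosomes = [(i % 11 - 5) * 0.1 for i in range(110)]
male_like = [v - 0.7 for v in autosomes[:33]]
female_like = [v - 0.1 for v in autosomes[:33]]
print(guess_chrx_sex(autosomes, male_like))    # Male
print(guess_chrx_sex(autosomes, female_like))  # Female
```

Because the call depends on which hypothesis fits better rather than on an absolute threshold, a sample with chrX at -0.7 is still called male even though it is far from the ideal -1.0.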
sample gender X_logratio Y_logratio
jt9333.t.cnn Male -0.815 -0.0786
jt9333.a.cnn Male -1.52 -0.98
jt9333.cnr Male -0.664 -0.0498
jt9333.cns Male -0.626 +0.408
mj1813.t.cnn Female +0.352 -16.6
mj1813.a.cnn Female -1.6 -11.4
mj1813.cnr Female +0.00247 -15.5
mj1813.cns Female -0.0898 -10.4
od7234.t.cnn Male -0.659 -0.0705
od7234.a.cnn Male -1.88 -1.39
od7234.cnr Male -0.591 -0.113
od7234.cns Male -0.547 +0.597
oj0720.t.cnn Male -0.569 -0.804
oj0720.a.cnn Male -1.39 -0.669
oj0720.cnr Male -0.426 +0.0958
oj0720.cns Male -0.393 +0.105
xn6664.t.cnn Male -0.604 -0.958
xn6664.a.cnn Male -1.47 -0.958
xn6664.cnr Male -0.42 -0.0864
xn6664.cns Male -0.359 -0.0707
The chromosome-wide averages are still fairly bizarre, but the suggestions above for reducing noise may help with that.
Hi! I stumbled upon a problem at the sample sex determination stage.
In a group of samples where about half were male and the other half female, CNVkit determined most of them (> 85%) to be female from target region coverage. The problem seems to reside in the function `guess_xx()`, which guesses whether a sample is female from chrX relative coverage. It uses a fixed cutoff value of −0.5 for raw data, but it turned out this is not always adequate. We found that for our targeted sequencing datasets the antimode of relative chrX coverage can range anywhere from −0.9 to −0.1 for raw probe coverage, depending on the sequencing library. It can also differ between the target and antitarget distributions. We haven't yet determined the cause of that, but our data seem to be completely normal in every other regard. These antimode values usually differ between targeted sequencing libraries, but stay the same for any given library.
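As a rough illustration of picking a per-library cutoff from the data instead of by eye, the sketch below takes the midpoint of the widest gap between sorted per-sample relative chrX coverages. This is only a crude stand-in for a proper antimode estimate. It assumes both sexes are present in the batch and reasonably well separated, and the function name `widest_gap_cutoff` is hypothetical.

```python
def widest_gap_cutoff(rel_chrx_values):
    """Midpoint of the largest gap between sorted per-sample chrX values."""
    vals = sorted(rel_chrx_values)
    if len(vals) < 2:
        raise ValueError("need at least two samples")
    # Largest difference between neighboring sorted values
    _gap, left, right = max((b - a, a, b) for a, b in zip(vals, vals[1:]))
    return (left + right) / 2.0


# Toy per-sample relative chrX coverages for one library
samples = [-0.82, -0.78, -0.85, -0.31, -0.25, -0.28]
print(round(widest_gap_cutoff(samples), 3))  # -0.545
```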
The end consequence of this problem, if uncorrected, is that the chrX reference coverage is mixed up and unusable, because male and female samples are misinterpreted during its construction. During the calling stage it also causes spurious heterozygous deletions in chrX to appear in male samples that were incorrectly labeled as female.
As a temporary fix for our workflow, I added two parameters to CNVkit, `--tthreshold` and `--athreshold`, which allow the user to specify correct thresholds for target and antitarget coverages directly. You can check my version out here. Our current workflow is to plot raw target/antitarget coverages, determine the distribution antimodes visually, and then specify them during the reference construction and calling steps.
I can start a pull request with these edits if you'd like; however, this is not a very elegant solution in my opinion, and possibly confusing to end users. Maybe you'll want to implement automatic threshold determination during reference construction instead. Anyway, please let me know what you think about this issue, and if I could be of any further assistance. And thanks for all your work on CNVkit, it's really useful to us!
Alexandra