Sample sex is not always detected correctly

aleksandrabliz commented 8 years ago

Hi! I stumbled upon a problem on the stage of sample sex determination.

In group of samples where about half were male and the other half female, CNVKit determined most of them (> 85%) to be female from target region coverage. The problem seems to reside in function guess_xx() that guesses whether a sample is female from chrX relative coverage. It uses a fixed cutoff value of −0.5 for raw data, but it turned out this is not always adequate. We found that for our targeted sequencing datasets antimode of relative chrX coverage can range anywhere from −0.9 to −0.1 for raw probe coverage, depending on the sequencing library. It can also be different for target and antitarget distributions. We haven't yet determined the cause of that, but our data seem to be completely normal in any other regard. These antimode values usually differ between targeted sequencing libraries, but stay the same for any given library.

The end consequence of this problem, if uncorrected, is that chrX reference coverage is mixed up and unusable due to incorrect interpretation of male and female samples during its construction. During calling stage it also causes spurious heterozygous deletions in chrX to appear in male samples that were incorrectly labeled as female.

As a temporary fix for our workflow, I added two parameters to CNVKit, --tthreshold and --athreshold, which allow the user to specify correct thresholds for target and antitarget coverages directly. You can check my version out here. Our current workflow is to plot raw target/antitarget coverages, then determine the distribution antimodes visually, and later specify them during reference construction and calling steps.

I can start a pull request with these edits if you'd like; however, this is not a very elegant solution in my opinion, and possibly confusing to end users. Maybe you'll want to implement automatic threshold determination during reference construction instead. Anyway, please let me know what you think about this issue, and if I could be of any further assistance. And thanks for all your work on CNVKit, it's really useful to us!

Alexandra

etal commented 8 years ago

Thanks for letting me know, this is a little surprising. Could you post some of your problematic .cnn files somewhere so that I can use them for testing?

Is there anything unusual about your library preparation protocol, or could there be something wrong with one of your capture kits? The coverage of chrX should be similar to the autosomes in female samples and half in male samples, so usually a cutoff of anywhere between -.3 and -.7 works on data I've seen.

As another solution, I could add an option to the batch command to take the sample genders as specified, e.g. read from a tabular file or using some extra command-line syntax. The other commands in CNVkit that work on single samples already let the sample's sex be specified with the -g option if known; please let me know if I've missed any in the current development version.

aleksandrabliz commented 8 years ago

I posted some of .cnn files here. Also you can see sample's sex determination results by `cnvkit.py gender .cnnfunction [here](https://docs.google.com/spreadsheets/d/1SVD2Mnv085Fl4a7-yoEq4OCOw0mx8ekv_BheDx6WTlA/edit#gid=0). There are results for two libraries: TruSight (Illumina) and Genotek01 (it's ours). We don't have problems with library preparation protocol and usually get stable results for different libraries. But sometimes problem of incorrect sex detection arises. Of course, both of these solutions are good, but, unfortunately, we don't usecnvkit.py batch` function in our pipeline and usually don't have sample's sex information. That's why these solutions are not convenient to us. I think that automatic threshold determination during reference construction would be better. If you have some another idea, please let me know.

etal commented 8 years ago

So: Based on the distributions of both chrX and chrY coverages, but especially chrY, sample mj1813 looks female and the rest look male:

heatmap

I've updated the gender command to show both chrX and chrY log2 coverages relative to the autosomes:

sample          gender  rel.chr.X   rel.chr.Y
jt9333.t.cnn    Female  -0.364      +0.353
jt9333.a.cnn    Male    -1.29       -0.918
jt9333.cnr      Female  -0.409      +0.249
jt9333.cns      Male    -0.613      +0.407

mj1813.t.cnn    Female  +0.177      -23
mj1813.a.cnn    Male    -0.976      -15.6
mj1813.cnr      Female  +0.00907    -22.1
mj1813.cns      Female  -0.0898     -7.77

od7234.t.cnn    Female  -0.342      +0.397
od7234.a.cnn    Male    -1.44       -0.723
od7234.cnr      Female  -0.371      +0.48
od7234.cns      Male    -0.52       +0.618

oj0720.t.cnn    Female  -0.455      -0.7
oj0720.a.cnn    Male    -0.978      -0.691
oj0720.cnr      Female  -0.253      +0.0918
oj0720.cns      Female  -0.39       +0.0997

xn6664.t.cnn    Female  -0.499      -0.776
xn6664.a.cnn    Male    -1.03       -0.79
xn6664.cnr      Female  -0.279      +5e-06
xn6664.cns      Female  -0.365      -0.0831

This is with a female reference, so chrX and chrY should both be near -1.0 and 0.0 for male samples, and 0.0 and well below -1.0 for female samples. I used segment --drop-low-coverage, and the .cns files would have called the chromosomal sex correctly if the cutoff were 0.3 instead of 0.5. The development version of CNVkit has improvements to centering, which may be what brought mj's chrX close to where it should be.

These samples are pretty noisy. The antitargets might be improved with a larger bin size, e.g. 200000 instead of 100000. The target bin sizes are mostly below 200, so a little smaller than expected for exons; did you use the BED file for baits (better) or primary targets (worse)?

I'll try some different approaches to automatically detecting gender using these samples as the test dataset; median versus a fixed threshold does not seem to be accurate enough.

etal commented 8 years ago

I've changed the test to use a couple of Mann-Whitney tests for difference in means, and choose the better-fitting chromosomal sex. I also changed the calculation of relative log2 ratios to match the method used in calculating segment means. The calls are better now:

sample          gender  X_logratio  Y_logratio
jt9333.t.cnn    Male    -0.815      -0.0786
jt9333.a.cnn    Male    -1.52       -0.98
jt9333.cnr      Male    -0.664      -0.0498
jt9333.cns      Male    -0.626      +0.408

mj1813.t.cnn    Female  +0.352      -16.6
mj1813.a.cnn    Female  -1.6        -11.4
mj1813.cnr      Female  +0.00247    -15.5
mj1813.cns      Female  -0.0898     -10.4

od7234.t.cnn    Male    -0.659      -0.0705
od7234.a.cnn    Male    -1.88       -1.39
od7234.cnr      Male    -0.591      -0.113
od7234.cns      Male    -0.547      +0.597

oj0720.t.cnn    Male    -0.569      -0.804
oj0720.a.cnn    Male    -1.39       -0.669
oj0720.cnr      Male    -0.426      +0.0958
oj0720.cns      Male    -0.393      +0.105

xn6664.t.cnn    Male    -0.604      -0.958
xn6664.a.cnn    Male    -1.47       -0.958
xn6664.cnr      Male    -0.42       -0.0864
xn6664.cns      Male    -0.359      -0.0707

The chromosome-wide averages are still fairly bizarre, but the suggestions above for reducing noise may help with that.

etal / cnvkit

Sample sex is not always detected correctly #116