etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org

What are the expected log2 values on chrY for female sample vs female reference #318

Closed rbatorsky closed 6 years ago

rbatorsky commented 6 years ago

Hello, I'm using cnvkit v0.9.1 and I am confused about the negative log2 values for chrY that I get when comparing a female sample to a female reference.

I have built a female reference using a cohort of 20 female WES samples:

cnvkit.py target $targetbed --split -o my_targets.bed
cnvkit.py antitarget $targetbed --access $accessbed -o my_antitargets.bed
cnvkit.py coverage $bam my_targets.bed -o ${bam}.targetcoverage.cnn -p $task.cpus
cnvkit.py coverage $bam my_antitargets.bed -o ${bam}.antitargetcoverage.cnn -p $task.cpus
cnvkit.py reference *cnn --fasta genome.fa -o reference.cnn 

Then I run individual samples from this cohort against this reference in batch mode:

cnvkit.py batch $bam -r $referencecnn --output-dir '.' -p ${task.cpus} 

I typically see segments with negative log2 values, and I am concerned that I do not have the right method.

For example:

chrom  svstart   svend     log2      depth       cn
chrY   10500     2850583   -24.2234  0.00986237  0
chrY   2851083   6734179   -3.17169  23.4804     0
chrY   6735705   14460138  -11.6599  1.43891     0
chrY   14460638  15817151  -22.8281  0.0683646   0
chrY   15817251  20628581  -11.4769  1.95446     0
chrY   20629081  22943009  -20.5391  0.444197    0
chrY   22943509  28783629  -10.1198  0.195285    0
chrY   58968156  59363066  -24.2333  0.0549595   0

The scatter plot looks like this:

[Screenshot of the scatter plot, 2018-02-15]

The inferred copy number is always zero, as expected for chrY in a female sample, but I expected log2 values of ~0; instead they are large and negative. Thanks for any insight and for a great tool!

etal commented 6 years ago

Any sequencing reads that map to chrY in these samples can be treated as noise, either misaligned or ambiguously aligned to pseudogenes or the pseudoautosomal region.

Theoretically the log2 read-depth ratio on chrY for female samples normalized to a female reference is log(0/0) = NaN. CNVkit also fills in missing log2 values with -20 (which then drifts a bit after GC correction and re-centering).
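As a rough illustration of the placeholder behaviour described above, here is a minimal Python sketch. It is not CNVkit's actual code: the names `log2_depth` and `NULL_LOG2` are hypothetical, and only the value -20 comes from this comment.

```python
import math

# Placeholder for "no coverage", taken from the comment above.
# The constant name is an assumption made for this sketch.
NULL_LOG2 = -20.0


def log2_depth(depth, null_log2=NULL_LOG2):
    """Return log2(depth), or the placeholder when depth is zero or missing.

    math.log2(0) raises ValueError, so a zero-depth bin has to be
    special-cased before taking the logarithm.
    """
    if depth is None or depth <= 0:
        return null_log2
    return math.log2(depth)


if __name__ == "__main__":
    # Depths for a few hypothetical bins: covered, sparsely covered, empty.
    for depth in (8.0, 0.5, 0.0):
        print(depth, log2_depth(depth))
```

Later steps such as GC correction and re-centering then shift these values, which would explain why the reported numbers are not exactly -20.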

For practicality, CNVkit makes chrY haploid in a "female" reference, so regardless of the reference gender, chrY will show log2 values around 0 for male samples and some arbitrary negative number for female samples. The values that you see here are typical.

If your cohort is all female, it would be reasonable to just delete chrY from either your reference or each of your samples. Then chrY wouldn't show up in the resulting plots and tables.
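One way to drop chrY from a tab-separated table is sketched below in Python. This is an assumption-laden example and not a CNVkit command: the function `drop_chromosome` is hypothetical, and it assumes the file has a header row with a column named `chromosome` and that the Y chromosome is labelled `chrY` (adjust both for your files, e.g. `Y` for references without the `chr` prefix).

```python
import csv
import io


def drop_chromosome(in_handle, out_handle, chrom="chrY", column="chromosome"):
    """Copy a tab-separated table, skipping rows on the given chromosome.

    Returns the number of data rows written.
    """
    reader = csv.DictReader(in_handle, delimiter="\t")
    writer = csv.DictWriter(
        out_handle,
        fieldnames=reader.fieldnames,
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if row[column] != chrom:
            writer.writerow(row)
            kept += 1
    return kept


if __name__ == "__main__":
    # Tiny in-memory table standing in for a real file.
    table = io.StringIO(
        "chromosome\tstart\tend\tlog2\n"
        "chr1\t1\t100\t0.1\n"
        "chrY\t1\t100\t-20\n"
    )
    filtered = io.StringIO()
    drop_chromosome(table, filtered)
    print(filtered.getvalue(), end="")
```

For real files, open the input and output paths with `open(path, newline="")` and pass the handles in the same way.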

JspSrs commented 6 years ago

Sorry,

Theoretically speaking, treating reads on Y in a female vs. female-reference comparison as noise is more or less correct. In my opinion one should be stricter: that idea only holds for spurious reads (i.e. derived from capture affinity in WES, or contamination-like reads in WGS) mapping to non-PAR regions. Females with Y sequences do exist, e.g. via a Disorder of (Sexual) Development, or "normal" females carrying a Y segment as a gain on another chromosome.

PAR1 and PAR2 can often be handled in software, although their boundaries shift a bit depending on the underlying technique and software.

So, while it is indeed reasonable to exclude Y from your analysis, I would not do that, as hiding it might also hide real leads. But that probably depends a lot on your research question; I am looking at this from the perspective of diagnostics.

Best, Jasper

rbatorsky commented 6 years ago

Thanks very much for the helpful response. I just have a clarifying question about "CNVkit makes chrY haploid in a 'female' reference". From the documentation sex.rst, I understand that by default chrX is considered diploid, and male reference samples have coverage on X doubled to resemble a diploid X. How are coverages scaled in female samples to resemble haploid Y? Are depth values in chrY bins doubled? Thanks again.

etal commented 6 years ago

When building the reference, the chrY values from apparent female control samples are all replaced with -1. This makes it possible to normalize a chromosomally male test sample (i.e. any containing a real chrY) to a reference built from all chromosomally normal female samples, and addresses @JspSrs's other caveats.
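The arithmetic behind this can be sketched as follows. This is a simplified illustration, not CNVkit's implementation: the function name and the idea of a depth expressed relative to the autosomal level are assumptions for the example. Only the reference value -1 (this comment) and the placeholder -20 (mentioned earlier in the thread) come from the discussion.

```python
import math

REF_CHRY_LOG2 = -1.0  # chrY value stored in the reference, per the comment above
NULL_LOG2 = -20.0     # placeholder for zero coverage, mentioned earlier in the thread


def chry_log2_ratio(sample_rel_depth):
    """Log2 ratio of a sample's chrY bin against the reference.

    sample_rel_depth is the bin's depth relative to the diploid autosomal
    level (a hypothetical normalization used only in this sketch), so a
    single-copy chrY is about 0.5.
    """
    if sample_rel_depth > 0:
        sample_log2 = math.log2(sample_rel_depth)
    else:
        sample_log2 = NULL_LOG2
    # A ratio in log space is a subtraction.
    return sample_log2 - REF_CHRY_LOG2


if __name__ == "__main__":
    print("male, one chrY copy:", chry_log2_ratio(0.5))     # -1 - (-1) = 0
    print("female, no reads:", chry_log2_ratio(0.0))        # -20 - (-1) = -19
    print("female, stray reads:", chry_log2_ratio(0.001))   # large and negative
```

Under these assumptions a male sample lands near 0 on chrY, while a female sample lands at a large negative value whose exact size depends on how many stray reads map there.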

I agree it's surprising, but the alternatives all seem worse in one way or another. In the development/upcoming version of CNVkit, the documentation no longer refers to the reference "sex", and instead just describes the option of whether chrX should be haploid in the reference.