What are the expected log2 values on chrY for female sample vs female reference

rbatorsky commented 6 years ago

Hello, I'm using cnvkit v0.9.1 and I am confused about the negative log2 values for chrY that I get when comparing a female sample to a female reference.

I have built a female reference using a cohort of 20 female WES samples:

cnvkit.py target $targetbed --split -o my_targets.bed
cnvkit.py antitarget $targetbed --access $accessbed -o my_antitargets.bed
cnvkit.py coverage $bam my_targets.bed -o ${bam}.targetcoverage.cnn -p $task.cpus
cnvkit.py coverage $bam my_antitargets.bed -o ${bam}.antitargetcoverage.cnn -p $task.cpus
cnvkit.py reference *cnn --fasta genome.fa -o reference.cnn

Then, I'm running individual samples from this cohort against this reference using batch mode:

cnvkit.py batch $bam -r $referencecnn --output-dir '.' -p ${task.cpus}

I typically see segments with negative log2 values, and I am concerned that I do not have the right method.

For example:

chrom	svstart	svend	log2	depth
chrY	10500	2850583	-24.2234	0.00986237
chrY	2851083	6734179	-3.17169	23.4804
chrY	6735705	14460138	-11.6599	1.43891
chrY	14460638	15817151	-22.8281	0.0683646
chrY	15817251	20628581	-11.4769	1.95446
chrY	20629081	22943009	-20.5391	0.444197
chrY	22943509	28783629	-10.1198	0.195285
chrY	58968156	59363066	-24.2333	0.0549595

The scatter plot looks like this:

The inferred copy number is always zero, as expected for chrY in a female sample, but I expected log2 values of ~0; instead they are large and negative. Thanks for any insight and for a great tool!

etal commented 6 years ago

Any sequencing reads that map to chrY in female samples can be treated as noise: they are either misaligned or ambiguously aligned to pseudogenes or the pseudoautosomal region.

Theoretically, the log2 read-depth ratio on chrY for a female sample normalized to a female reference is log2(0/0), which is undefined (NaN). In practice, CNVkit fills in missing log2 values with -20 (which then drifts a bit after GC correction and re-centering).
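As a rough illustration of why a placeholder is needed, here is a minimal Python sketch. The names `NULL_LOG2` and `log2_depth` are hypothetical and this is not CNVkit's actual code; only the fill value of -20 comes from the comment above.

```python
import math

NULL_LOG2 = -20.0  # fill value quoted above; the constant name is an assumption

def log2_depth(depth, fill=NULL_LOG2):
    """Return log2 of a read depth, or `fill` when the depth is zero."""
    # math.log2(0) raises ValueError, so a bin with no reads needs a placeholder.
    if depth <= 0:
        return fill
    return math.log2(depth)

female_chry = log2_depth(0.0)   # bin with no reads: gets the fill value
autosomal = log2_depth(32.0)    # ordinary covered bin
print(female_chry, autosomal)
```

The sketch ignores GC correction and re-centering, which is where the drift away from exactly -20 comes from.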

For practicality, CNVkit makes chrY haploid in a "female" reference, so regardless of the reference gender, chrY will show log2 values around 0 for male samples and some arbitrary negative number for female samples -- the values that you see here are typical.

If your cohort is all female, it would be reasonable to just delete chrY from either your reference or each of your samples. Then chrY wouldn't show up in the resulting plots and tables.
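One way to drop chrY from a tabular file is sketched below in Python. It assumes a tab-separated file with a header row and the chromosome name in the first column; the function name `drop_chromosome` is hypothetical, and the example rows (including the gene labels) are toy data for illustration only. Adjust `chrom` to "Y" if your genome uses that naming.

```python
import csv
import io

def drop_chromosome(in_handle, out_handle, chrom="chrY"):
    """Copy a tab-separated table, skipping rows whose first column equals `chrom`."""
    reader = csv.reader(in_handle, delimiter="\t")
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(next(reader))  # copy the header row unchanged
    for row in reader:
        # Blank lines are dropped too; every other row is kept unless it is on `chrom`.
        if row and row[0] != chrom:
            writer.writerow(row)

# Toy input standing in for a reference or sample file (assumed layout).
example = io.StringIO(
    "chromosome\tstart\tend\tgene\tlog2\n"
    "chrX\t100\t200\tG1\t0.01\n"
    "chrY\t10500\t2850583\tG2\t-24.2234\n"
)
out = io.StringIO()
drop_chromosome(example, out)
print(out.getvalue())
```

With real files, the two handles would be opened with `open(path, newline="")` instead of `io.StringIO`.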

JspSrs commented 6 years ago

Sorry,

Theoretically speaking, treating reads on Y in a female vs. female-reference comparison as noise is largely correct. In my opinion, though, one should be stricter: that idea holds only for spurious reads (i.e. those arising from capture affinity in WES, or contamination-like reads in WGS) that map outside the PARs. Females with Y sequences do exist, e.g. via a Disorder of (Sexual) Development, or 'normal' females carrying a Y segment as a gain on another chromosome.

PAR1 and PAR2 can often be handled in software, although their boundaries shift a bit depending on the underlying technique and software.

So, while it is indeed reasonable to exclude Y from your analysis, I would not do that, as hiding it might also hide real findings. But that probably depends a lot on your research question; I am looking at this from the perspective of diagnostics.

Best, Jasper


rbatorsky commented 6 years ago

Thanks very much for the helpful response. I just have a clarifying question about "CNVkit makes chrY haploid in a 'female' reference". From the documentation sex.rst, I understand that by default chrX is considered diploid, and male reference samples have coverage on X doubled to resemble a diploid X. How are coverages scaled in female samples to resemble haploid Y? Are depth values in chrY bins doubled? Thanks again.

etal commented 6 years ago

When building the reference, the chrY values from apparent female control samples are all replaced with -1. This makes it possible to normalize a chromosomally male test sample (i.e. any containing a real chrY) to a reference built from all chromosomally normal female samples, and addresses @JspSrs's other caveats.
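The effect of storing -1 for chrY can be sketched with simple arithmetic in Python. This is an idealized illustration, not CNVkit's implementation: the function and constant names are hypothetical, the male sample's chrY log2 of -1 assumes exactly one copy relative to diploid autosomes, and GC correction and re-centering are ignored.

```python
REF_CHRY_LOG2 = -1.0   # chrY value stored in the reference, per the comment above
NULL_LOG2 = -20.0      # placeholder for bins without coverage, from earlier in the thread

def normalized_log2(sample_log2, ref_log2):
    """A log2 ratio is the difference of the two log2 values."""
    return sample_log2 - ref_log2

# Idealized male sample: one chrY copy, so log2 coverage is about -1.
male = normalized_log2(-1.0, REF_CHRY_LOG2)
# Female sample: essentially no chrY reads, so log2 coverage is the placeholder.
female = normalized_log2(NULL_LOG2, REF_CHRY_LOG2)
print(male, female)
```

Under these assumptions the male sample lands at 0 and the female sample at a large negative value, in line with the values reported at the top of the thread.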

I agree it's surprising, but the alternatives all seem worse in one way or another. In the development/upcoming version of CNVkit, the documentation no longer refers to the reference "sex", and instead just describes the option of whether chrX should be haploid in the reference.

etal / cnvkit

What are the expected log2 values on chrY for female sample vs female reference #318