etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org
Other
545 stars 165 forks source link

Optionally consider PAR1/2 as diploid in male samples due to reference genome masking #784

Closed rollf closed 1 year ago

rollf commented 1 year ago

Hi,

I'd like to address a problem that was already mentioned by @gabeng in this issue. The pseudoautosomal regions in chr Y are masked (by N) in hg38 and it also common practice to do so in hg19/GRCh37-based pipelines (see for reference this GATK blog entry). In consequence, the coverage of male samples within the PAR1/2 on chromosome X appears 'doubled'. For example, I have a couple of whole exome samples where one can clearly see this based on the cnn coverage file:

male01 targetcoverage cnn log2 X0-2500 male01 targetcoverage cnn depth X0-2500

The plots show the first 2500 target regions. In contrast, see a female sample:

female01 targetcoverage cnn depth X0-2500


Feeding a male-only pool through the usual cnvkit workflow (reference, fix, segment), the previously doubled coverage in PAR1 within the male samples gets lost. The sample has log2 ~ -1 across the whole chromosome.

male01 cnr master log2 X0-2500

Furthermore, feeding a mixed pool (5 male, 5 female samples) into the pipeline, the PAR1 region in both samples gets biased. Male samples have log2 ~ -0.5 (i.e. higher than in the male-only pool) and female samples have log2 ~ -0.5 (i.e. lower than in a female-only pool), too (in PAR1 only):

male01 mixed cnr master log2 X0-2500 female01 mixed cnr master log2 X0-2500


All of this is quite understandable as chromosome X is simply considered haploid for males samples. This is of course correct from a genetics point of view. However, technically, the healthy male samples are expected to be diploid within PAR1/2 because all reads from chr Y in PAR1/2 are aligned to chr X (due to the masking). The results above make it hard to diagnose copy number aberrations within PAR1/2. For testing purposes, I have implemented some changes in current code base of cnvkit to consider PAR1 as diploid. Namely, GenomicArray.autosomes() includes PAR1 and sex-related code paths referring to chr X exclude PAR1. This yields the results as expected from the coverage data. Male samples have log2 ~ 0 for PAR1 (for both male-only and mixed pools) and female samples stay at log2 ~ 0 in PAR1 even for mixed pools:

male01 mixed cnr xwop log2 X0-2500 female01 mixed cnr xwop log2 X0-2500

I would be interested in integrating this change into cnvkit. This could be an optional feature, of course. I'm happy to provide an initial draft PR in order to discuss this from a more technical point of view. Please let me know if you're interested in this.

rollf commented 1 year ago

Hey @etal,

any feedback on this from your side? As I said, I'm happy to do the actual work given green lights from your side.

etal commented 1 year ago

This makes sense to me. Sure, I'd love to see a PR.

rollf commented 1 year ago

Closing this the issues was solved by #789