Illumina / PlatinumGenomes

The Platinum Genomes Truthset
https://illumina.github.io/PlatinumGenomes

NA12877 chrX calls #7

Open ksw9 opened 6 years ago

ksw9 commented 6 years ago

Hi,

We are using the PlatinumGenomes NA12877 resource and are wondering why calls on chrX begin at position 2781986? The corresponding ConfidentRegions.bed.gz begins at: chrX 251053 251087.

Thank you for your help!

blmoore commented 6 years ago

Confident regions need not contain a truth variant, they can also just be regions we're calling homozygous reference — does that answer your question?

ksw9 commented 6 years ago

Hi, thank you. We considered that, but it seems strange that there would be such a long stretch (2.78 Mb) of confident homozygous reference calls on the X chromosome, especially compared with the distribution of variants on other chromosomes. We're wondering whether the X calls only cover portions of the chromosome. Thank you!

eberle commented 6 years ago

I think what you are describing is the PAR (pseudoautosomal region). For variants, we "validate" a call by checking that the genotypes agree with the inheritance pattern in the pedigree. In males, the genotypes in this region end up being a combination of chrX and chrY, so most variants will likely show up as heterozygous, which automatically fails them in our consistency check. I'm guessing that this is what you are seeing.

I should point out that this can also happen in other parts of the genome where there is a CNV: we can identify positions that are definitely reference, but the variants may disagree with the pedigree check, so we fail most of them. There is a discussion of this in the manuscript.

What you should be seeing is that the confident region is not a single 2.78 Mb block but a series of smaller blocks, many of which are broken up where there are SNVs and indels. Does this agree with what you are seeing?
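One rough way to check this on your own files is sketched below. It is not part of the Platinum Genomes tooling: the function name and most of the toy intervals are hypothetical, and only the first block (chrX 251053 251087) and the first variant position (2781986) come from this thread. It assumes BED intervals are 0-based half-open and VCF positions are 1-based, and it reports how many confident blocks lie before the first truth variant and how long the longest one is.

```python
def blocks_before_first_variant(bed_intervals, variant_positions):
    """Summarise confident blocks that end before the first variant.

    bed_intervals: list of (start, end) tuples, 0-based half-open (BED convention).
    variant_positions: list of 1-based VCF positions on the same chromosome.
    Returns (number_of_blocks, longest_block_bp, total_bp).
    """
    first_variant = min(variant_positions)
    # A BED interval [start, end) covers 1-based positions start+1 .. end,
    # so it lies entirely before the variant when end < first_variant.
    lengths = [end - start for start, end in bed_intervals if end < first_variant]
    if not lengths:
        return 0, 0, 0
    return len(lengths), max(lengths), sum(lengths)


# Toy data: only the first interval and the first variant position come from this thread.
toy_blocks = [(251053, 251087), (251200, 251950), (2781000, 2781900), (2782000, 2782500)]
toy_variants = [2781986, 2782100]

n_blocks, longest, total = blocks_before_first_variant(toy_blocks, toy_variants)
print(n_blocks, longest, total)
```

If the longest block before the first variant is far shorter than 2.78 Mb, the region is a series of small reference blocks rather than one continuous stretch.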

erika8 commented 5 years ago

Hi,

We are also using the Platinum Genomes and noticed a big difference in the total size of the chrX confident regions on hg38 (2,477,045 bp) compared to hg19 (137,716,288 bp). We also checked the Genome in a Bottle (GIAB) data, where the sizes are in the same range for both builds: hg19: 137,156,694 bp; hg38: 109,267,367 bp. Do you know why there is such a difference between the two builds for the Platinum data? Chromosome X is quite important, and the advantage of Platinum Genomes over GIAB is that we have two cell lines. Is there a way to solve this? Thanks!
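Totals of this kind can be reproduced roughly as follows. This is a minimal sketch that sums interval lengths per chromosome from a BED file, assuming the intervals do not overlap; the function names and the example lines are hypothetical, and it is not necessarily the exact method behind the figures above.

```python
import gzip
from collections import defaultdict


def confident_bp_per_chrom(lines):
    """Sum interval lengths per chromosome from BED-formatted lines."""
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        # BED is 0-based half-open, so the length is simply end - start.
        totals[chrom] += int(end) - int(start)
    return dict(totals)


def confident_bp_from_file(path):
    """Same as above, reading a plain or gzip-compressed BED file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return confident_bp_per_chrom(handle)


# Hypothetical example lines; real input would be the ConfidentRegions BED file.
example = ["chrX\t251053\t251087", "chrX\t251200\t251950", "chr1\t100\t200"]
print(confident_bp_per_chrom(example))
```

Running `confident_bp_from_file` on the hg19 and hg38 files and comparing the `chrX` entries gives the per-build comparison.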

eberle commented 5 years ago

Hi @erika8,

Thanks for using this resource. I think that I know what has happened. One of our requirements is that a "confident" region must be called across the pedigree, and males are more likely to lack a "PASS" genotype due to lower depth on chrX. I think that our callers for hg19 were sex-aware when calling the homozygous reference positions, hence the higher numbers. We are looking into this now to confirm what is happening and will work to fix this.
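To illustrate why the "called across the pedigree" requirement shrinks the total so much, here is a minimal sketch that intersects per-sample callable intervals. It is an illustration only, not the actual Platinum Genomes pipeline; the sample names, intervals, and function names are all hypothetical.

```python
def intersect_two(a, b):
    """Intersect two sorted, non-overlapping lists of (start, end) half-open intervals."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def confident_across_samples(per_sample):
    """Keep only positions that are callable in every sample (needs at least one sample)."""
    samples = list(per_sample.values())
    shared = samples[0]
    for intervals in samples[1:]:
        shared = intersect_two(shared, intervals)
    return shared


# Hypothetical callable intervals on chrX; the male sample has gaps from lower depth.
callable_regions = {
    "female_1": [(0, 1000)],
    "female_2": [(0, 400), (450, 1000)],
    "male_1": [(100, 300), (600, 700)],
}
shared = confident_across_samples(callable_regions)
print(shared, sum(end - start for start, end in shared))
```

In this toy case the two female samples share 950 bp, but adding the male sample leaves only 300 bp, which is the kind of drop a single low-depth sample can cause.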

Cheers,

-Mike

erika8 commented 5 years ago

Hi Mike,

Thanks for your feedback, I'm looking forward to the fix!

Cheers

Erika

ksw9 commented 5 years ago

Hi, thanks for all your work on this, and apologies for the late response to your help above. Yes, we are also very interested in the differences between the chrX calls made against the two builds. Any ideas? Is there a loss of quality when using the hg19 calls, which contain longer confident regions?

To clarify your explanation above: do you mean that the PAR regions can't contain high-quality variants but can contain high-quality reference calls? If so, the confident regions there would consist of long stretches of reference calls, interrupted where variants are called.

Thanks again for your help! Best,

ksw9 commented 5 years ago

Hi, I just wanted to follow up - are the chrX hg19 truth VCF calls reliable? Thank you!