Illumina / Cyrius

A tool to genotype CYP2D6 with WGS data

troubleshooting "no call" outputs for samples with coverage above 30X #19

Closed: mgonzalezporta closed this issue 2 years ago

mgonzalezporta commented 2 years ago

Hi Xiao,

We've come across a few cases where Cyrius reports a no call even though the sample has coverage > 30X. Would you have any recommendations for troubleshooting this further (e.g. for telling whether the star alleles simply cannot be resolved)?

Attaching a couple of examples: Archive.zip

Thanks

xiao-chen-xc commented 2 years ago

Hi Mar, here are some of my thoughts on these two samples.

SAM3877: this is a rare SV combination that Cyrius has not seen before and does not recognize, so it makes a no-call. My best guess for the genotype is 10+36/13, with a fusion duplication on one allele and a fusion deletion on the other.

SAM3885: my best guess for the genotype is 10/10+36. One haplotype is a rare form of 10 that Cyrius does not recognize (it lacks one variant that most 10s have).

mgonzalezporta commented 2 years ago

Thanks Xiao,

FYI, here are the calls inferred from additional tools, which are also inconsistent with each other:

Sample    Cyrius    Aldy          StellarPGx
SAM3877   No call   *1/*36+*10    *1/*36x2+*10
SAM3885   No call   *10/*36+*10   *10/*10x2

So, noted that a subset of samples will need manual follow-up.

Happy for the ticket to be closed.

xiao-chen-xc commented 2 years ago

After seeing the depth issue in the other ticket (#18), I took another look at SAM3877. The problem seems to be similar to SAM3865. The MAD is a bit on the high side, and d67_snp_raw deviates from integer values. This suggests that the D6+D7 CN call may be off (Cyrius called 4 and it could be 5; a CN of 5 makes d67_snp_raw closer to integer values, see two plots below). The genotype should be *1/*36+*10 if the total CN is 5, making it consistent with Aldy.

[Two images: plots of d67_snp_raw for SAM3877, not preserved]
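To illustrate the reasoning above, here is a minimal Python sketch of the "closer to integer values" check. It is not Cyrius code. It assumes that each per-site raw value scales linearly with the total CN used to compute it (raw = total CN x fraction of D6-supporting reads), and the numbers in `raw_at_cn4` are hypothetical, not taken from SAM3877. The helper names `integer_misfit` and `rescale` are made up for this example.

```python
def integer_misfit(values):
    """Mean absolute distance of each value from its nearest integer."""
    return sum(abs(v - round(v)) for v in values) / len(values)


def rescale(raw_values, called_cn, candidate_cn):
    """Rescale per-site raw CN values from one assumed total CN to another.

    Assumption (not confirmed in this thread): each raw value is
    proportional to the total CN that was used to compute it.
    """
    return [v * candidate_cn / called_cn for v in raw_values]


# Hypothetical per-site values as reported under a total CN call of 4.
raw_at_cn4 = [1.62, 1.58, 2.41, 2.38, 1.61, 2.43]

for candidate in (4, 5):
    scaled = rescale(raw_at_cn4, called_cn=4, candidate_cn=candidate)
    # A smaller misfit means the per-site values sit closer to integers.
    print(candidate, round(integer_misfit(scaled), 3))
```

With these made-up inputs the misfit is much smaller for a candidate CN of 5 than for 4, which is the pattern described for this sample. A real check would use the per-site values from the sample's own output.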

These two samples make me wonder if there is a systematic problem in your samples, e.g. if there is some alignment problem that makes D6/D7 regions lower coverage than other parts of the genome. Are your samples all processed using the same library prep/pipeline? Or is there anything specific about these failing samples? If you plot out Total_CN_raw across a large number of your samples, do you see them falling close to integer values or is there a shift towards lower values? I might be over-thinking but could be good to check.
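As a rough sketch of the cohort-level check suggested above (again not part of Cyrius), one could collect Total_CN_raw for many samples and look at the signed offset from the nearest integer. The values in `cohort` are hypothetical, and `signed_offset` and `summarize_shift` are illustrative names. How the values are extracted from each sample's output is left out.

```python
from statistics import median


def signed_offset(value):
    """Signed distance from the nearest integer (negative = below it)."""
    return value - round(value)


def summarize_shift(total_cn_raw):
    """Summarize whether raw total CN values lean below integer values."""
    offsets = [signed_offset(v) for v in total_cn_raw]
    return {
        "median_offset": median(offsets),
        "fraction_below": sum(o < 0 for o in offsets) / len(offsets),
    }


# Hypothetical Total_CN_raw values, one per sample.
cohort = [3.82, 3.91, 4.85, 3.88, 2.93, 3.79, 4.02, 3.86]
print(summarize_shift(cohort))
```

A clearly negative median offset, with most samples falling below their nearest integer, would point to the kind of systematic shift towards lower values described above. Note that this only detects modest shifts: a value that is off by more than 0.5 wraps around to the neighbouring integer and would look like a positive offset.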

mgonzalezporta commented 2 years ago

Hi Xiao,

Still compiling a larger dataset analysed using DRAGEN 3.7. Will re-check trends there and re-open if still relevant.

Thanks