freeseek / gtc2vcf

Tools to convert Illumina IDAT/BPM/EGT/GTC and Affymetrix CEL/CHP files to VCF
MIT License

CNV and coordinates mapping issue? #64

Open rajwanir opened 5 months ago

rajwanir commented 5 months ago

Hello @freeseek

Is it a known issue that coordinates for CNV loci may come out incorrect, or at least different, with the SourceSeq mapping workflow?

Since I have a large set of chips and samples to analyze, I tried to estimate the accuracy of coordinate inference via SourceSeq mapping. I selected a couple of chips for which I have hg19-based manifests with both a RefStrand column and a SourceSeq column. So I could either generate the VCF in the standard fashion and lift it over to hg38 ("the liftover approach"), or use the SourceSeq column to update the manifests to hg38 so that the resulting VCF is based on hg38 directly ("the freeseek/gtc2vcf plugin approach"). I prefer updating the manifests to hg38, since it would also work in the absence of a RefStrand column and is a somewhat more straightforward solution.
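For readers who want to reproduce this kind of comparison, here is a minimal sketch of how the two coordinate sets could be tallied per marker. It assumes the positions from each approach have already been extracted into dictionaries keyed by marker ID; the function name `compare_coordinates` and the marker IDs in the usage note are hypothetical and not part of gtc2vcf.

```python
from typing import Dict, List, Tuple

# (chromosome, 1-based position) for a marker
Coord = Tuple[str, int]


def compare_coordinates(
    liftover: Dict[str, Coord], remapped: Dict[str, Coord]
) -> Dict[str, List[str]]:
    """Classify each marker by whether the two approaches agree on its position."""
    result: Dict[str, List[str]] = {
        "concordant": [],
        "discordant": [],
        "liftover_only": [],   # lost during SourceSeq remapping
        "remapped_only": [],   # lost during liftover
    }
    for marker in sorted(set(liftover) | set(remapped)):
        a = liftover.get(marker)
        b = remapped.get(marker)
        if a is None:
            result["remapped_only"].append(marker)
        elif b is None:
            result["liftover_only"].append(marker)
        elif a == b:
            result["concordant"].append(marker)
        else:
            result["discordant"].append(marker)
    return result
```

Calling it with, say, `{"rs1": ("chr1", 100), "cnvi1": ("chr2", 500)}` for the liftover set and `{"rs1": ("chr1", 100), "cnvi1": ("chr2", 650)}` for the remapped set would place `rs1` under `concordant` and `cnvi1` under `discordant`.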

Here are the results:

image

I note that the majority of the inconsistencies between the liftover approach and the plugin approach were associated with the CNV loci. Is this a known issue? Do you have any thoughts or suggestions on how to make the CNV coordinates more consistent?

Thanks.

freeseek commented 5 months ago

It really depends on the probes, so without sampling a few examples and seeing what is going on, it is going to be hard to guess. Maybe CNV probes are more likely to land on segmental duplications, where they are more likely to be mismapped, which possibly correlates with a lower concordance between the two approaches.
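One way to act on the suggestion of sampling a few examples is sketched below: split the discordant markers into CNV and non-CNV groups and draw a small random sample from each for manual inspection. This is only an illustration; the function name `sample_discordant` is hypothetical, and how a CNV probe is recognized (here, a caller-supplied predicate) is an assumption that depends on the manifest.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def sample_discordant(
    discordant_ids: Iterable[str],
    is_cnv: Callable[[str], bool],
    n: int = 5,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Draw up to n discordant markers per probe class for manual review."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for marker in discordant_ids:
        groups["cnv" if is_cnv(marker) else "snp"].append(marker)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return {
        label: sorted(rng.sample(ids, min(n, len(ids))))
        for label, ids in groups.items()
    }
```

The predicate could be as simple as `lambda m: m.startswith("cnvi")` if the CNV probes on the chip happen to follow that naming pattern (an assumption to check against the manifest). The sampled probes can then be inspected by hand, for example to see whether their source sequences align to more than one location.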