freeseek / gtc2vcf

Tools to convert Illumina IDAT/BPM/EGT/GTC and Affymetrix CEL/CHP files to VCF
MIT License

Pseudo-autosomal regions (PAR). #72

Open rajwanir opened 1 month ago

rajwanir commented 1 month ago

The pseudo-autosomal regions are often annotated in Illumina's CSV manifest as chromosome XY. gtc2vcf appears to recode them as chromosome X in the output VCF:

https://github.com/freeseek/gtc2vcf/blob/224e7c60b81188342a029ec89f3777537fa7b4f6/gtc2vcf.h#L138-L143

However, the documentation strongly encourages realigning to the reference genome. If Illumina's CSV manifest is used directly, the accuracy of the output depends on the accuracy of that manifest. Sometimes the PAR is not correctly annotated in the CSV manifest, and the SNPs may actually map to unique regions of chromosome Y.

For example, on the GSA chip, 80 or more SNPs are annotated as XY but are actually located in unique regions of chromosome Y.

A few SNPs from the input CSV manifest:

rs10465468,XY,92708060
rs112096861,XY,92541266
rs12401272,XY,3211973
rs185597746,XY,92386542
rs188145685,XY,91773744

The corresponding records in the output VCF:

rs10465468 chrX 92708060
rs112096861 chrX 92541266
rs12401272 chrX 3211973
rs185597746 chrX 92386542
rs188145685 chrX 91773744

However, all these SNPs appear to lie outside the PAR (https://useast.ensembl.org/info/genome/genebuild/human_PARS.html) and in a unique region of chromosome Y (e.g. https://ncbi.nlm.nih.gov/snp/rs10465468). If the realignment workflow is chosen, the SourceSeq maps uniquely to chromosome Y, which corrects the annotation. Note also that if a SNP does lie within the PAR, the realignment workflow will still annotate it as chromosome X, since the PAR is hard-masked on chromosome Y.
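For anyone who wants to screen a manifest for this pattern, here is a minimal sketch that flags markers labelled XY whose position falls outside the PAR intervals on chromosome X. It is not part of gtc2vcf. The function names are hypothetical, the PAR boundaries are an assumption (GRCh38 values, to be checked against the Ensembl page linked above and against the genome build of your manifest), and the rows are the simplified name,chrom,position triples quoted in this issue rather than full manifest lines.

```python
# Hypothetical screening helper; not part of gtc2vcf.
# ASSUMPTION: PAR intervals on chrX in GRCh38, 1-based inclusive.
# Verify these against your manifest's genome build before relying on them.
PAR_X_GRCH38 = [
    (10_001, 2_781_479),         # PAR1
    (155_701_383, 156_030_895),  # PAR2
]


def in_par(pos, intervals=PAR_X_GRCH38):
    """Return True if a chrX position lies inside any PAR interval."""
    return any(start <= pos <= end for start, end in intervals)


def flag_xy_outside_par(rows, intervals=PAR_X_GRCH38):
    """Return names of markers labelled XY that lie outside every PAR interval.

    Each row is assumed to be a 'name,chrom,position' string.
    """
    flagged = []
    for row in rows:
        name, chrom, pos = row.strip().split(",")[:3]
        if chrom == "XY" and not in_par(int(pos), intervals):
            flagged.append(name)
    return flagged


# The five example rows quoted in this issue.
MANIFEST_ROWS = [
    "rs10465468,XY,92708060",
    "rs112096861,XY,92541266",
    "rs12401272,XY,3211973",
    "rs185597746,XY,92386542",
    "rs188145685,XY,91773744",
]

if __name__ == "__main__":
    for name in flag_xy_outside_par(MANIFEST_ROWS):
        print(name)
```

A flagged marker is only a candidate for closer inspection: the check covers the PARs alone, so a marker in another region shared between X and Y would be flagged as well.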

I thought I would write this up here for the benefit of any other user who runs into the same observation.


freeseek commented 1 month ago

Marker rs112096861 does indeed belong to XY, as it is part of the XTR region, which is shared between chromosome X and chromosome Y. However, the other 4 markers do seem to be mislocalized in the Illumina manifest files. The array intensities indicate that they are not tagging a polymorphic variant anyway, but either the localizations in the manifest file or the source sequences are completely wrong.