Illumina / GTCtoVCF

Script to convert GTC/BPM files to VCF
Apache License 2.0

GTC parser error : All GT values in converted vcf are './.' #48

Closed gezthxbio closed 5 years ago

gezthxbio commented 5 years ago

I tried to convert all GTC files generated from an Illumina iScan with the Beeline software to VCF:

gtcpath=203239960044_gtc
outpath=~/P/Iscan/ISCP2
python2.7 /mnt/EdicoNAS/Crepo/GTCtoVCF-1.1.1/gtc_to_vcf.py \
        --gtc-paths $gtcpath \
        --manifest-file ../GSA-24v2-0_A1.csv \
        --genome-fasta-file /mnt/EdicoNAS/Crepo/GTCtoVCF-1.1.1/Reference/hg19.fa \
        --output-vcf-path $outpath

The conversion completed successfully, but every GT value in every VCF is './.', which, as I understand the VCF specification, denotes a no-call for a diploid genotype.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 203239960044_R01C01
1 565433 rs9701055 C T . PASS . GT:GQ ./.:0.0
1 567667 rs9651229 C T . PASS . GT:GQ ./.:0.0
1 568208 rs9701872 T C . PASS . GT:GQ ./.:0.0
1 568527 rs11497407 G A . PASS . GT:GQ ./.:0.0
1 727841 GSA-rs116587930 G A . PASS . GT:GQ ./.:0.0
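A quick way to confirm that every genotype in a converted file is a no-call is to count them. The following is a minimal sketch, not part of GTCtoVCF: the function name `no_call_fraction` is hypothetical, and it assumes whitespace-separated VCF records with GT present in the FORMAT column.

```python
def no_call_fraction(vcf_lines):
    """Return the fraction of sample genotypes that are no-calls (e.g. './.')."""
    total = 0
    missing = 0
    for line in vcf_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split()
        if len(fields) < 10:
            continue  # record has no FORMAT/sample columns
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        gt_index = keys.index("GT")
        for sample in fields[9:]:
            gt = sample.split(":")[gt_index]
            # Treat phased ('|') and unphased ('/') separators alike.
            alleles = gt.replace("|", "/").split("/")
            total += 1
            if all(allele == "." for allele in alleles):
                missing += 1
    return missing / total if total else 0.0
```

Passing an open file handle (for example `no_call_fraction(open("sample.vcf"))`, where the file name is a placeholder) works too, since the function only iterates over lines. A result of 1.0 would match the situation described here.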

For another set of gtc files after conversion, the following GQ were found

  1. 0/2
  2. 0/3
  3. 2/2

How should these genotypes be interpreted?

Please help to resolve this issue

jjzieve commented 5 years ago

Are you referring to the GT (genotype) or GQ (genotype quality score)? The example you posted does appear to contain no calls, which explains the "./." for GT and "0.0" for GQ. If you're asking how to interpret the genotypes, they follow the VCF 4.1 format: https://samtools.github.io/hts-specs/VCFv4.1.pdf. So "0/3" means the first allele is the REF and the second allele is the third allele listed under ALT.
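The indexing rule can be illustrated with a small sketch. This is not code from GTCtoVCF; `decode_genotype` is a hypothetical helper that maps GT indices onto the REF and ALT columns, with index 0 meaning REF and index n meaning the n-th ALT allele.

```python
def decode_genotype(ref, alt, gt):
    """Translate a VCF GT string into allele bases; None marks a no-call."""
    # Index 0 is REF; indices 1..n are the comma-separated ALT alleles.
    alleles = [ref] + (alt.split(",") if alt != "." else [])
    calls = gt.replace("|", "/").split("/")
    return [None if call == "." else alleles[int(call)] for call in calls]


print(decode_genotype("C", "A,G,T", "0/3"))
print(decode_genotype("C", "T", "./."))
```

For REF `C` and ALT `A,G,T`, a GT of `0/3` decodes to one `C` and one `T`, while `./.` decodes to two missing alleles.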

gezthxbio commented 5 years ago

Oh, sorry, that was a typo; I meant GT. Below is a grep result for '0/3' in some of the VCFs:

203239950122_R01C01.vcf:4       69962449        rs12233719.1,rs12233719.2,rs12233719.3,seq-rs12233719,seq-rs12233719.1,seq-rs12233719.2 G       A,C,T   .       PASS    .
203239950122_R02C01.vcf:11      17483038        rs2074317.1,rs2074317.2,rs2074317.3     C       A,G,T   .       PASS    .       GT:GQ   0/3:0.409735
203239950122_R02C02.vcf:2       219679439       seq-rs72551322,seq-rs72551322.1,seq-rs72551322.2        C       A,G,T   .       PASS    .       GT:GQ   0/3:0.0141896
203239950122_R02C02.vcf:11      5248331 rs33980857,rs33980857.1,rs33980857.2,rs33980857.3       A       T,C,G   .       PASS    .       GT:GQ   0/3:0.0
203239950122_R02C02.vcf:11      5248387 rs33994806.1,rs33994806.2,rs33994806.3,seq-rs33994806   G       A,C,T   .       PASS    .       GT:GQ   0/3:0.00943007
203239950122_R02C02.vcf:15      31222767        seq-rs387907280,seq-rs387907280.1,seq-rs387907280.2     G       A,C,T   .       PASS    .       GT:GQ   0/3:0.0736654
203239950122_R03C02.vcf:2       219679439       seq-rs72551322,seq-rs72551322.1,seq-rs72551322.2        C       A,G,T   .       PASS    .       GT:GQ   0/3:0.0
203239950122_R03C02.vcf:4       69962449        rs12233719.1,rs12233719.2,rs12233719.3,seq-rs12233719,seq-rs12233719.1,seq-rs12233719.2 G       A,C,T   .       PASS    .
203239950122_R03C02.vcf:11      5248244 seq-rs33983205,seq-rs33983205.1,seq-rs33983205.2        T       A,C,G   .       PASS    .       GT:GQ   0/3:0.0831836
203239950122_R05C01.vcf:11      17483038        rs2074317.1,rs2074317.2,rs2074317.3     C       A,G,T   .       PASS    .       GT:GQ   0/3:0.409735
203239950122_R05C02.vcf:11      17483038        rs2074317.1,rs2074317.2,rs2074317.3     C       A,G,T   .       PASS    .       GT:GQ   0/3:0.409735
203239950122_R06C01.vcf:11      17483038        rs2074317.1,rs2074317.2,rs2074317.3     C       A,G,T   .       PASS    .       GT:GQ   0/3:0.409735
203239950122_R10C01.vcf:11      17483038        rs2074317.1,rs2074317.2,rs2074317.3     C       A,G,T   .       PASS    .       GT:GQ   0/3:0.409735
203239950122_R11C01.vcf:4       69962449        rs12233719.1,rs12233719.2,rs12233719.3,seq-rs12233719,seq-rs12233719.1,seq-rs12233719.2 G       A,C,T   .       PASS    .
203239950122_R11C02.vcf:2       241808314       rs34116584.1,rs34116584.2,rs34116584.3  C       A,G,T   .       PASS    .       GT:GQ   0/3:0.482029
203239950122_R11C02.vcf:11      17483038        rs2074317.1,rs2074317.2,rs2074317.3     C       A,G,T   .       PASS    .       GT:GQ   0/3:0.409735

I understand that this comes from the array design itself. But my question is: since this is human data run on a GSA chip, is a 0/3 genotype possible? That is the RUN1 data. For a separate run (RUN2), './.' was observed for every entry in every converted VCF.

gezthxbio commented 5 years ago

Upon querying the GSA annotation file and manifest for rsID 'rs2074317', I got three entries, so I guess three assays were designed to cover the three different alternate alleles for this rsID:

rs2074317.1     11      17483038        [A/C]   NM_001287174,NM_000352  ABCC8,ABCC8             Silent,Silent
rs2074317.2     11      17483038        [C/G]   NM_001287174,NM_000352  ABCC8,ABCC8             Silent,Silent
rs2074317.3     11      17483038        [T/C]   NM_001287174,NM_000352  ABCC8,ABCC8             Silent,Silent

So for a GT of '0/3' like the one above, where the REF allele is C, the genotype is C/T: the reference allele plus the third ALT allele.
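How three bi-allelic assays at one position could collapse into a single multi-allelic record can be sketched as follows. This is an illustration only, not the GTCtoVCF implementation: `merge_assays` is a hypothetical name, the ALT order is simply order of first appearance, and strand handling and indels are ignored.

```python
import re


def merge_assays(ref, snp_fields):
    """Collect the non-REF alleles from manifest SNP fields such as '[A/C]'."""
    alts = []
    for field in snp_fields:
        match = re.fullmatch(r"\[([ACGT])/([ACGT])\]", field)
        if match is None:
            raise ValueError("unsupported SNP field: %r" % field)
        for allele in match.groups():
            # Keep each alternate allele once, in order of first appearance.
            if allele != ref and allele not in alts:
                alts.append(allele)
    return alts


print(merge_assays("C", ["[A/C]", "[C/G]", "[T/C]"]))
```

With REF `C` and the three rs2074317 entries, this yields the alternates A, G and T, so index 3 in the GT field refers to T.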

gezthxbio commented 5 years ago

I tried parsing a GTC file with BeadArrayFiles and noticed the 'NC' string in most of the entries. I have resolved the "./." issue: when converting IDAT to GTC with the Beeline software, no cluster file was specified, which resulted in 'NC' strings in the GTC files. Running the conversion again with a cluster file resolved it.