freeseek / gtc2vcf

Tools to convert Illumina IDAT/BPM/EGT/GTC and Affymetrix CEL/CHP files to VCF
MIT License

Genomestudio file to vcf #39

Closed PriscillaPoemba closed 2 years ago

PriscillaPoemba commented 3 years ago

Dear Freeseek,

The conversion from a GenomeStudio file to a VCF file works fine, but a lot of SNPs are missing after this conversion. I looked into this and observed that only the SNPs without any missing values are in the VCF file, but I am not sure about this yet, so I have some questions.

Is it true that the gtc2vcf tool only keeps the complete SNPs, without any missing values, after conversion? Or is there another way to handle them in this tool? And is it right to use -- for missing values in the GenomeStudio file?

Thanks in advance!

freeseek commented 3 years ago

bcftools +gtc2vcf will try to keep as many SNPs as possible. The only SNPs that should be dropped are those that have no localization as the VCF specification does not allow SNPs without chromosomal or position assignment. Can you show an example of a SNP in GenomeStudio that is not converted to VCF?

PriscillaPoemba commented 3 years ago

Yes, the following SNPs are examples:

cnvi37491436 - Y - 27504612
cnvi37517080 - Y - 28786812
cnvi37517006 - Y - 28783112
cnvi37516911 - Y - 28778362
cnvi37516747 - Y - 28770162
rs1002204 - 7 - 87141497
rs10026610 - 4 - 16462250
rs1003590 - X - 48818436

All of these SNPs have a chromosome and position assignment. It is also not the case that every cnvi SNP, or every SNP on chromosome Y or X, is missing after conversion.

freeseek commented 3 years ago

So I believe the issue is that your chromosomes are encoded as Chr# rather than chr# and so they don't match the chromosomes in the fasta file. I had never seen this type of encoding in a GenomeStudio table before. I think Illumina is excessively liberal in the way it encodes chromosomes. Can you download the gtc2vcf development binaries here and see if that fixes your issue? Make sure that you can confirm it is version 2021-08-05 or newer when you run it. You can run a plugin binary with the following syntax: bcftools +$PWD/gtc2vcf.so
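A quick way to check whether chromosome names in a table line up with a reference is to compare them against the contig names in the fasta index (the first tab-separated column of a .fai file). The sketch below is an independent sanity check, not a description of how gtc2vcf matches names internally; the function names `contigs_from_fai` and `unmatched_chromosomes` are made up for illustration.

```python
def contigs_from_fai(fai_text):
    """Return contig names from the text of a .fai index (first column of each line)."""
    return [line.split("\t")[0] for line in fai_text.splitlines() if line]


def unmatched_chromosomes(table_chroms, fasta_contigs):
    """Return the sorted chromosome names that have no exact match among the contigs.

    Assumption: an exact, case-sensitive comparison. The real tool may be more
    lenient (for example about a chr prefix), so treat hits as hints only.
    """
    contigs = set(fasta_contigs)
    return sorted({chrom for chrom in table_chroms if chrom not in contigs})
```

For example, calling `unmatched_chromosomes` with the values of the Chromosome column and the result of `contigs_from_fai` lists the names worth a closer look before conversion.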

PriscillaPoemba commented 3 years ago

The chromosomes are coded without the Chr prefix in the GenomeStudio file, for example Y or 4. I only wrote it that way in my message to make clear that I meant the chromosome of the SNP. I'm sorry for the confusion.

freeseek commented 3 years ago

This is not helpful. Can you show me an example I can use to reproduce the problem on my end? A single line from the GenomeStudio table (including the header line) shall suffice.

PriscillaPoemba commented 3 years ago

Chromosome	Position	IlmnStrand	SNP	Name	Sample1.GType	Sample1.Log R Ratio	Sample1.B Allele Freq	Sample2.GType	Sample2.Log R Ratio	Sample2.B Allele Freq	Sample3.GType	Sample3.Log R Ratio	Sample3.B Allele Freq	Sample4.GType	Sample4.Log R Ratio	Sample4.B Allele Freq
3	183635768	TOP	[A/G]	rs1000002	AB	0.0	0.509009009009	AA	0.0	0.0125	AB	0.5279315556849999	0.734234234234	AB	-0.06871275008399999	0.522522522523
4	23626018	BOT	[T/C]	rs10001239	--	--	--	--	--	--	--	--	--	--	--	--

freeseek commented 3 years ago

So the problem is that the parser fails when trying to read the Log R Ratio (LRR) and the B Allele Freq (BAF) if these are represented as "--", which is more like a string reserved for missing genotypes. There is no good reason to not report LRR and BAF for sites with missing genotypes. I have never seen anything like this but, since there is no specification for GenomeStudio files, I do not know if this should be an allowed GenomeStudio table. Could you explain to me how you generated such a table?

PriscillaPoemba commented 3 years ago

I coded the missing LRR and BAF values as NA instead of -- and more SNPs are present after conversion now, thank you!

We have 3 batches with exactly the same SNPs before conversion, but the number of SNPs after conversion still differs between the batches. Any suggestions as to why this is happening?
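The recoding of missing LRR and BAF values mentioned above can be sketched as follows. This is a minimal illustration, assuming a tab-delimited table whose numeric columns end in ".Log R Ratio" or ".B Allele Freq" as in the example shown in this thread; the function name `recode_missing_numeric` is hypothetical, and the choice of NA as the replacement comes only from the report above that it worked.

```python
MISSING_GENOTYPE = "--"  # missing marker used in the table shown in this thread
MISSING_NUMBER = "NA"    # replacement reported to work in this thread


def recode_missing_numeric(table_text):
    """Replace "--" with "NA" in Log R Ratio and B Allele Freq columns only.

    Genotype (GType) columns are left untouched, so "--" there still means
    a missing genotype. Assumes tab-delimited input with one header line.
    """
    lines = table_text.splitlines()
    header = lines[0].split("\t")
    numeric_cols = {
        i for i, name in enumerate(header)
        if name.endswith(".Log R Ratio") or name.endswith(".B Allele Freq")
    }
    out = [lines[0]]
    for line in lines[1:]:
        fields = line.split("\t")
        fields = [
            MISSING_NUMBER if i in numeric_cols and value == MISSING_GENOTYPE else value
            for i, value in enumerate(fields)
        ]
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"
```

Restricting the replacement to the numeric columns keeps the genotype calls exactly as GenomeStudio exported them.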

freeseek commented 3 years ago

I think you need to find examples of markers that failed to convert. You can use bcftools +gtc2vcf --verbose --genome-studio to get information about why markers drop.

PriscillaPoemba commented 3 years ago

All failed SNPs gave the same message: Failed to process marker cnvi0111205. What does this mean?

freeseek commented 3 years ago

It seems weird that you have multiple markers with the same cnvi0111205 ID in the same GenomeStudio table. That said, it would probably mean that there is something off about the line with that marker. Maybe, again, some of the numbers that should be represented as floats were improperly encoded. Without knowing how you generated the table or without seeing the actual line that is causing the issue, it is impossible to guess what is wrong.
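One way to narrow this down before asking for help is to scan the table for the two things mentioned above: marker IDs that occur more than once, and numeric fields that do not parse as floats. The sketch below is only a rough pre-check under stated assumptions (tab-delimited input, a "Name" column holding the marker ID, numeric columns ending in ".Log R Ratio" or ".B Allele Freq", and NA as the accepted missing value); it does not reproduce the parser in gtc2vcf, and `find_suspect_lines` is a hypothetical name.

```python
from collections import Counter


def find_suspect_lines(table_text, missing=("NA",)):
    """Return (problems, duplicates) for a tab-delimited GenomeStudio-style table.

    problems:   list of (line_number, message), with the header as line 1
    duplicates: sorted marker names that occur on more than one line
    """
    lines = table_text.splitlines()
    header = lines[0].split("\t")
    name_col = header.index("Name")
    numeric_cols = [
        i for i, name in enumerate(header)
        if name.endswith(".Log R Ratio") or name.endswith(".B Allele Freq")
    ]
    problems = []
    names = Counter()
    for lineno, line in enumerate(lines[1:], start=2):
        fields = line.split("\t")
        if len(fields) != len(header):
            problems.append((lineno, "wrong number of fields"))
            continue
        names[fields[name_col]] += 1
        for i in numeric_cols:
            value = fields[i]
            if value in missing:
                continue
            try:
                float(value)
            except ValueError:
                problems.append((lineno, f"{header[i]}: not a number: {value!r}"))
    duplicates = sorted(name for name, count in names.items() if count > 1)
    return problems, duplicates
```

Any line reported here, together with the header line, is the kind of minimal example that makes the problem reproducible.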