Illumina / BeadArrayFiles

Python library to parse file formats related to Illumina bead arrays

Weird Basecall result #1

Closed yyna closed 7 years ago

yyna commented 7 years ago

I am trying to convert .gtc data to .txt with the reference allele and alternative allele, so I am using GenotypeCalls and code2genotype. I got some weird results.

For example, the last snp of GSAMD-24v1-0_20011747_A1 is

index : 700078 chromosome : X position : 99912338 SNP : [T/A] strand : +

When I convert my .gtc file with this SNP, the resulting genotype call was BB and the base call was TT. It should be AA, right? I get a lot of this kind of strange output. The saddest part is that it works for some other rows... I have no idea why.

KelleyRyanM commented 7 years ago

Hi Jungin, Let me see if I can help, but first let me make sure I understand the issue. When you say the basecall is TT, I assume you mean the basecall as provided by the "get_base_calls" function of the "GenotypeCalls" class. This will return the nucleotide allele relative to the TOP strand (as opposed to the design, plus or forward strand). From your description, the locus of interest is "X:99912338-T-A", which is designed on the BOT strand.

So, if the genotype call is BB, the nucleotide allele on the design strand would be AA. Since the design is on the BOT strand, we would need to complement to get the call on the TOP strand, with a final nucleotide call on the TOP strand of TT. Therefore, I believe the expected output is produced in this case.
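To make the steps above concrete, here is a minimal sketch of the same reasoning in plain Python. The helper names `design_strand_call` and `complement` are hypothetical and are not part of this library. The sketch assumes the SNP string from the manifest lists the A allele first and the B allele second, and that both alleles are single nucleotides.

```python
# Hypothetical helpers for illustration only; not part of BeadArrayFiles.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def design_strand_call(snp, ab_call):
    """Translate an AB genotype (e.g. "BB") into nucleotides on the design strand.

    Assumes snp looks like "[T/A]": first allele is A, second allele is B.
    """
    allele_a, allele_b = snp.strip("[]").split("/")
    lookup = {"A": allele_a, "B": allele_b}
    return "".join(lookup[code] for code in ab_call)


def complement(call):
    """Complement each base in a nucleotide call."""
    return "".join(COMPLEMENT[base] for base in call)


# The locus from the question: SNP [T/A], genotype call BB.
design = design_strand_call("[T/A]", "BB")  # the B allele is "A"
top = complement(design)  # design is on BOT, so complement to reach TOP
print(design, top)
```

For this locus the design-strand call is AA and the TOP-strand call is TT, which matches what the library reported.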

One potential source of confusion here is the distinction between the TOP strand and the plus strand. TOP/BOT is an Illumina convention for reporting strand (see http://www.illumina.com/documents/products/technotes/technote_topbot.pdf). The plus strand is the strand relative to the reference genome.

If you prefer to have the call on the plus strand, unfortunately there is not a built-in convenience function in this library (yet). However, if you've already read in the plus/minus strand orientation, the logic should be fairly straightforward. You would combine the SNP information with the AB genotype call to get the nucleotide allele on the design strand, and then complement as necessary if the SNP is designed on the minus strand.
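A rough sketch of that plus-strand logic might look like the following. The function `plus_strand_call` is a hypothetical helper, not something this library provides. It assumes you already have, for each locus, the SNP string (A allele first, B allele second), the AB genotype call as a string, and the strand as "+" or "-", and that the alleles are single nucleotides.

```python
# Hypothetical helper for illustration only; not part of BeadArrayFiles.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def plus_strand_call(snp, ab_call, strand):
    """Return the nucleotide call on the plus strand, or None if not callable.

    snp     -- SNP string such as "[T/A]" (assumed: A allele first, B second)
    ab_call -- genotype call made of A/B codes, e.g. "AA", "AB", "BB"
    strand  -- "+" or "-" orientation of the design relative to the reference
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-', got %r" % (strand,))

    allele_a, allele_b = snp.strip("[]").split("/")
    lookup = {"A": allele_a, "B": allele_b}

    # Anything that is not purely A/B codes (e.g. a no-call) cannot be translated.
    if any(code not in lookup for code in ab_call):
        return None

    # Nucleotides on the design strand.
    bases = [lookup[code] for code in ab_call]

    # Complement only when the design is on the minus strand.
    if strand == "-":
        bases = [COMPLEMENT[base] for base in bases]
    return "".join(bases)


# The locus from the question: SNP [T/A], strand +, genotype call BB.
print(plus_strand_call("[T/A]", "BB", "+"))
```

With the strand given as "+", no complement is applied, so the BB call for this locus comes out as AA on the plus strand.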

Does that help answer your question?