freeseek / gtc2vcf

Tools to convert Illumina IDAT/BPM/EGT/GTC and Affymetrix CEL/CHP files to VCF
MIT License
138 stars 24 forks source link

Drop marker if SourceSeq maps to different loci equally well. #74

Open rajwanir opened 1 month ago

rajwanir commented 1 month ago

I see that typically gtc2vcf does a fantastic job in infering the coordinates with the SourceSeq mapping workflow. However, I note that in some cases when SourceSeq maps to different loci equally well, one of the mapping gets picked up by gtc2vcf instead of dropping the marker.

Here is an example for rs10435524 which maps to chr8:30404042. The SourceSeq is below:

TTCACTGGATACAAAAATGCTTAGCAGATAATTTCTGGGGGTGTCATTCTTGAGTTTATC[A/G]GACACCGTGAAGTGTGTTGCTTTTTGTGTGTTAGGTGCTTGCTATATTTTTCTGGCTATT

The bwa alignment of the SourcSeq looks like this:

rs10435524-138_B_R_2276283903:1 0 chr8 30403982 0 121M 0 0 TTCACTGGATACAAAAATGCTTAGCAGATAATTTCTGGGGGTGTCATTCTTGAGTTTATCAGACACCGTGAAGTGTGTTGCTTTTTGTGTGTTAGGTGCTTGCTATATTTTTCTGGCTATT NM:i:1MD:Z:60G60 AS:i:116 XS:i:116 XA:Z:chrX,+37095221,121M,1; rs10435524-138_B_R_2276283903:2 0 chrX 37095221 0 121M 0 0 TTCACTGGATACAAAAATGCTTAGCAGATAATTTCTGGGGGTGTCATTCTTGAGTTTATCGGACACCGTGAAGTGTGTTGCTTTTTGTGTGTTAGGTGCTTGCTATATTTTTCTGGCTATT NM:i:0MD:Z:121 AS:i:121 XS:i:121 XA:Z:chr8,+30403982,121M,0;

The gtc2vcf outputs the mapping as chrX:37095281. The SourceSeq in this case seems not uniquely mapping to the reference and I think can be filtered out. I see that something along these lines is already implemented in gtc2vcf, however, I still see markers like this slip through. Do you know why?

I have additional examples like this. Overall a small percentage of all markers of course but still appears inaccurate and wanted to see if we can fix it.

Thanks.

freeseek commented 1 month ago

The SourceSeq mapping workflow in BCFtools/gtc2vcf will align the two alleles with the flanking sequences and then select the alignment that best matches the reference. In this case, the alignment score (AS) is 116 for the first alignment and 121 for the second alignment so the second alignment is preferred. On average this is the best strategy but in this case you get an alignment that differs from what dbSNP selected. In this specific scenario, it seems like in GRCh38 chromosome 8 sequence was incorrectly inserted on chromosome X and this was eventually fixed with GRCh38 patch chrX_ML143385v1_fix. But it is not incorrect to say that rs10435524 maps to chrX:37095281 in GRCh38. The problem is that GRCh38 contains duplicate sequence on chromosome X derived from fosmid clone ABC8-42664800J17/FO681500.1 which should not have been there in the first place. There is no way the SourceSeq mapping workflow can take all of this into account

rajwanir commented 1 month ago

Thanks @freeseek . Quite an edge case! On a GSA chip, I see over a hundred markers end up like this.

One solution I came across is to create a second alignment mapping with bwa fastmap -l 121 and creating a list of qnamesthat has more than one identical mapping across reference. Subsequently, setting alignments with these qnames as unmap in the standard bwa alignment sam file before providing it to gtc2vcf. This essentially improves the overall accuracy and consistency with dbSNP since such duplicated regions would be filtered out in VCF. However, it does leads a little more skipped markers. An alternate soultion could be that instead of filtering out at gtc2vcf, I could add a INFO/tag for multipe-sourceseq-mapping as annotation to take care of it later.

freeseek commented 1 month ago

Not all flanking sequences have the same length so using -l 121 would not be appropriate. Providing the alternate mapping as an INFO field might be of value but this would require a large overhaul of the code as the code to convert a manifest file to VCF and the code to update a manifest file using a SAM/BAM file are two separate black boxes. Furthermore, the goal here is not consistency with dbSNP. There are many examples of markers that are incorrectly localized by dbSNP (see Genovese et al. 2012 for example) and this is accepted as dbSNP uses essentially a similar strategy to assign marker location. Furthermore, dbSNP is plagued by other issues such as this so I would not rely on dbSNP coordinates