Closed rajwanir closed 3 months ago
If you look at the flanking sequences for rs367866237, rs370944521, rs369647419, and rs370821352 on dbSNP you get:
CTTGGGGTTTATACAAGATGCCTGCGGAATGTTGGATATTACATTTGTGA
CAAAATTGTTGGAATTGTGAGCTGGCATGCACTGGACCATTATCAGCTTA
[A/C]
ATTTTTGTGGGCCACCCCAAAAACGCAATAGTTAGAAGAAGATGTTTAAT
GACATCAAATGGACTCTTCAGGAAGAGTATGTGTACTAATTAGGTAAGTG
AGCACTGAAGTTACTTGGCTGGTAGTATGGAATTTCACTCCTGGCTTGCA
GAGCCTATCACAGAGTTTGTGTCTCTGCTGGAAATAGAGACGTTAATCAC
[C/T]
CGCCAAGCATGGTGCTCTTCGGCTAGCACAGAGCAGCTCTCAGGCACTGA
AAAGTATGTGCTTTGGTTTCTTTTGTTACACAGATTGCTCCCTTGTTGTG
GCGGCTTCGACCCTATATCCCCCGCCCGCGTCCCTTTCTCCATAAAATTC
TTCTTAGTAGCTATTACCTTCTTATTATTTGATCTAGAAATTGCCCTCCT
[T/C]
TTACCCCTACCATGAGCCCTACAAACAACTAACCTGCCACTAATAGTTAT
GTCATCCCTCTTATTAATCATCATCCTAGCCCTAAGTCTGGCCTATGAGT
TAAAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTTA
GCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCC
[C/T]
CTGGTCAACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCT
AGCCGTTTACTCAATCCTCTGATCAGGGTGAGCATCAAACTCAAACTACG
and so you can see that they don't match what is in the Illumina manifest file. Looking at 1000 Genomes high coverage data, it seems like dbSNP has the correct coordinates for the polymorphic variant. I don't have a guess about what the source of the problem here is. GRCh37 and GRCh38 here do not differ at the loci. The flanking sequences from Illumina's manifest file are missing a base pair. Because of that, bwa makes an almost random guess at where the polymorphic base will align and as a result it does not recover the correct base pair for the true polymorphism. But there is no way to make the correct guess with the incorrect flanking sequences provided. The only pragmatic solution here is to fix the manifest file flanking sequences
Thank you @freeseek . It is quite strange observation that exactly one nucleotide adjacent to the SNP is missing in these chrY and chrM SourceSeq. Certainly, no way it can be fixed on the analysis side.
Thank you for looking into it.
Hi @freeseek
Using the SourceSeq mapping workflow, I note that a couple SNPs gets shifted by -1 (relative to dbSNP and orignal Illumina csv_manifest). Any guess on why that might be? Would that be another dbSNP and genome build inconsistency as explained in #74
Here is a list of example SNPs shifted by -1 in gtc2vcf SourceSeq mapping workflow:
The bwa alignment for rs367866237 looks like this:
Thank you.