Multiple dbSNP ids - Githubissues

jdelafon commented 9 years ago

Many variant entries in the database (v0.16.3) have multiple dbSNP IDs. In most cases only one is valid and the others "were merged into [the former]", and some may even be plain wrong (?). Example:

sqlite> select chrom,start,ref,alt,gene,rs_ids from variants where start='866510';
chr1|866510|C|CCCCT|SAMD11|rs375757231,rs386352943,rs386417774,rs60722469

http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs375757231
http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs386352943 -> rs375757231
http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs386417774 -> rs375757231
http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs60722469

The last one does not seem right either. Based on the dbSNP links above:

rs375757231 : GRCh37.p13 105 1 866511:866512 NT_004350.19
rs60722469 : GRCh37.p13 105 1 866526:866527 NT_004350.19

Am I missing something? Is it intended to keep older IDs (maybe they could go in a separate synonym_rs_ids column)? Ideally, one mutation should correspond to one ID; otherwise dbSNP should revise its ID system. If I could safely assume that the first ID in the list is the only one I should consider, that would be fine for me, but that is not the case.
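To get a feel for how many entries are affected, something like the following could list every variant carrying more than one rs ID. This is only a rough sketch: it assumes the `variants` table has the columns used in the query above, and it runs against a toy in-memory table rather than a real Gemini database. The second row (`GENE_X`, `rs1`) is made up for illustration.

```python
import sqlite3

def multi_rs_variants(conn):
    """Return (chrom, start, ref, alt, [rs ids]) for rows with more than one rs ID."""
    rows = conn.execute(
        "SELECT chrom, start, ref, alt, rs_ids FROM variants "
        "WHERE rs_ids LIKE '%,%'"  # a comma means at least two IDs
    ).fetchall()
    return [(c, s, r, a, ids.split(",")) for c, s, r, a, ids in rows]

# Toy stand-in for a Gemini database (assumed schema, only the columns used here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants "
    "(chrom TEXT, start INTEGER, ref TEXT, alt TEXT, gene TEXT, rs_ids TEXT)"
)
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)",
    [
        # Row taken from the query output above.
        ("chr1", 866510, "C", "CCCCT", "SAMD11",
         "rs375757231,rs386352943,rs386417774,rs60722469"),
        # Hypothetical single-ID row, for contrast.
        ("chr1", 1000, "A", "G", "GENE_X", "rs1"),
    ],
)

for chrom, start, ref, alt, ids in multi_rs_variants(conn):
    print(chrom, start, ref, alt, len(ids), "ids")
```

With a real database, `sqlite3.connect()` would point at the `.db` file instead of `:memory:`.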

brentp commented 9 years ago

hi @muraveill , this is indeed confusing. Here is the entry for 866511 (1-based) for the normalized dbsnp:

$ zgrep -wm4 866511 dbsnp.b141.20140813.hg19.tidy.vcf.gz
1   866511  rs375757231 C   CCCCT   .   .   RS=375757231;RSPOS=866511;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000080015000002000200;WGT=1;VC=DIV;INT;OTH;ASP;OTHERKG
1   866511  rs386352943 C   CCCCT   .   .   RS=386352943;RSPOS=866511;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000080015000002000200;WGT=1;VC=DIV;INT;OTH;ASP;OTHERKG
1   866511  rs386417774 C   CCCCT   .   .   RS=386417774;RSPOS=866511;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x05000008001500000c000200;WGT=1;VC=DIV;INT;OTH;ASP;KGPilot123;KGPROD
1   866511  rs60722469  C   CCCCT   .   .   RS=60722469;RSPOS=866526;dbSNPBuildID=129;SSR=0;SAO=0;VP=0x050100080005000102000200;WGT=1;VC=DIV;SLO;INT;ASP;GNO;OTHERKG;OLD_VARIANT=1:866514:C/CTCCC

So, the version that we are using has the 2 old variants that have been collapsed into rs375757231. vt normalize (http://www.ncbi.nlm.nih.gov/pubmed/25701572) also changes the position of the last rs ID to 866511, so it is actually the same variant.

We think that requiring normalization, and normalizing our annotation resources is the best way to avoid false negatives, but it does introduce cases such as this.
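To illustrate why normalization can make two differently positioned records collapse onto one, here is a minimal sketch of left-alignment and trimming for a single biallelic variant. It is not vt's code. The reference string `TOY_REF` is hypothetical and was chosen only so that the shift mirrors the one above (an insertion written three bases further right ends up as C/CCCCT), and the function gives up at the start of the reference instead of padding on the right.

```python
def normalize(reference, pos, ref, alt):
    """Left-align and trim one biallelic variant; pos is 1-based on `reference`."""
    changed = True
    while changed:
        changed = False
        # Drop a trailing base shared by both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of the reference start")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

# Hypothetical toy reference: positions 2-5 are a run of C.
TOY_REF = "ACCCCG"

# Same insertion written at two different positions.
print(normalize(TOY_REF, 5, "C", "CTCCC"))
print(normalize(TOY_REF, 2, "C", "CCCCT"))
```

Both calls return the same position and alleles, which is why two records that looked different before normalization end up attached to one variant afterwards.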

delafont commented 9 years ago

Indeed from this annotation there is no way to know which is the real, current one...

First, the OLD_VARIANT/dbSNPBuildID flag could be used to filter the last one out. I don't think it can be useful to anybody using a Gemini db if the information that it comes from an older build is no longer there.
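A rough sketch of that first filter, assuming the normalized VCF keeps the INFO keys shown above (`OLD_VARIANT`, `RSPOS`). The helper names are made up, and the two sample records are abbreviated from the zgrep output (most INFO fields dropped, whitespace-separated columns).

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; flag-style keys map to True."""
    out = {}
    for field in info.split(";"):
        key, sep, value = field.partition("=")
        out[key] = value if sep else True
    return out

def is_repositioned(line):
    """True if the record carries OLD_VARIANT or its RSPOS differs from POS."""
    cols = line.split()
    pos, info = int(cols[1]), parse_info(cols[7])
    return "OLD_VARIANT" in info or int(info.get("RSPOS", pos)) != pos

# Abbreviated records from the normalized dbSNP VCF quoted above.
records = [
    "1 866511 rs375757231 C CCCCT . . RS=375757231;RSPOS=866511;dbSNPBuildID=138;VC=DIV",
    "1 866511 rs60722469 C CCCCT . . "
    "RS=60722469;RSPOS=866526;dbSNPBuildID=129;VC=DIV;OLD_VARIANT=1:866514:C/CTCCC",
]

flagged = [r.split()[2] for r in records if is_repositioned(r)]
print(flagged)
```

This only catches IDs that normalization moved. It says nothing about IDs that dbSNP merged, which still look identical in the VCF.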

Second, if I had to do it (and I actually will, if it is not fixed), I'd take every ID in the vcf, query their API (the urls above), check the content of the response to see if it is one of these "merged variant" pages, and flag variants in my dbsnp vcf accordingly. Only once, to generate the file that users download. Then run Gemini with this preprocessed dbsnp source, creating a column "rs_id" for the real one, and "synonym_rs_ids" for the useless ones - just in case someone really wants to query variants given a list of outdated rs_ids. I could submit a pull request in that direction if it makes any sense to you as well.
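As a sketch of the last step only: once a map of merged IDs exists, splitting the list into the two proposed columns is straightforward. Building that map from dbSNP is not shown here. The `merged_into` dict below is filled in by hand from the "-> rs375757231" notes in the links above, and the function name is hypothetical. Chains of merges (a target that was itself merged later) are not followed.

```python
def split_rs_ids(rs_ids, merged_into):
    """Split rs IDs into (current IDs, synonym IDs) using an old -> current map."""
    current, synonyms = [], []
    for rs in rs_ids:
        target = merged_into.get(rs)
        if target is None:
            # Not known to be merged: treat as a current ID.
            if rs not in current:
                current.append(rs)
        else:
            # Merged ID: keep it as a synonym and make sure its target is listed.
            synonyms.append(rs)
            if target not in current:
                current.append(target)
    return current, synonyms

# Hand-written merge map, taken from the dbSNP pages linked in this thread.
merged_into = {
    "rs386352943": "rs375757231",
    "rs386417774": "rs375757231",
}

rs_ids = "rs375757231,rs386352943,rs386417774,rs60722469".split(",")
rs_id, synonym_rs_ids = split_rs_ids(rs_ids, merged_into)
print(",".join(rs_id), "|", ",".join(synonym_rs_ids))
```

For this variant the merge map alone still leaves two candidates in `rs_id`, since rs60722469 was never merged, so it would have to be combined with some position-based check to end up with a single ID.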

arq5x / gemini

Multiple dbSNP ids #559