USF-HII / snptk

USF HII SNP Toolkit - Analyze and translate SNP entries using NCBI dbSNP and related databases
GNU General Public License v3.0
0 stars 1 forks source link

Add functionality to look for reactivated snps in SnpHistory #3

Closed j2moreno closed 4 years ago

j2moreno commented 4 years ago

We are not checking for reactivated snps inside of SnpHistory. In consequence we are throwing away perfectly good snps that are still active when using snptk.

3039851 2002-01-05 23:44:00.0   2013-10-17 21:26:00.0   2013-11-18 14:52:00.0   SNP-6191| reactivated at Oct 27 2014  3:22PM|deleted on Oct 27 2014  3:25PM| reactivated at Oct 27 2014  3:25PM|deleted on Oct 27 2014  3:26PM| reactivated at Oct 27 2014  3:26PM|deleted on Dec 17 2014  4:44PM| reactivated at Dec 17 2014  4:44PM   2014-12-17 16:44:00.0
9274168 2003-10-31 16:50:00.0   2012-10-01 14:42:00.0   2013-01-09 20:25:00.0   | reactivated at Jan 11 2013 10:04PM    2013-01-11 22:04:00.0
9274426 2003-10-31 16:50:00.0   2014-08-08 13:17:00.0   2014-08-08 13:29:00.0   | reactivated at Jun 11 2015  2:36PM    2015-06-11 14:37:00.0
10524523    2003-12-05 18:44:00.0   2012-11-28 08:02:00.0   2013-10-17 13:38:00.0   SNP_6119| reactivated at Oct 22 2014  3:09PM    2014-10-22 15:10:00.0
10590816    2003-12-05 18:44:00.0   2011-07-08 11:27:00.0   2013-10-17 13:38:00.0   SNP_6119| reactivated at Jul 17 2014  5:23PM    2014-07-17 17:24:00.0
17878362    2004-12-02 16:04:00.0   2013-10-17 21:25:00.0   2013-11-18 14:52:00.0   SNP-6191| reactivated at Jul 20 2014 11:39PM    2014-07-20 23:40:00.0
28929471    2005-05-24 16:21:00.0   2012-07-25 09:46:00.0   2012-08-30 16:31:00.0   | reactivated at Aug 30 2012  4:31PM    2012-08-30 16:31:00.0
28931568    2005-05-24 16:21:00.0   2012-07-25 09:46:00.0   2012-08-30 16:31:00.0   | reactivated at Aug 30 2012  4:31PM    2012-08-30 16:31:00.0
28931577    2005-05-24 16:22:00.0   2012-09-29 12:41:00.0   2012-11-29 20:30:00.0   | reactivated at Nov 29 2012  8:29PM|deleted on Jan  4 2013  9:21PM| reactivated at Jan  4 2013  9:21PM 2013-01-04 21:21:00.0
28931609    2005-05-24 16:22:00.0   2012-07-25 09:46:00.0   2012-08-30 16:31:00.0   | reactivated at Aug 30 2012  4:31PM    2012-08-30 16:31:00.0

Example from gwas QC pipeline, snp rs80358238 is being deleted even though it has been reactiavted and double checked on the ncbi dbsnp site that this snp is active.

$ zgrep 80358238 /shares/hii/bioinfo/ref/ncbi/human_9606/current/SNPHistory.bcp.gz.d/*
/shares/hii/bioinfo/ref/ncbi/human_9606/current/SNPHistory.bcp.gz.d/20:80358238 2010-02-04 13:10:00.0   2012-09-29 12:41:00.0   2013-04-05 10:54:00.0   | reactivated at Apr  5 2013 10:54AM    2013-04-05 10:55:00.0
countdigi commented 4 years ago

The primary issue is that we discovered there is a second form of the phrase Re-activated which does not contain a hyphen (reactivated) in the SNPHistory.bcp.gz file.

The SNPHistory logic was derived from the UM example of LiftRsNumber.py which only searches for Re-activ. Here is an example of code in snptk which does similar albeit case-insensitive:

https://github.com/USF-HII/snptk/blob/3cbff549f82033353919e45379533791f03f51a6/snptk/core.py#L111-L113

Exploring this further we wrote a small script which categorizes all the different variations of the strings mentioning reactivation and deletion.

#!/usr/bin/env python3                                    

import gzip                                               
import re                                                 
import sys                                                

def main(argv):                                           
    db = {}                                               

    snphist = sys.argv[1]                                 

    with gzip.open(snphist, 'rt', encoding='utf-8') as f: 
        for line in f:                                    
            line = line.rstrip("\n")    
            m = re.findall(r"(re-?activ\w+|del\w+)", line, flags=re.IGNORECASE)
            if m:                                         
                k = ":".join(m)                           
                db.setdefault(k, []).append(line)         

    print()                                               
    print("Unique activate/delete combinations/count:")   
    for k in sorted(db):                                  
        print(f"{k}: {len(db[k])}")   
$ python3 tmp/t.py SNPHistory.bcp.gz

Unique activate/delete combinations/count:
Re-activated: 1658
reactivated: 9597
reactivated:deleted: 179
reactivated:deleted:reactivated: 5335
reactivated:deleted:reactivated:deleted: 1408
reactivated:deleted:reactivated:deleted:reactivated: 2263
reactivated:deleted:reactivated:deleted:reactivated:deleted: 35
reactivated:deleted:reactivated:deleted:reactivated:deleted:reactivated: 859
reactivated:deleted:reactivated:deleted:reactivated:deleted:reactivated:deleted: 14
reactivated:reactivated: 1
countdigi commented 4 years ago

We investigated the meaning/accuracy of the deleted phrase found in some of the SNPHistory entries.

For a randomly selected entry with reactivated:deleted, e.g:

281865249       2012-11-23 12:32:00.0   2012-11-23 13:36:00.0   2012-11-26 18:28:00.0   | reactivated at Nov 26 2012  6:27PM|deleted on Dec 14 2012  2:04PM

The SNP is indeed deleted with no entry found at:

However selecting another:

63750751        2008-06-09 12:18:00.0   2012-10-04 10:41:00.0   2012-10-10 12:52:00.0   | reactivated at Oct 10 2012 12:51PM|deleted on Nov 15 2012  3:05PM     

We find that the SNP exists in dbSNP contradicting what we would conclude reactivated|deleted means:

j2moreno commented 4 years ago

We will no longer be using snphistory for snptk logic. Closing issue for now.