Closed j2moreno closed 4 years ago
The primary issue is that we discovered a second form of the phrase Re-activated, one that does not contain a hyphen (reactivated), in the SNPHistory.bcp.gz file.
The SNPHistory logic was derived from the UM example of LiftRsNumber.py, which only searches for Re-activ. Here is an example of code in snptk which does something similar, albeit case-insensitively:
Exploring this further, we wrote a small script that categorizes all the different variations of the strings mentioning reactivation and deletion.
#!/usr/bin/env python3
import gzip
import re
import sys


def main(argv):
    # Map each colon-joined sequence of matched words to the lines containing it.
    db = {}
    snphist = argv[1]
    with gzip.open(snphist, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.rstrip("\n")
            # Match either spelling of "reactivated" as well as "deleted", ignoring case.
            m = re.findall(r"(re-?activ\w+|del\w+)", line, flags=re.IGNORECASE)
            if m:
                k = ":".join(m)
                db.setdefault(k, []).append(line)
    print()
    print("Unique activate/delete combinations/count:")
    for k in sorted(db):
        print(f"{k}: {len(db[k])}")


if __name__ == "__main__":
    main(sys.argv)
$ python3 tmp/t.py SNPHistory.bcp.gz
Unique activate/delete combinations/count:
Re-activated: 1658
reactivated: 9597
reactivated:deleted: 179
reactivated:deleted:reactivated: 5335
reactivated:deleted:reactivated:deleted: 1408
reactivated:deleted:reactivated:deleted:reactivated: 2263
reactivated:deleted:reactivated:deleted:reactivated:deleted: 35
reactivated:deleted:reactivated:deleted:reactivated:deleted:reactivated: 859
reactivated:deleted:reactivated:deleted:reactivated:deleted:reactivated:deleted: 14
reactivated:reactivated: 1
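The counts above show that a hyphen-only search covers only a small share of the entries. As a minimal sketch of the difference, the following compares a case-sensitive hyphen-only pattern with one that accepts both spellings. The names HYPHEN_ONLY, EITHER_FORM, and matches are hypothetical and are not copied from snptk or LiftRsNumber.py; the second sample string is made up for illustration.

```python
import re

# Hypothetical pattern names; neither is taken from snptk or LiftRsNumber.py.
HYPHEN_ONLY = re.compile(r"Re-activ")
EITHER_FORM = re.compile(r"re-?activ", flags=re.IGNORECASE)

comments = [
    # Comment field from an SNPHistory entry quoted in this issue.
    "reactivated at Nov 26 2012 6:27PM|deleted on Dec 14 2012 2:04PM",
    # Made-up example of the hyphenated spelling.
    "Re-activated",
]


def matches(pattern, text):
    """Return True if the pattern occurs anywhere in the text."""
    return pattern.search(text) is not None


for c in comments:
    print(f"{c!r}: hyphen-only={matches(HYPHEN_ONLY, c)}, "
          f"either={matches(EITHER_FORM, c)}")
```

The hyphen-only pattern misses the first comment entirely, while the optional-hyphen, case-insensitive pattern catches both.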
We investigated the meaning/accuracy of the deleted phrase found in some of the SNPHistory entries.
For a randomly selected entry with reactivated:deleted, e.g.:
281865249 2012-11-23 12:32:00.0 2012-11-23 13:36:00.0 2012-11-26 18:28:00.0 | reactivated at Nov 26 2012 6:27PM|deleted on Dec 14 2012 2:04PM
The SNP is indeed deleted with no entry found at:
However, selecting another:
63750751 2008-06-09 12:18:00.0 2012-10-04 10:41:00.0 2012-10-10 12:52:00.0 | reactivated at Oct 10 2012 12:51PM|deleted on Nov 15 2012 3:05PM
We find that the SNP exists in dbSNP, contradicting what we would conclude reactivated|deleted means:
We will no longer be using snphistory for snptk logic. Closing issue for now.
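For reference, the naive reading that this investigation rules out can be sketched as follows: take the last event word in an entry as the current state of the SNP. The helper name last_event is hypothetical and is not part of snptk. For the second entry quoted above it reports deleted, even though the SNP was found in dbSNP, which is why this shortcut is not safe.

```python
import re

# Same pattern as the categorization script: either spelling of
# "reactivated", or "deleted", case-insensitive.
EVENT = re.compile(r"(re-?activ\w+|del\w+)", flags=re.IGNORECASE)


def last_event(entry):
    """Hypothetical helper: classify an entry by its final event word.

    Returns "deleted", "reactivated", or None when no event word is present.
    """
    events = EVENT.findall(entry)
    if not events:
        return None
    if events[-1].lower().startswith("del"):
        return "deleted"
    return "reactivated"


entry = ("63750751 2008-06-09 12:18:00.0 2012-10-04 10:41:00.0 "
         "2012-10-10 12:52:00.0 | reactivated at Oct 10 2012 12:51PM"
         "|deleted on Nov 15 2012 3:05PM")
print(last_event(entry))
```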
We are not checking for reactivated SNPs inside of SNPHistory. As a consequence, we are throwing away perfectly good SNPs that are still active when using snptk.
Example from the GWAS QC pipeline: SNP rs80358238 is being deleted even though it has been reactivated, and we double-checked on the NCBI dbSNP site that this SNP is active.