USF-HII / snptk

USF HII SNP Toolkit - Analyze and translate SNP entries using NCBI dbSNP and related databases
GNU General Public License v3.0

Entries with AltOnly in SNPChrPosOnRef added to delete list #17

Open countdigi opened 4 years ago

countdigi commented 4 years ago

Although there is code to load AltOnly entries (AltOnly in second position field) from SNPChrPosOnRef into the dbsnp dict here:

https://github.com/USF-HII/snptk/blob/521033ff24acf681149a3b9312bccdcba6614098/snptk/core.py#L148-L149

These entries are not being populated in the dbsnp dict and are therefore scheduled for deletion.

The first Reference SNP ID we noticed this for was rs2517878:

$ zgrep -m 1 -w 2517878 /shares/hii/bioinfo/ref/ncbi/human_9606_b151_GRCh37p13/b151_SNPChrPosOnRef_105.bcp.gz.d/00
2517878 AltOnly

Explanation for the observed behavior:

  1. We recently modified SNPTk to no longer use the SNPHistory file. Instead, because we load every Reference SNP (before and after merging) into the dbsnp dict, a Reference SNP ID is scheduled for deletion when it is not present in that dict.
  2. AltOnly entries have blank tab-separated fields after the second field, and the strip() call removes the trailing tabs, leaving only 2 fields: https://github.com/USF-HII/snptk/blob/521033ff24acf681149a3b9312bccdcba6614098/snptk/core.py#L138
  3. A few lines below, we test for at least 3 fields, and since the trailing fields were stripped we never add the Reference SNP ID to the dbsnp dict. We also check that the third field is not blank, and AltOnly entries would fail that check as well because their third field is empty: https://github.com/USF-HII/snptk/blob/521033ff24acf681149a3b9312bccdcba6614098/snptk/core.py#L142-L143
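
The stripping behavior described in steps 2 and 3 can be reproduced in isolation. This is a minimal sketch, not the code in core.py: the raw line (in particular the number of trailing empty fields), the function names, and the values stored in the dict are all assumptions made for illustration. It shows why strip() leaves only 2 fields, and one possible way to keep the blank fields by removing only the line terminator.

```python
# Hypothetical raw line from SNPChrPosOnRef; the number of trailing
# empty fields is an assumption made for illustration.
RAW_LINE = "2517878\tAltOnly\t\t\n"


def fields_with_strip(line):
    """Mimic the described behavior: strip() also removes trailing tabs."""
    return line.strip().split("\t")


def fields_keeping_blanks(line):
    """Remove only the line terminator so empty trailing fields survive."""
    return line.rstrip("\r\n").split("\t")


def load_entry(dbsnp, line):
    """Illustrative loader (not the project's code) that keeps AltOnly entries.

    The values stored in the dict are placeholders; the real dict layout
    is not shown in this issue.
    """
    fields = fields_keeping_blanks(line)
    if len(fields) < 2:
        return
    snp_id, second = fields[0], fields[1]
    if second == "AltOnly":
        # Handle AltOnly before any test on the third field.
        dbsnp[snp_id] = "AltOnly"
    elif len(fields) >= 3 and fields[2] != "":
        dbsnp[snp_id] = (second, fields[2])


if __name__ == "__main__":
    print(fields_with_strip(RAW_LINE))
    print(fields_keeping_blanks(RAW_LINE))
    dbsnp = {}
    load_entry(dbsnp, RAW_LINE)
    print("2517878" in dbsnp)
```

With the blank fields preserved, the AltOnly case can be checked explicitly before the field-count test, so the Reference SNP ID ends up in the dict and is not scheduled for deletion.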