Run HitWindows only once

Original issue write-up from Ava Kelley (20 Jul 2018):

GeneScorePipeline currently takes a two-step approach to running HitWindows on SNPs:

1. When loading meta files, if there are >1000 meta SNPs, HitWindows is run on these SNPs before continuing.
2. After cross-filtering to a dataset containing only the overlap of a study's SNPs and the meta SNPs remaining after Step 1, HitWindows is run again on the cross-filtered SNPs.

This approach has a few concerning holes:

- The Step 1 HitWindows can choose an index SNP that is not in a data file even when a secondary SNP in the window is in the data file. That window is then lost from analysis, because the Step 2 HitWindows is run only on matching SNPs that were already pre-filtered by Step 1.
- When there are >1000 meta SNPs and the Step 1 HitWindows occurs, the second round in Step 2 should never do anything: all of the SNPs will already be at least the extension threshold apart, because they were filtered down to the index SNPs from the first HitWindows.
- In the Step 2 HitWindows, if the Step 1 HitWindows was skipped because of the number of meta SNPs, windows can be split when a SNP needed to extend the window is not in the data and was therefore already dropped by cross-filtering. (E.g. if your window extension size is 100k and you have three SNPs below the threshold that are each 70k apart, but the middle of the three is missing from your data, the current second HitWindows will just see two SNPs that are 140k apart and split this window into two windows.)

A "correct" approach to selecting SNPs would be to run HitWindows just once for each meta file, holding on to every SNP that was used to create the window, not just the index SNP (not currently how HitWindows operates), and then, when cross-filtering to a data file, to select from each of those windows the lowest p-value SNP that is in the data.
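
To make the 70k example and the proposed single-pass selection concrete, here is a minimal Python sketch. It is not the actual HitWindows implementation: it assumes a simplified model in which every SNP below one p-value threshold is a hit and a window is extended as long as the next hit is within the extension distance. The names `Snp`, `build_windows`, and `select_per_window`, the threshold of 1e-5, and the SNP IDs and p-values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    name: str
    chrom: int
    pos: int
    p: float


def build_windows(snps, p_threshold, extension):
    """Simplified stand-in for HitWindows (an assumption, not the real code).

    Keeps SNPs with p < p_threshold and merges consecutive hits on the same
    chromosome whose positions are at most `extension` bp apart. Every member
    SNP is retained, not just the index SNP.
    """
    hits = sorted((s for s in snps if s.p < p_threshold),
                  key=lambda s: (s.chrom, s.pos))
    windows = []
    for snp in hits:
        last = windows[-1][-1] if windows else None
        if (last is not None and last.chrom == snp.chrom
                and snp.pos - last.pos <= extension):
            windows[-1].append(snp)   # extend the current window
        else:
            windows.append([snp])     # start a new window
    return windows


def select_per_window(windows, data_snp_names):
    """For each window, pick the lowest p-value SNP that is in the data file."""
    selected = []
    for window in windows:
        present = [s for s in window if s.name in data_snp_names]
        if present:
            selected.append(min(present, key=lambda s: s.p))
    return selected


# Three sub-threshold SNPs, each 70k apart; the middle one is not in the data.
meta = [
    Snp("rs1", 1, 1_000_000, 1e-6),
    Snp("rs2", 1, 1_070_000, 1e-9),
    Snp("rs3", 1, 1_140_000, 1e-7),
]
data_names = {"rs1", "rs3"}
P_THRESHOLD = 1e-5      # hypothetical threshold
EXTENSION = 100_000     # 100k window extension, as in the example

# Current Step 2 behaviour when Step 1 is skipped: cross-filter, then window.
cross_filtered = [s for s in meta if s.name in data_names]
split_windows = build_windows(cross_filtered, P_THRESHOLD, EXTENSION)

# Proposed behaviour: window once on the meta SNPs, then pick per window.
meta_windows = build_windows(meta, P_THRESHOLD, EXTENSION)
chosen = select_per_window(meta_windows, data_names)

print(len(split_windows))                          # 2
print(len(meta_windows), [s.name for s in chosen])  # 1 ['rs3']
```

In this toy case, windowing after cross-filtering yields two windows because rs1 and rs3 are 140k apart, while windowing the meta SNPs first yields one window. The index SNP of that window (rs2) is absent from the data, so an index-only approach would lose the window, whereas the per-window selection falls back to rs3.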
