Closed sup3rgiu closed 8 months ago
Hi, This looks absolutely amazing! Thank you so much for your effort! I can tell that your implementation is really smart and your coding style is much better than mine. I will check your implementation and run a few tests soon, and then merge this into the main branch.
Hi, I tested the code using a hg19 test dataset (https://github.com/Cloufield/gwaslab/blob/main/examples/toy_data/to_harmonize.tsv) but I noticed for some cases the results(STATUS codes) are not the same for the last few rows. Probably the part for assigning status needs some revision? And I am thinking for now, we can add this as a new mode (like mode="vector") for further compatibility testing, instead of overwriting the original implementation. How do you think?
Original version:
Vectorized version:
Hi, Nice catch! The problem was that there was a bug if the sumstats dataframe didn't contain all the chromosomes contained in the Fasta file. I fixed it in the last PR and now it should be even faster when the dataframe contains only some chroms. Let me know!
Also, I agree that could be useful to keep both implementations for now.
I confirmed the new commit works perfectly for the test dataset. After merging, I will change some codes to make this an option in check_ref for now. Thank you for your valuable contribution!
Currently, the
hm_harmonize_sumstats.checkref
relies on Pandas.apply
method to call thecheck_status
function on each row of the sumstats dataframe, resulting in a rather high time complexity. The same operation can be done in a vectorized way, using only numpy operations. With the proposed PR, we can get an x15 time improvement, going from ~15 min down to ~1 min for such harmonization step:Original:
Modified:
The core idea is to first load all sequences contained in the Fasta file and convert them to integers in a super fast way. Then we can also convert NEA and EA columns into integer matrices. At this point we can do all the checks and flips in a vectorized way using numpy operations.
Note that I've commented the code heavily to make it as clear as possible.
Also, I have tested the proposed PR against 10 different input files, and the results (i.e., the Sumstats dataframe) are equivalent to those obtained using the untouched implementation. Of course I invite you to do further tests even on your side.