Vectorize checkref - Githubissues

sup3rgiu commented 8 months ago

Currently, the hm_harmonize_sumstats.checkref relies on Pandas .apply method to call the check_status function on each row of the sumstats dataframe, resulting in a rather high time complexity. The same operation can be done in a vectorized way, using only numpy operations. With the proposed PR, we can get an x15 time improvement, going from ~15 min down to ~1 min for such harmonization step:

Original:

2024/03/15 10:17:08  -Reference genome FASTA file: /scratch/gwas/GCA_000001405.14_GRCh37.p13_full_analysis_set.fna
2024/03/15 10:17:08  -Checking records: 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  X  Y  M  
2024/03/15 10:31:19  -Variants allele on given reference sequence :  12723699

Modified:

2024/03/15 09:33:13  -Reference genome FASTA file: /scratch/gwas/GCA_000001405.14_GRCh37.p13_full_analysis_set.fna
2024/03/15 09:33:13  -Checking records: 2024/03/15 09:33:31  -Checking records with custom fast_check_status()
2024/03/15 09:33:33    -Building numpy fasta records from dict
2024/03/15 09:33:48    -Checking records for ( len(NEA) <= 4 and len(EA) <= 4 )
2024/03/15 09:34:27    -Checking records for ( len(NEA) > 4 or len(EA) > 4 )
2024/03/15 09:34:32  -Finished checking records
2024/03/15 09:35:26  -Variants allele on given reference sequence :  12723699

The core idea is to first load all sequences contained in the Fasta file and convert them to integers in a super fast way. Then we can also convert NEA and EA columns into integer matrices. At this point we can do all the checks and flips in a vectorized way using numpy operations.

Note that I've commented the code heavily to make it as clear as possible.

Also, I have tested the proposed PR against 10 different input files, and the results (i.e., the Sumstats dataframe) are equivalent to those obtained using the untouched implementation. Of course I invite you to do further tests even on your side.

Cloufield commented 8 months ago

Hi, This looks absolutely amazing! Thank you so much for your effort! I can tell that your implementation is really smart and your coding style is much better than mine. I will check your implementation and run a few tests soon, and then merge this into the main branch.

Cloufield commented 8 months ago

Hi, I tested the code using a hg19 test dataset (https://github.com/Cloufield/gwaslab/blob/main/examples/toy_data/to_harmonize.tsv) but I noticed for some cases the results(STATUS codes) are not the same for the last few rows. Probably the part for assigning status needs some revision? And I am thinking for now, we can add this as a new mode (like mode="vector") for further compatibility testing, instead of overwriting the original implementation. How do you think?

Original version:

Vectorized version:

sup3rgiu commented 8 months ago

Hi, Nice catch! The problem was that there was a bug if the sumstats dataframe didn't contain all the chromosomes contained in the Fasta file. I fixed it in the last PR and now it should be even faster when the dataframe contains only some chroms. Let me know!

Also, I agree that could be useful to keep both implementations for now.

Cloufield commented 8 months ago

I confirmed the new commit works perfectly for the test dataset. After merging, I will change some codes to make this an option in check_ref for now. Thank you for your valuable contribution!

Cloufield / gwaslab

Vectorize checkref #80