atifrahman / HAWK

Hitting associations with k-mers
GNU General Public License v3.0

Expected Runtimes for HAWK #8

Open jasongallant opened 6 years ago

jasongallant commented 6 years ago

Hello- thanks for developing this very intriguing software.

I'm currently exploring its use for several resequenced genomes (~1GB each). What runtimes should I expect? I am running the runHAWK script in a shared HPC configuration and am unsure what the wall time and memory requirements are. Any pointers or benchmarks?

Second, it is not immediately clear what the strings output by the HAWK.cpp program indicate. They are many orders of magnitude larger than the total number of k-mers in my dataset, and I cannot infer what they correspond to. Some clarification here would help me benchmark the program on my own data.

atifrahman commented 6 years ago

Thanks for using it!

Run times depend on the number of samples and the read coverage of each sample, in addition to the size of the genome. Our analysis of ~200 YRI and TSI samples from the 1000 Genomes Project took around 12 days (mostly spent on k-mer counting) using 30 cores. The analysis of the E. coli ampicillin resistance data set took about 2 days. It should run in 64GB of memory, and the memory requirement can be adjusted by decreasing valInc in hawk.cpp.

The case_out_w_bonf.kmerDiff and control_out_w_bonf.kmerDiff files output by hawk.cpp contain the k-mers that passed Bonferroni correction (before correcting for co-factors). Unless something is going wrong, they should contain a much smaller number of k-mers than the total number of k-mers. The files may still be large because each entry holds the k-mer string, its p-value, and the counts of that k-mer in each sample.
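
To illustrate why the Bonferroni step should leave only a small fraction of k-mers, here is a minimal Python sketch of a generic Bonferroni filter. It is not HAWK's code: the function names, the alpha of 0.05, the toy k-mers, and the count of one million tests are all assumptions made for illustration, and hawk.cpp may compute its threshold differently.

```python
# Generic Bonferroni filter (illustrative sketch, not HAWK's implementation).
# All names and numbers below are hypothetical.

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold: alpha divided by the number of tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def passes_bonferroni(kmer_pvalues, n_tests, alpha=0.05):
    """Keep only the k-mers whose p-value is at or below the corrected threshold."""
    threshold = bonferroni_threshold(alpha, n_tests)
    return {kmer: p for kmer, p in kmer_pvalues.items() if p <= threshold}


# Toy input: three k-mers with made-up p-values.
toy_pvalues = {
    "ACGTACGTAC": 1e-12,
    "TTGACCATGA": 3e-6,
    "GGCATTCAGT": 4e-8,
}

# Assume one million k-mers were tested in total, so the threshold is 0.05 / 1e6.
significant = passes_bonferroni(toy_pvalues, n_tests=1_000_000)

for kmer, p in sorted(significant.items()):
    print(kmer, p)
```

With a threshold this strict, a k-mer needs a very small p-value to be kept, which is why the filtered files are expected to hold far fewer k-mers than were counted.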