Open jasongallant opened 6 years ago
Thanks for using it!
The run times will depend on the number of samples and the read coverage in each sample, in addition to the size of the genome. Our analysis of ~200 YRI and TSI samples from the 1000 Genomes Project took around 12 days (mostly for k-mer counting) using 30 cores. The analysis of the E. coli ampicillin resistance data set took about 2 days. It should run in 64 GB of memory, and the requirements can be adjusted by decreasing valInc in hawk.cpp.
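For intuition on why decreasing a constant like valInc lowers peak memory, here is a purely illustrative C++ sketch (not HAWK's actual code): if k-mers are processed in fixed-size batches, the resident buffer scales with the batch size rather than with the total number of k-mers, at the cost of more passes over the data.

```cpp
// Illustrative only: shows how a batch-size constant caps peak memory.
// valInc here is a stand-in for the constant of the same name in hawk.cpp;
// HAWK's real logic differs.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

static const std::size_t valInc = 1000000;   // rows per pass; smaller => less RAM, more passes

int main() {
    const std::size_t totalRows = 10000000;  // e.g. total number of k-mers to test
    std::vector<std::uint64_t> buffer;
    buffer.reserve(valInc);                  // peak memory tracks valInc, not totalRows
    for (std::size_t start = 0; start < totalRows; start += valInc) {
        std::size_t end = std::min(start + valInc, totalRows);
        buffer.clear();
        for (std::size_t i = start; i < end; ++i)
            buffer.push_back(i);             // stand-in for loading/testing one k-mer row
        // ... per-batch statistics would run here ...
    }
    std::cout << "processed " << totalRows << " rows in batches of " << valInc << "\n";
    return 0;
}
```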
The case_out_w_bonf.kmerDiff and control_out_w_bonf.kmerDiff files output by hawk.cpp contain k-mers that passed Bonferroni correction (before correcting for co-factors). Unless something is going wrong, they should contain a much smaller number of k-mers than the total number of k-mers. The files may be large because they contain the k-mer strings, p-values, and counts of each k-mer in each sample.
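If it helps to sanity-check those outputs, here is a minimal sketch of a reader for a .kmerDiff file. The column layout assumed here (k-mer string, then p-value, then per-sample counts) is inferred from the description above, not from HAWK's source, so verify it against your files first.

```cpp
// Hypothetical .kmerDiff reader: counts lines and reports the smallest p-value.
// Assumed layout per line: <k-mer string> <p-value> <count per sample...>;
// this ordering is a guess based on the description above, not HAWK's spec.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main(int argc, char** argv) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " case_out_w_bonf.kmerDiff\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    if (!in) {
        std::cerr << "cannot open " << argv[1] << "\n";
        return 1;
    }
    std::string line, kmer;
    long long nKmers = 0;
    double pval, minP = 1.0;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        if (!(ss >> kmer >> pval)) continue;  // skip malformed lines
        ++nKmers;
        if (pval < minP) minP = pval;
    }
    std::cout << nKmers << " k-mers passed Bonferroni; smallest p-value = " << minP << "\n";
    return 0;
}
```

Compiling this with g++ -O2 and pointing it at case_out_w_bonf.kmerDiff gives a quick line count to compare against the total k-mer count; if the columns are ordered differently in your version, adjust the extraction accordingly.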
Hello, thanks for developing this very intriguing software.

I'm currently exploring its use for several resequenced genomes (~1 GB each) and am curious what the expected runtimes are. I am running it in a shared HPC configuration and, when launching the runHAWK script, am unsure what the wall time and memory requirements are. Any pointers or benchmarks?

Second, it is not immediately clear what the strings output by the hawk.cpp program indicate: they are many orders of magnitude more numerous than the total number of k-mers in my dataset, and I cannot infer what they correspond to. Some clarity here would help me benchmark the program on my own data.