atifrahman / HAWK

Hitting associations with k-mers
GNU General Public License v3.0

hawk.out runs too long #20

Open SC-Duan opened 4 years ago

SC-Duan commented 4 years ago

Hi, I have 91 samples and running "hawk.out 42 49" is taking very long (18 days so far); the hawk_out.txt file is still empty. I set noThread=10. What could be wrong? The read coverage of each sample is about 8x and the genome size is 2.2 Gb. Thank you!

atifrahman commented 4 years ago

That is really surprising. For the datasets we have analyzed, hawk takes much less time than k-mer counting does. How long did k-mer counting take?

We ran hawk on ~200 human samples and it took about a day (with 32 threads).

SC-Duan commented 4 years ago

k-mer counting was very fast. Is there a way to diagnose the problem with hawk.out?

atifrahman commented 4 years ago

Can you please share the 'hawk_out.txt' file?

SC-Duan commented 4 years ago

hawk_out.txt file is empty.

SonjaKersten commented 3 years ago

Hi, I have the same issue. I've been running runHawk on Reads_case_sorted.txt and Reads_control_sorted.txt of 124 GB each for 13 days now, and the hawk_out.txt file is still empty. What could potentially be wrong?

robertwhbaldwin commented 3 years ago

Did someone figure out what the problem was? I may be having the same issue. Thanks - Robert

robertwhbaldwin commented 3 years ago

I think I may have a similar problem. I'm running runHawk with k-mer counts from 50 samples sequenced to ~9x coverage (genome size ~2 Gb); each k-mer file is 20-25 GB. I'm running HAWK on an AWS EC2 instance and have already racked up $200 in charges, so I'd like to know whether I should kill it or let it run. The k-mer counting step took about 2 days (~1 hour per sample), and runHawk has been going for 23 hours; I was expecting it to have finished by now. The instance type is m5ad.4xlarge with 16 vCPUs and 64 GB RAM. Thanks - Robert

atifrahman commented 3 years ago

Has anything been written to case_out_wo_bonf.kmerDiff and control_out_wo_bonf.kmerDiff? If not, you can probably kill it.
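
One way to answer this kind of question without watching the job by hand is a small progress check. Below is a minimal sketch (not part of HAWK; the file names are the ones mentioned in this thread, and the function names are illustrative) that records the sizes of the intermediate output files twice and reports whether any of them grew:

```python
# Minimal sketch for checking whether runHawk appears to be making progress:
# sample the sizes of its intermediate output files twice and see if any grew.
# File names are taken from this thread; adjust paths to your run directory.
import os
import time

FILES = ["case_out_wo_bonf.kmerDiff", "control_out_wo_bonf.kmerDiff"]

def file_sizes(files):
    """Return a dict mapping each file to its current size in bytes (0 if missing)."""
    return {f: (os.path.getsize(f) if os.path.exists(f) else 0) for f in files}

def is_progressing(files=FILES, wait_seconds=300):
    """Return True if any monitored file grew during the waiting period."""
    before = file_sizes(files)
    time.sleep(wait_seconds)
    after = file_sizes(files)
    return any(after[f] > before[f] for f in files)
```

If the sizes stay at zero across several such checks, the run is likely stuck rather than slow.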

robertwhbaldwin commented 3 years ago

No, neither file has anything in it. I'll have to stop the job. It would be good to know what the problem was (too few resources?) and how to spot it early on: none of the files initially created by runHawk grew over the course of the run, so if there's a way to check whether things are progressing, that would be helpful. Also, any recommendations on what compute resources to use (more threads, faster CPU, more RAM, etc.)? I could launch a new instance. I should also point out that I had to keep the input files on EBS (remote) storage rather than local storage, which would hamper performance, but I don't think that's the issue here. Thanks - Robert

atifrahman commented 3 years ago

Sorry about that!

We'll look into it. We never encountered this on any of the datasets we used. The datasets are so large that it's difficult for others to share them so that we can debug. We'll give it another shot.

robertwhbaldwin commented 3 years ago

I tried runHawk again on a different instance, starting over from the beginning and reinstalling all software. I ran it with 35 threads and ~80 GB RAM, and moved the sorted input k-mer files to local SSD storage. I left it to run overnight and after ~8 hours found it still running with no change to the output files. I've saved the AMI but won't attempt this again unless the issue is resolved. My sorted k-mer files were 25 GB each (50 samples). Does that seem too large for a 2 Gb diploid genome at 9x coverage? Let me know if I can help in any way to resolve this issue.

atifrahman commented 3 years ago

Can you please share one or two of the k-mer count files? We can check whether they are in the expected format. When I tried it on ~200 human samples, the total size of the k-mer count files was >5 TB, so yours don't seem unreasonably large.

robertwhbaldwin commented 3 years ago

Do you want the whole files?

atifrahman commented 3 years ago

If possible, yes. You can upload them somewhere and share the link by emailing me at atif.bd@gmail.com

SonjaKersten commented 3 years ago

I'm also still stuck on the same issue. However, I don't know whether it's the file size or the fact that I'm running it on only two pools from a bulked segregant experiment (two samples). I would appreciate it if you let me know how the issue gets resolved. Thanks, Sonja

robertwhbaldwin commented 3 years ago

For those still dealing with this problem: it turns out my k-mer files had the incorrect format, so check yours. I ran the k-mer counting step with an unmodified install of Jellyfish 2 to which I applied the patch provided by HAWK, but my k-mer files were still wrong: the first column contained the k-mer strings themselves. The first column should be a number representing the k-mer, not the k-mer string, and the second column should be the count for that k-mer.
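
The format check described above can be automated with a few lines. This is a minimal sketch (not part of HAWK; function names are illustrative) that inspects the head of a sorted k-mer count file and flags lines whose first column is a nucleotide string rather than an integer:

```python
# Minimal sketch of the format check described above: in the sorted k-mer
# count files HAWK expects, the first column is an integer encoding of the
# k-mer (as produced by the patched Jellyfish), not the nucleotide string,
# and the second column is the count.
def looks_like_hawk_format(line):
    """Return True if the line is two columns and the first is an integer."""
    fields = line.split()
    if len(fields) != 2:
        return False
    return fields[0].isdigit() and fields[1].isdigit()

def check_file_head(path, n=5):
    """Check the first n lines of a sorted k-mer count file."""
    with open(path) as fh:
        lines = [line for _, line in zip(range(n), fh)]
    return bool(lines) and all(looks_like_hawk_format(l) for l in lines)
```

For example, a line like `123456 17` passes the check, while `ACGTACGT 17` (k-mer string in the first column, the symptom described above) fails it.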

robertwhbaldwin commented 3 years ago

And I'll add that you can install an unmodified copy of Jellyfish, but you may need to use version 2.2.10 as suggested in the HAWK documentation. I tried applying the patch to a more recent version of Jellyfish 2 and the output was not formatted properly; when I applied it to 2.2.10, it was fixed.

SonjaKersten commented 3 years ago

Thanks Robert, I will check and try it out.