SC-Duan opened 4 years ago
That is really surprising. For the datasets we have analyzed, hawk takes much less time compared to the time needed for counting k-mers. How long did k-mer counting take?
We ran hawk on ~200 human samples and it took about a day (with 32 threads).
K-mer counting was very fast. Is there any way to diagnose the problem with hawk.out?
Can you please share the 'hawk_out.txt' file?
hawk_out.txt file is empty.
Hi, I have the same issue. I've been running runHawk on Reads_case_sorted.txt and Reads_control_sorted.txt (124 GB each) for 13 days now, and the hawk_out.txt file is still empty. What could potentially be wrong?
Did someone figure out what the problem was? I may be having the same issue. Thanks - Robert
I think I may have a similar problem. I'm running runHawk with k-mer counts from 50 samples sequenced to ~9x coverage (genome size 2 Gb); each k-mer file was 20-25 GB. I'm running HAWK on an AWS EC2 instance and have already racked up $200 in charges, so I'd like to know whether I should kill it or let it run. The k-mer counting step took about 2 days (1 hour per sample), and runHawk has been going for 23 hrs now. I was expecting it to have finished by now. The instance type is m5ad.4xlarge with 16 vCPUs and 64 GB RAM. Thanks - Robert
Has anything been written to case_out_wo_bonf.kmerDiff and control_out_wo_bonf.kmerDiff? If not, you can probably kill it.
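As a rough sketch, one way to check whether those files are receiving any data (same file names as above; the empty placeholder files here are created only so the snippet is self-contained):

```shell
# Illustration only: create empty placeholders standing in for the
# files runHawk produces.
: > case_out_wo_bonf.kmerDiff
: > control_out_wo_bonf.kmerDiff

# Report whether each output file has received any data yet.
for f in case_out_wo_bonf.kmerDiff control_out_wo_bonf.kmerDiff; do
    if [ -s "$f" ]; then
        echo "$f: $(wc -c < "$f") bytes so far"
    else
        echo "$f: still empty"
    fi
done
```

Re-running this (or just watching the sizes with `ls -l`) every few minutes during a run would show whether anything is actually being appended.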
No, neither file has anything in it. I'll have to stop the job. It would be good to know what the problem was (too few resources?) and how to spot it early on. None of the files initially created by runHawk had anything added to them over the course of the run; if there's some way to check whether things are progressing, that would be helpful. Also, any recommendations on what compute resources to use (more threads, faster CPU, more RAM, etc.)? I could launch it on a new instance. I should also point out that I had to keep the input files on EBS (remote) storage rather than local storage, which would hamper performance, but I don't think that's the issue here. Thanks - Robert
Sorry about that!
We'll look into it. We never encountered this on any of the datasets we used. The datasets are so large that it's difficult for others to share them so that we can debug. We'll give it another shot.
I tried runHawk again on a different instance. I started over from the beginning, reinstalling all the software. I ran it with 35 threads and ~80 GB RAM and moved the sorted input k-mer files to local SSD storage. I left it to run overnight and after ~8 hrs found it still running with no change to the output files. I saved the AMI but will not be attempting this again unless the issue is resolved. My sorted input k-mer files were 25 GB each (50 samples). Does that seem too large for a 2 Gb diploid genome at 9x coverage? Let me know if I can help in any way to resolve this issue.
Can you please share one or two of the k-mer count files? We can check whether they are in the expected format. When I tried it on ~200 human samples, the total size of k-mer count files was >5TB. So, they don't seem unreasonably large.
Do you want the whole files?
If possible, yes. You can upload them somewhere and share the link by emailing me at atif.bd@gmail.com
I'm also still stuck on the same issue. However, I don't know whether it is the file size or the fact that I'm running it on only two pools from a bulked segregant experiment (two samples). I would appreciate it if you let me know how the issue gets resolved. Thanks, Sonja
For those still dealing with this problem, it turns out that my k-mer file had the wrong format. Check your k-mer files: I had run the k-mer counting step with an unmodified version of jellyfish2 rather than one with the patch provided by HAWK. My k-mer file was incorrect because the first column contained the k-mer strings. The first column should be a number encoding the k-mer, not the k-mer string itself, and the second column should be the count for that k-mer.
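A minimal sketch of a format check along these lines (file names and sample lines below are made up for illustration; a correctly formatted file has a numeric k-mer encoding in column 1 and the count in column 2):

```shell
# Fabricated examples: a correctly formatted file has a numeric k-mer
# encoding in column 1 and the count in column 2 ...
printf '1234567 12\n7654321 3\n' > good_kmers.txt
# ... while an incorrectly formatted file has the k-mer string itself
# in column 1.
printf 'ACGTACGT 12\nTGCATGCA 3\n' > bad_kmers.txt

check_format() {
    # Exit 0 if the first column of the first 1000 lines is all numeric.
    awk 'NR <= 1000 && $1 !~ /^[0-9]+$/ {bad = 1} END {exit bad}' "$1"
}

check_format good_kmers.txt && echo "good_kmers.txt: format looks OK"
check_format bad_kmers.txt  || echo "bad_kmers.txt: first column is not numeric"
```

Spot-checking the first few lines with `head` before launching a multi-day run would catch this early.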
And I'll add that you can start from the unmodified jellyfish source, but you may need to use version 2.2.10 as suggested in the HAWK documentation. I tried applying the patch to a more recent version of jellyfish2 and the output was not formatted properly; when I applied it to 2.2.10, it was fixed.
Thanks Robert, I will check and try it out.
Hi, I have 91 samples and running "hawk.out 42 49" is taking very long (18 days already); the hawk_out.txt file is still empty. I set noThread=10. What could be wrong? The read coverage of each sample is about 8x and the genome size is 2.2 Gb. Thank you!