chhylp123 / hifiasm

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads
MIT License
526 stars, 86 forks

Feature request: Separate hifiasm into stages #608

Open SHuang-Broad opened 7 months ago

SHuang-Broad commented 7 months ago

Hi,

Is it possible to separate hifiasm into stages (e.g. separating the read-error correction step and the phased string graph generation step)?

The application that initially led us to ask for this functionality is when we want to have both the diploid assembly and the alternative contigs for some investigation.

Thank you! Steve

vellamike commented 7 months ago

I am also interested in this. I looked at doing it by modifying the source code, and while I succeeded, it was quite challenging and the solution I came up with was a little hacky.

baozg commented 7 months ago

You can easily rerun with the bin files to get a primary/alternate, dual, or trio/Hi-C assembly, as long as you use the same prefix.

SHuang-Broad commented 7 months ago

Oh, that's good to know, @baozg. Just to confirm: hifiasm will automatically "resume" the work if it detects bin files matching the provided prefix?

baozg commented 7 months ago

Yes, hifiasm will reuse all the bin files if they exist. But be careful if they were generated by a different version of hifiasm.

SHuang-Broad commented 7 months ago

Awesome! I'll test run with our samples and report back.

Thank you @baozg !

chhylp123 commented 7 months ago

Hello @vellamike @SHuang-Broad @baozg, sorry for the late reply; I was too busy during the last few weeks. Actually, ‘--bin-only’ might work. For example, if you would like to run hifiasm (Hi-C) in one step, the command line should be as follows:

hifiasm -t48 --h1 HiC_r1.fq --h2 HiC_r2.fq HiFi.fq

With ‘--bin-only’, the whole assembly procedure could be separated into two steps:

hifiasm -t48 --h1 HiC_r1.fq --h2 HiC_r2.fq --bin-only HiFi.fq   # hifiasm will only produce the bin files from error correction
hifiasm -t48 --h1 HiC_r1.fq --h2 HiC_r2.fq HiFi.fq   # hifiasm will reuse the bin files

Basically, with ‘--bin-only’, hifiasm stops as soon as the bin files have been generated.
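The two-step procedure described above could be wrapped in a small script along these lines. This is only a sketch: the `HIFIASM` and `PREFIX` variables and the function names `make_bins` and `assemble` are illustrative and not part of hifiasm, and it assumes the second step is run without `--bin-only`, since with that flag hifiasm stops after writing the bin files. The input file names are the ones used in the commands above.

```shell
#!/bin/sh
# Sketch of a two-step hifiasm (Hi-C) run.
# HIFIASM and PREFIX are hypothetical helper variables for this example.
HIFIASM=${HIFIASM:-hifiasm}
PREFIX=${PREFIX:-sample1.asm}

# Step 1: error correction and overlap only; stops once bin files are written.
# $1 = number of threads.
make_bins() {
    "$HIFIASM" -t "$1" -o "$PREFIX" --h1 HiC_r1.fq --h2 HiC_r2.fq --bin-only HiFi.fq
}

# Step 2: same -o prefix, no --bin-only, so existing bin files can be reused.
# $1 = number of threads.
assemble() {
    "$HIFIASM" -t "$1" -o "$PREFIX" --h1 HiC_r1.fq --h2 HiC_r2.fq HiFi.fq
}
```

A run would then be `make_bins 48` on the larger machine, followed later by `assemble 48` (or fewer threads) in the same directory with the same `PREFIX`.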

SHuang-Broad commented 6 months ago

Thank you, @chhylp123 !

Following your suggestion, I ran a few experiments and it works as expected!

I've attached a few plots here demonstrating how CPU, memory and disk space are used throughout the process. Hopefully this is useful. For bin generation, I used 42 cores. For the actual assembly steps, I used 28 cores.

Btw, this --bin-only flag isn't documented anywhere, but I believe it should be. Here's the reason: you can see from the monitoring plots that the bin-generation stage is the main "bottleneck". It needs the most resources and lasts 16 hours. The assembly steps not only use just a few threads most of the time (~2 hours), but also don't need as much memory. For those of us who do computations in the cloud, we can reduce costs by using non-spot VMs for the bin-generation stage and then switching over to spot VMs configured with fewer resources.

Again, thank you for the suggestion!

Attachments:
AltModeUsingBinFiles.HighCoverage.monitoring.log.pdf
BinGeneration.HighCoverage.monitoring.log.pdf
HapModeUsingBinFiles.HighCoverage.monitoring.log.pdf

Steve

hazmup commented 3 months ago

Hi! After reading this I am still not sure how to reuse the bin files. Are all the generated bin files needed? My -o includes the path, and I tried both using the same file prefix in a different folder and rerunning in the same folder. Both times it seems the whole procedure was rerun. How should it be done? Is there an easy way to check whether the pipeline is resuming or running from the beginning? Thank you in advance, Stelios

chhylp123 commented 3 months ago

Basically, just rerun hifiasm with the same option for -o, and hifiasm will reuse the bin files. The log file will tell you whether the bin files have been reused: if the pipeline is resuming, hifiasm will skip the whole error correction step without printing any k-mer histogram.
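Before rerunning, it can help to confirm that bin files actually exist for the prefix passed to -o. The helper below is a minimal sketch, assuming the bin files are named `<prefix>.<something>.bin` (for example `<prefix>.ec.bin`); the function name `list_bin_files` is made up for this example and is not a hifiasm command.

```shell
#!/bin/sh
# Sketch: list bin files that share a given -o prefix.
# Assumption: bin files follow the pattern "<prefix>.*.bin".
# $1 = the exact value given to -o, including any directory part.
list_bin_files() {
    prefix=$1
    for f in "$prefix".*.bin; do
        # If nothing matches, the pattern stays literal and is skipped here.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

For example, `list_bin_files /path/to/sample1.asm` (a hypothetical path) prints nothing if no bin files match that prefix. In that case a rerun with that -o value has nothing to reuse.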

hazmup commented 3 months ago

This is not happening for me. I ran hifiasm again for a different sample, and when I tried to reuse the bin files of the first sample using the original -o prefix, it reran the whole pipeline. Maybe something got slightly mixed up; I will try again when it finishes. Thanks!