Poor utilization of threads (maybe user error?)

PacificBiosciences / HiPhase

Small variant, structural variant, and short tandem repeat phasing tool for PacBio HiFi reads


Poor utilization of threads (maybe user error?) #9

Closed mrvollger closed 1 year ago

mrvollger commented 1 year ago

Hi @holtjma,

I am seeing that when I give hiphase 32 threads it only uses 150-300% CPU (see the screenshot below with top and the run log). Is this expected? If not, do you have any recommendations? This is the command I am using:

        hiphase -t 32 \
            --bam {input.bam} \
            --vcf {input.vcf} \
            --reference {input.ref} \
            --output-bam {output.bam} \
            --output-vcf {output.vcf} \
            --summary-file {output.summary} \
            --stats-file {output.stats} \
            --blocks-file {output.blocks}

Thanks in advance! Mitchell

holtjma commented 1 year ago

Yea, this is a bottleneck we're aware of that's specifically related to writing haplotagged files. The phasing itself is parallelized well, but the writing of files is still handled in a single-threaded manner. If you are not writing BAM files, this isn't really an issue because the file sizes are small, but once you start haplotagging, the tool quickly becomes thread and/or I/O bound. Improving this is on our longer-term TODO list.
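To make the effect of a single-threaded writer concrete, here is a rough back-of-the-envelope sketch using Amdahl's law. It is not taken from HiPhase itself: the function names are hypothetical, and it assumes the run splits cleanly into one serial part (such as BAM writing) and one perfectly parallel part (such as phasing), which a real run will only approximate. Under that assumption, the average CPU utilization reported by top equals the Amdahl speedup, so the reported 150-300% on 32 threads can be turned into an implied serial fraction.

```rust
/// Expected average CPU utilization (1.0 == 100% in `top`) when a fraction
/// `serial` of the single-threaded runtime cannot be parallelized and the
/// remainder scales perfectly across `threads` (Amdahl's law).
fn expected_utilization(serial: f64, threads: u32) -> f64 {
    1.0 / (serial + (1.0 - serial) / threads as f64)
}

/// Inverse of `expected_utilization`: the serial fraction implied by an
/// observed utilization. Requires `threads > 1`.
fn implied_serial_fraction(utilization: f64, threads: u32) -> f64 {
    let n = threads as f64;
    (1.0 / utilization - 1.0 / n) / (1.0 - 1.0 / n)
}

fn main() {
    let threads = 32;
    // 1.5 and 3.0 correspond to the 150% and 300% CPU figures reported above.
    for observed in [1.5, 3.0] {
        let serial = implied_serial_fraction(observed, threads);
        println!(
            "{:.0}% CPU on {} threads -> roughly {:.0}% serial work",
            observed * 100.0,
            threads,
            serial * 100.0
        );
    }
}
```

Under this simplified model, 150-300% CPU on 32 threads would correspond to somewhere between about a third and two thirds of the single-threaded runtime being serial, which is in line with the writer being the limiting step once haplotagging is enabled.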

mrvollger commented 1 year ago

Thanks for the info!

This might not be helpful, but I have found that setting this option with up to 8-16 threads can really speed things up!

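use rust_htslib::bam;

// stuff reading in a bam file and a header from that bam
// ...
let threads = 16;
// `out` starts as the output path and is shadowed by the writer
let mut out = bam::Writer::from_path(out, &header, bam::Format::Bam).unwrap();
// let htslib use extra threads when writing the output BAM
out.set_threads(threads).unwrap();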

This of course assumes you use Rust, rust-htslib, etc.

But when I use this I can write >10,000 PacBio reads per second.

mrvollger commented 1 year ago

Can confirm that it is much faster without the BAM output file. But FYI, I am still not seeing great utilization of all 32 threads.

holtjma commented 1 year ago

I'm not entirely sure what I'm looking at on that top readout. Is the rg command providing sequential timepoints?

Regardless, there is likely some optimization of threads that can happen around all forms of I/O and parallelization. Most internal tests so far have been on 16 threads, and we have not revisited parallelization components probably since proof-of-concept. Historically, they were not the bottlenecks, but we may need to revisit that if further speed improvements get prioritized.

mrvollger commented 1 year ago

Ahh sorry. rg is just a grep alternative I like and it's just searching top for updates with hiphase over a minute or so.

But I was able to remove the need for the bam with the new haplotag file you made for me and I am happy with that speed. So feel free to close if you want, or leave open to bookmark potential future improvements.

holtjma commented 1 year ago

v0.10.0 leverages the thread pools provided by htslib. This was the lowest-hanging fruit in the short term for optimizing I/O. Internally, we saw about a 40% speedup while haplotagging, although mileage will vary across systems and with contention.