Q: hicBuildMatrix slow - how best to improve throughput

keiranmraine commented 5 years ago

I'm finding it takes a considerable time to process our input BAMs (from BWA, generated as described here) with the recommended threads (8) against a high performance file system. I've also noted that the used cpu time factor is only ~1.3 rather than close to 8 as would be expected:

hicBuildMatrix --binSize 10000 --inputBufferSize 400000 --restrictionSequence GATC --threads 8 --outBam hicBuildMatrix/WTSI-OESO_103.hic.bam -o hicBuildMatrix/WTSI-OESO_103.hic_matrix.h5 --QCfolder hicBuildMatrix/WTSI-OESO_103-hicQC --samFiles mapped/WTSI-OESO_103_R1.bam mapped/WTSI-OESO_103_R2.bam

Each input is ~65GB (reads/file ~618 million). I'm also creating a version will all secondary/supplementary reads removed which will reduce the raw reads by ~20%.

What is the best way to improve throughput?

Increasing the buffer size?
Can the --region parameter be used to generate a matrix per chromosome for merging that is comparable to running it over the whole input? If so, which tool can merge this information

keiranmraine commented 5 years ago

Code wise I think adding the threads option to the pysam readers would help greatly. As it stands a new thread has to wait on the generation of the next set of reads from the parent process making this the rate limiting step. I had this issue in a similar piece of code.

https://github.com/deeptools/HiCExplorer/blob/3b7abc349a8f2aa31a10d4d79ac4ed100b3b7aa1/hicexplorer/hicBuildMatrix.py#L1069-L1070

I think it's safe to set threads=2 for AlignmentFile/Samfile when args.threads>1.

joachimwolff commented 5 years ago

Hi,

The limiting factor is the read performance and the single thread performance. We need to read both BAM files line by line to find the correct corresponding forward and reverse reads. However, there is in our knowledge no way to make this sequential process to run in parallel. Even if it is possible to somehow get iterators to read different chunks of the file, the input bandwidth is still the same and needs now to be shared. These considerations are the reasons for the current implementation.

To improve the performance we recommend to increase the buffer size, but this comes with higher memory usage. However, this helps only if the the input parent threads is able to read the input faster as the n worker threads have done their processing. If the workers are faster, the parallel performance will not improve. In my experience 4 - 6 threads give you an improvement of a factor of 3 compared to the single core version.

What can help too is to not set --outBam parameter if the resulting BAM file is not needed.

To set the --region parameter will not give you a performance increase. Still, all lines need to be read given the unordered nature of our BAM files. It is not possible to order it because the information which forward and reverse read are associated would be lost. However, if you want to try this this will give you only the intra-chromosomal reads and no inter-chromosomal reads. Unfortunately, we do not offer a tool to merge these kind of matrices. hicSumMatrices is only able to merge matrices of the same size (which you don't have).

Concerning threads=2 for pysam.Samfile, I will test this and inform you how well that works. Thanks for this tip.

Best,

Joachim

deeptools / HiCExplorer

Q: hicBuildMatrix slow - how best to improve throughput #393