WGLab / NanoRepeat

NanoRepeat: fast and accurate analysis of Short Tandem Repeats (STRs) from Oxford Nanopore sequencing data
MIT License

NanoRepeat uses more cores than specified with -c #13

Closed: 18parkky closed this issue 6 months ago

18parkky commented 6 months ago

Hi, thanks for developing NanoRepeat!

I'm trying to measure the runtime of NanoRepeat when running with a varying number of cores.

However, using Linux's top command, I noticed that NanoRepeat sometimes uses more cores than specified with the -c option. For example, even though I set -c 16, NanoRepeat occasionally uses up to 26 cores. I'm assuming this happens during the alignment step with minimap2, where it uses as many processors as possible, and that it then shifts back to using fewer cores in the other steps.

Do you know why NanoRepeat does this, and is there any way to fix it?

Thanks, 18parkky

fangli80 commented 6 months ago

Hello 18parkky, thanks for your inquiry. NanoRepeat uses GaussianMixture from the sklearn library to phase the reads. However, GaussianMixture uses all available cores by default and has no direct parameter to control the number of threads. This is because GaussianMixture depends on some numpy functions, which in turn rely on multi-threaded libraries like OpenMP.

The number of threads used by OpenMP can be controlled with the environment variable OMP_NUM_THREADS. I've updated NanoRepeat to set OMP_NUM_THREADS to 1. Despite this limit on core usage, the impact on total runtime appears minimal from my observations. Please use git clone https://github.com/WGLab/NanoRepeat.git to get the latest version and test it.
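For readers who want to apply the same idea in their own scripts, here is a minimal sketch of capping the thread pool through environment variables. It is not the actual NanoRepeat change. The helper name limit_numeric_threads is hypothetical, and OPENBLAS_NUM_THREADS and MKL_NUM_THREADS are included only on the assumption that numpy may be linked against OpenBLAS or MKL; only OMP_NUM_THREADS is mentioned in this thread. The variables generally need to be set before numpy or sklearn is first imported, since the backends typically read them when they are loaded.

```python
import os

# OMP_NUM_THREADS is the variable named in this thread. The other two are
# an assumption about which BLAS backend numpy was built against.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_numeric_threads(num_threads=1):
    """Cap backend thread pools (hypothetical helper, not part of NanoRepeat).

    Call this before importing numpy or sklearn.
    """
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(num_threads)


limit_numeric_threads(1)

# Import the numerical libraries only after the variables are set, e.g.:
# from sklearn.mixture import GaussianMixture
```

Exporting OMP_NUM_THREADS in the shell before launching the program should have the same effect, since child processes inherit the environment.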

Thanks! Li

18parkky commented 6 months ago

Like you said, it seems that the impact of this is minimal in my observations too. Thanks for the reply and update!

18parkky