Deep-MI / FastSurfer

PyTorch implementation of FastSurferCNN
Apache License 2.0

Uncontrolled Multi CPU Threading in FastSurfer (even when setting value for --threads) #371

Open LeHenschel opened 1 year ago

LeHenschel commented 1 year ago

Description

The number of CPU threads used by the FastSurfer segmentation modules cannot be controlled via the --threads argument. In the FastSurfer surface pipeline, the thread count is only respected when --threads is set to 1.

Setting the environment variable OMP_NUM_THREADS in run_fastsurfer.sh instead of recon-surf.sh may also solve the issue for --threads 1. Other values (threads > 1), however, are not guaranteed to keep the CPU usage at the requested thread count (neither in the segmentation nor in the surface module). The underlying problem is numpy's multi-threading:

In its default state, numpy uses all available threads for any function backed by a multi-threaded C library (OpenBLAS, MKL, ...). This can cause issues in two ways: a) CPU overload when running in parallel, and b) slowdown of functions on small matrices/operations (basically unnecessary overhead). There is no option to change this in numpy itself, mainly because a catch-all solution for all the different C libraries is difficult (see e.g. https://github.com/numpy/numpy/issues/16990, https://github.com/numpy/numpy/issues/11826).
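To illustrate the behavior (a minimal sketch, not FastSurfer code): a single BLAS-backed call such as a matrix multiplication will, by default, fan out to every available core, regardless of any thread limit handled at the Python level.

```python
import numpy as np

# With default settings, this matrix multiplication is dispatched to the
# underlying BLAS library (OpenBLAS, MKL, ...), which typically spawns as
# many threads as there are cores -- independent of any --threads argument.
a = np.random.rand(4000, 4000)
b = np.random.rand(4000, 4000)
c = a @ b  # CPU usage spikes to all cores while this runs
```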

Short term solution

Set all relevant environment variables to a specific value before (!) numpy is imported. This is a simple solution with the drawback that all relevant variables (https://stackoverflow.com/questions/30791550/limit-number-of-threads-in-numpy) have to be known and changed (and the list may change over time).
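A minimal sketch of this workaround (the variable names are taken from the StackOverflow answer linked above; which ones actually take effect depends on the BLAS backend numpy was built against):

```python
import os

# These must be set before numpy (or anything that imports numpy) is loaded,
# otherwise the BLAS thread pools have already been initialized.
n_threads = "1"
os.environ["OMP_NUM_THREADS"] = n_threads         # OpenMP
os.environ["OPENBLAS_NUM_THREADS"] = n_threads    # OpenBLAS
os.environ["MKL_NUM_THREADS"] = n_threads         # Intel MKL
os.environ["VECLIB_MAXIMUM_THREADS"] = n_threads  # Apple Accelerate
os.environ["NUMEXPR_NUM_THREADS"] = n_threads     # numexpr

import numpy as np  # noqa: E402  -- import only after the variables are set
```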

Permanent fix

The current recommendation (per this discussion on the numpy GitHub: https://github.com/numpy/numpy/issues/11826) is to use the threadpoolctl package to wrap all relevant functions. This way, user-specified thread counts can actually be honored instead of limiting everything to 1. This would require several changes in LaPy and FastSurfer.
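A sketch of what such a wrapper could look like (threadpool_limits is the actual threadpoolctl API; the wrapped function is a hypothetical placeholder, not an existing FastSurfer/LaPy routine):

```python
import numpy as np
from threadpoolctl import threadpool_limits


def some_heavy_numpy_step(data, threads=1):
    """Placeholder for a LaPy/FastSurfer routine that calls into BLAS."""
    # Limit the BLAS thread pools only for the duration of this call,
    # using the user-specified thread count instead of a hard-coded 1.
    with threadpool_limits(limits=threads, user_api="blas"):
        return data @ data.T


if __name__ == "__main__":
    result = some_heavy_numpy_step(np.random.rand(2000, 2000), threads=4)
```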

dkuegler commented 11 months ago

I think multi-CPU management is still an open issue. I thought limiting the CPU availability via singularity (or docker) might actually be the best option, as documented in https://docs.sylabs.io/guides/main/user-guide/cgroups.html. That page also describes another way to limit CPU usage, namely through systemd-run, which should be available in Ubuntu 22.04 by default (https://docs.sylabs.io/guides/main/user-guide/cgroups.html#applying-resource-limits-with-external-tools, https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html).

I am not really sure whether or not this solves the issue, but it is worth having a look at...

dkuegler commented 11 months ago

What should also be mentioned here: right now, the default value for --threads in run_fastsurfer.sh is 1, which means that both inference and segstats.py get significantly slower. I am not too sure about N4; it may be that N4 is currently also "circumventing" the thread limitation. In general, this means that you need to manually specify a reasonable value for --threads to get close to the one-minute segmentation target.