Medical-Image-Analysis-Laboratory / mialsuperresolutiontoolkit

The Medical Image Analysis Laboratory Super-Resolution ToolKit (MIALSRTK) consists of a set of C++ and Python processing and workflow tools necessary to perform motion-robust super-resolution fetal MRI reconstruction in the BIDS Apps framework.
BSD 3-Clause "New" or "Revised" License

BUG: Encapsulate StacksOrdering and srtkHRmask in nipype nodes #75

Closed. pdedumast closed this issue 3 years ago.

pdedumast commented 3 years ago

I am not sure what the problem is, but I observed that, on the cluster with Singularity, the pipeline fails when Function() (and possibly IdentityInterface?) nodes are used.

With "skip_stacks_ordering": true and "do_refine_hr_mask": true (i.e., using the custom interfaces), the two problematic nodes are bypassed and the processing runs fine.

Problematic nodes:

https://github.com/Medical-Image-Analysis-Laboratory/mialsuperresolutiontoolkit/blob/67835f55a5d9703c9129969ca1c5bd4f6ce9ec52/pymialsrtk/pipelines/anatomical/srr.py#L397-L398

and

https://github.com/Medical-Image-Analysis-Laboratory/mialsuperresolutiontoolkit/blob/67835f55a5d9703c9129969ca1c5bd4f6ce9ec52/pymialsrtk/pipelines/anatomical/srr.py#L306-L307

sebastientourbier commented 3 years ago

@pdedumast For the node srtkHRMask, what about creating a simple BinarizeImage interface (BaseInterface)?
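
For illustration, here is a minimal sketch of what such a BinarizeImage (BaseInterface) could look like. This is a hedged example only, assuming nibabel and numpy are available; the class names, default threshold, and output naming are placeholders, not the actual MIALSRTK implementation.

import os

import nibabel as nib
import numpy as np
from nipype.interfaces.base import (BaseInterface, BaseInterfaceInputSpec,
                                    File, TraitedSpec, traits)


class BinarizeImageInputSpec(BaseInterfaceInputSpec):
    input_image = File(exists=True, mandatory=True, desc='Image to binarize')
    threshold = traits.Float(0.0, usedefault=True,
                             desc='Voxels strictly above this value are set to 1')


class BinarizeImageOutputSpec(TraitedSpec):
    output_image = File(exists=True, desc='Binarized image')


class BinarizeImage(BaseInterface):
    """Binarize an image with a simple threshold (illustrative sketch only)."""

    input_spec = BinarizeImageInputSpec
    output_spec = BinarizeImageOutputSpec

    def _gen_output_filename(self):
        # Write the binarized image into the node's working directory.
        base = os.path.basename(self.inputs.input_image)
        return os.path.abspath(base.replace('.nii', '_bin.nii'))

    def _run_interface(self, runtime):
        img = nib.load(self.inputs.input_image)
        mask = (img.get_fdata() > self.inputs.threshold).astype(np.uint8)
        nib.save(nib.Nifti1Image(mask, img.affine), self._gen_output_filename())
        return runtime

    def _list_outputs(self):
        outputs = self._outputs().get()
        outputs['output_image'] = self._gen_output_filename()
        return outputs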

For the node stackOrdering, it is not clear to me what exactly is going wrong. When you run it with "skip_stacks_ordering": true, "do_refine_hr_mask": false, what is the terminal output?

pdedumast commented 3 years ago

@sebastientourbier This is what I had done for the BinarizeImage(BaseInterface) node.

However, since the process fails when "skip_stacks_ordering": false, the failure occurs when the StacksOrdering(BaseInterface) node is used, not the IdentityInterface one. I have no idea why...

The trace was:

exception calling callback for <Future at 0x2b43e20b8f98 state=finished raised BrokenProcessPool>
Traceback (most recent call last):
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/pipeline/plugins/multiproc.py", line 159, in _async_callback
    result = args.result()
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
pdedumast commented 3 years ago

         [MultiProc] Running 0 tasks, and 1 jobs ready. Free memory (GB): 113.24/113.24, Free processors: 2/2.
210120-15:39:31,27 nipype.workflow INFO:
         [Node] Setting-up "srr_pipeline.stackOrdering" in "/output_dir/nipype/sub-ctrl0030/ses-20170306142030/rec-1/srr_pipeline/stackOrdering".
210120-15:39:31,49 nipype.workflow INFO:
         [Node] Running "stackOrdering" ("pymialsrtk.interfaces.preprocess.StacksOrdering")
src/tcmalloc.cc:277] Attempt to free invalid pointer 0x14
Fatal Python error: Aborted

Current thread 0x00002ae8aa65ec80 (most recent call first):
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/skimage/measure/_moments.py", line 271 in moments_central
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/skimage/measure/_moments.py", line 195 in moments
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/pymialsrtk/interfaces/preprocess.py", line 1005 in _compute_motion_index
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/pymialsrtk/interfaces/preprocess.py", line 1026 in _compute_stack_order
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/pymialsrtk/interfaces/preprocess.py", line 969 in _run_interface
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/interfaces/base/core.py", line 419 in run
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 741 in _run_command
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 635 in _run_interface
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 516 in run
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/pipeline/plugins/multiproc.py", line 67 in run_node
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/process.py", line 175 in _process_worker
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/multiprocessing/process.py", line 93 in run
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/multiprocessing/process.py", line 258 in _bootstrap
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/multiprocessing/popen_fork.py", line 73 in _launch
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/multiprocessing/popen_fork.py", line 19 in __init__
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/multiprocessing/context.py", line 277 in _Popen
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/multiprocessing/context.py", line 223 in _Popen
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/multiprocessing/process.py", line 105 in start
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/process.py", line 446 in _adjust_process_count
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/process.py", line 427 in _start_queue_management_thread
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/process.py", line 466 in submit
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/pipeline/plugins/multiproc.py", line 153 in __init__
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/pipeline/engine/workflows.py", line 617 in run
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/pymialsrtk/pipelines/anatomical/srr.py", line 534 in run
  File "/opt/mialsuperresolutiontoolkit/docker/bidsapp/run.py", line 217 in main
  File "/opt/mialsuperresolutiontoolkit/docker/bidsapp/run.py", line 275 in <module>
210120-15:39:32,909 nipype.workflow INFO:
         [MultiProc] Running 1 tasks, and 0 jobs ready. Free memory (GB): 113.04/113.24, Free processors: 1/2.
                     Currently running:
                       * srr_pipeline.stackOrdering
exception calling callback for <Future at 0x2ae9745df438 state=finished raised BrokenProcessPool>
Traceback (most recent call last):
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/pipeline/plugins/multiproc.py", line 159, in _async_callback
    result = args.result()
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
sebastientourbier commented 3 years ago

@pdedumast Based on the error:

src/tcmalloc.cc:277] Attempt to free invalid pointer 0x14
Fatal Python error: Aborted

It seems related to memory allocation. Here is a similar issue: https://github.com/google/seq2seq/issues/119#issue-217508719

Based on this, a workaround could be to change the memory allocation library. There seem to be three different libraries for that: tcmalloc / jemalloc / ptmalloc (see https://stackoverflow.com/questions/9866145/what-are-the-differences-between-and-reasons-to-choose-tcmalloc-jemalloc-and-m)

Previously I had to install tcmalloc from google-perftools to make TensorFlow happy: https://github.com/Medical-Image-Analysis-Laboratory/mialsuperresolutiontoolkit/blob/108e537c480e3317145c389cee3adddb582a4e23/docker/bidsapp/Dockerfile#L58-L60

But according to that post, the best-suited library may depend on the application, and it seems Nipype with MultiProc is not happy with the tcmalloc library...

What is strange is that the Docker image runs correctly while the Singularity image fails with the same run configuration. Do you run the Singularity image with the --containall option?

pdedumast commented 3 years ago

@sebastientourbier I did use the --containall option.

I should mention that during my first tests, I noticed that this error seemed to occur differently depending on how --openmp_nb_of_cores and --nipype_nb_of_cores were set, though I could not identify exactly how...

pdedumast commented 3 years ago

@sebastientourbier I confirm that the pipeline runs perfectly well on our cluster when the job parameter #SBATCH --cpus-per-task=1 is used and the command line arguments --openmp_nb_of_cores and --nipype_nb_of_cores are not set.

hamzake commented 3 years ago

@sebastientourbier @pdedumast It is working as well with:

#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=1
..
--openmp_nb_of_cores 7 \
--nipype_nb_of_cores 1
sebastientourbier commented 3 years ago

@pdedumast When the command line arguments --openmp_nb_of_cores and --nipype_nb_of_cores are not set, the expected behavior should be that both openmp_nb_of_cores and nipype_nb_of_cores default to 1, is that correct?

@hamzake What happens with --openmp_nb_of_cores 1 and --nipype_nb_of_cores 7?

If this last test works, this confirms a conflict between OpenMP and Nipype threads.

To fix it, we could then consider:

  1. making the two argparse arguments mutually exclusive in pymialsrtk/parser.py, following the example at https://docs.python.org/3/library/argparse.html#mutual-exclusion (a minimal sketch is shown below)
  2. reviewing the lines that handle the two mutually exclusive parser arguments: https://github.com/Medical-Image-Analysis-Laboratory/mialsuperresolutiontoolkit/blob/108e537c480e3317145c389cee3adddb582a4e23/docker/bidsapp/run.py#L226-L238 This may require some changes to https://github.com/Medical-Image-Analysis-Laboratory/mialsuperresolutiontoolkit/blob/108e537c480e3317145c389cee3adddb582a4e23/docker/bidsapp/run.py#L53 and https://github.com/Medical-Image-Analysis-Laboratory/mialsuperresolutiontoolkit/blob/108e537c480e3317145c389cee3adddb582a4e23/docker/bidsapp/run.py#L19
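
As a starting point for point 1 above, a minimal, hypothetical sketch of the mutually exclusive group (illustrative only; the description string and defaults are assumptions, only the two argument names come from the current CLI):

import argparse


def get_parser():
    # Sketch of a parser where the two core-count options cannot be combined.
    parser = argparse.ArgumentParser(description='MIALSRTK BIDS App entrypoint (sketch only)')
    group = parser.add_mutually_exclusive_group()
    group.add_argument('--openmp_nb_of_cores', type=int, default=0,
                       help='Number of threads used by OpenMP (0 = determined automatically)')
    group.add_argument('--nipype_nb_of_cores', type=int, default=0,
                       help='Number of processes used by the Nipype MultiProc plugin '
                            '(0 = determined automatically)')
    return parser


if __name__ == '__main__':
    # Passing both "--openmp_nb_of_cores 4 --nipype_nb_of_cores 4" now exits with an
    # argparse error, while each option can still be used on its own.
    print(get_parser().parse_args())
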
hamzake commented 3 years ago

@sebastientourbier: "What's happening when --openmp_nb_of_cores 1 and --nipype_nb_of_cores 7?"

It's working.

The conflict may be linked to the SBATCH configuration, because it also works with the following:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
..
--openmp_nb_of_cores 4 \
--nipype_nb_of_cores 4
sebastientourbier commented 3 years ago

We do indeed have to be careful with the SBATCH configuration.

What we would like to achieve is to run the "singularity run ..." command for one participant label on one node, using the 16 available CPUs.

According to https://slurm.schedmd.com/sbatch.html, this might correspond to setting it as follows:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16

@hamzake In your last comment:

The conflict may be linked to the SBATCH configuration, because it also works with the following:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
..
--openmp_nb_of_cores 4 \
--nipype_nb_of_cores 4

With --openmp_nb_of_cores 4 and --nipype_nb_of_cores 4, this will use 4 * 4 = 16 CPUs. I am not sure how many CPUs are then available for each task. What is the value of OMP_NUM_THREADS printed in the terminal output? Also, this setup will distribute the job across two different nodes.
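
For reference, the reasoning above can be expressed as a small sanity check (a hedged sketch, not actual run.py code; the function name and the SLURM_CPUS_PER_TASK fallback are assumptions):

import multiprocessing
import os


def check_core_budget(openmp_nb_of_cores, nipype_nb_of_cores):
    # Nipype MultiProc may run up to nipype_nb_of_cores interfaces in parallel, each of
    # which can spawn openmp_nb_of_cores OpenMP threads, so the product is the worst-case
    # number of busy cores.
    requested = openmp_nb_of_cores * nipype_nb_of_cores
    # On SLURM, the CPUs granted to the task are usually exposed via SLURM_CPUS_PER_TASK;
    # fall back to the machine's CPU count otherwise.
    available = int(os.environ.get('SLURM_CPUS_PER_TASK', multiprocessing.cpu_count()))
    if requested > available:
        raise ValueError('Requested %d x %d = %d cores but only %d are available'
                         % (openmp_nb_of_cores, nipype_nb_of_cores, requested, available))
    return requested


if __name__ == '__main__':
    # The configuration discussed above: 4 OpenMP threads x 4 Nipype processes = 16 cores.
    try:
        print(check_core_budget(4, 4), 'cores requested, within budget')
    except ValueError as err:
        print(err)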

hamzake commented 3 years ago

@sebastientourbier The value of OMP_NUM_THREADS printed in the terminal output is 4, and the same value is reported for the number of cores used by the Nipype engine.

By running the pipeline with the following configuration:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16

I reproduce the same error as @pdedumast:

Attempt to free invalid pointer 0x14
Fatal Python error: Aborted