Closed pdedumast closed 3 years ago
@pdedumast For the node `srtkHRMask`, what about creating a simple interface `BinarizeImage(BaseInterface)`?
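For context, such a `BinarizeImage(BaseInterface)` node would essentially apply a voxel-wise threshold to the image data. The core logic could look like the minimal NumPy sketch below; the function name, threshold default, and dtype are illustrative assumptions, and the actual nipype interface wrapper and nibabel file I/O are omitted:

```python
import numpy as np

def binarize(data, threshold=0.0):
    """Voxel-wise binarization: 1 where data > threshold, 0 elsewhere.

    Sketch of the logic a BinarizeImage(BaseInterface) node could wrap;
    in the real interface the array would be loaded from a NIfTI file
    with nibabel and saved back after thresholding.
    """
    return (np.asarray(data) > threshold).astype(np.uint8)

mask = binarize([[-1.0, 0.5], [2.0, 0.0]])  # -> [[0, 1], [1, 0]]
```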
For the node `stackOrdering`, it is not clear to me what is going wrong exactly. When you run it with `"skip_stacks_ordering": true, "do_refine_hr_mask": false`, what is the terminal output?
@sebastientourbier This is what I had done for the `BinarizeImage(BaseInterface)` node. However, the process is failing when `"skip_stacks_ordering": false`, i.e. when using the `StacksOrdering(BaseInterface)` node and not the `IdentityInterface` one. No idea why...

The trace was:
```
exception calling callback for <Future at 0x2b43e20b8f98 state=finished raised BrokenProcessPool>
Traceback (most recent call last):
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/pipeline/plugins/multiproc.py", line 159, in _async_callback
    result = args.result()
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

[MultiProc] Running 0 tasks, and 1 jobs ready. Free memory (GB): 113.24/113.24, Free processors: 2/2.
210120-15:39:31,27 nipype.workflow INFO:
	 [Node] Setting-up "srr_pipeline.stackOrdering" in "/output_dir/nipype/sub-ctrl0030/ses-20170306142030/rec-1/srr_pipeline/stackOrdering".
210120-15:39:31,49 nipype.workflow INFO:
	 [Node] Running "stackOrdering" ("pymialsrtk.interfaces.preprocess.StacksOrdering")
src/tcmalloc.cc:277] Attempt to free invalid pointer 0x14
Fatal Python error: Aborted

Current thread 0x00002ae8aa65ec80 (most recent call first):
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/skimage/measure/_moments.py", line 271 in moments_central
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/skimage/measure/_moments.py", line 195 in moments
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/pymialsrtk/interfaces/preprocess.py", line 1005 in _compute_motion_index
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/pymialsrtk/interfaces/preprocess.py", line 1026 in _compute_stack_order
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/pymialsrtk/interfaces/preprocess.py", line 969 in _run_interface
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/interfaces/base/core.py", line 419 in run
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 741 in _run_command
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 635 in _run_interface
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/pipeline/engine/nodes.py", line 516 in run
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/pipeline/plugins/multiproc.py", line 67 in run_node
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/process.py", line 175 in _process_worker
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/multiprocessing/process.py", line 93 in run
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/multiprocessing/process.py", line 258 in _bootstrap
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/multiprocessing/popen_fork.py", line 73 in _launch
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/multiprocessing/popen_fork.py", line 19 in __init__
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/multiprocessing/context.py", line 277 in _Popen
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/multiprocessing/context.py", line 223 in _Popen
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/multiprocessing/process.py", line 105 in start
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/process.py", line 446 in _adjust_process_count
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/process.py", line 427 in _start_queue_management_thread
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/process.py", line 466 in submit
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/pipeline/plugins/multiproc.py", line 153 in __init__
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/pipeline/engine/workflows.py", line 617 in run
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/pymialsrtk/pipelines/anatomical/srr.py", line 534 in run
  File "/opt/mialsuperresolutiontoolkit/docker/bidsapp/run.py", line 217 in main
  File "/opt/mialsuperresolutiontoolkit/docker/bidsapp/run.py", line 275 in <module>

210120-15:39:32,909 nipype.workflow INFO:
	 [MultiProc] Running 1 tasks, and 0 jobs ready. Free memory (GB): 113.04/113.24, Free processors: 1/2.
                     Currently running:
                       * srr_pipeline.stackOrdering
exception calling callback for <Future at 0x2ae9745df438 state=finished raised BrokenProcessPool>
Traceback (most recent call last):
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/site-packages/nipype/pipeline/plugins/multiproc.py", line 159, in _async_callback
    result = args.result()
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/opt/conda/envs/pymialsrtk-env/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
```
@pdedumast Based on the error:

```
src/tcmalloc.cc:277] Attempt to free invalid pointer 0x14
Fatal Python error: Aborted
```
It seems related to memory allocation. Here is a similar issue: https://github.com/google/seq2seq/issues/119#issue-217508719

Based on this, a workaround could be to change the library used for memory allocation. There seem to be three different libraries for that: tcmalloc / jemalloc / ptmalloc (see https://stackoverflow.com/questions/9866145/what-are-the-differences-between-and-reasons-to-choose-tcmalloc-jemalloc-and-m).

Previously I had to install tcmalloc from google-perftools to make TensorFlow happy: https://github.com/Medical-Image-Analysis-Laboratory/mialsuperresolutiontoolkit/blob/108e537c480e3317145c389cee3adddb582a4e23/docker/bidsapp/Dockerfile#L58-L60

But following that post, the best-suited library might depend on the application, and it seems Nipype with MultiProc is not happy with the tcmalloc library...
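If tcmalloc really is the culprit, one way to test that hypothesis without rebuilding the image (an assumption on my side; the jemalloc path is illustrative and depends on what is installed in the image) would be to swap the allocator at runtime via `LD_PRELOAD`:

```shell
# Fall back to glibc's default ptmalloc by clearing any tcmalloc preload:
unset LD_PRELOAD

# ...or, alternatively, try jemalloc (path is an assumption, adjust to the image):
# export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

# then rerun the failing pipeline unchanged, e.g.:
# singularity run --containall <image> ...
```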
What is strange is that the Docker image runs correctly while the Singularity image fails with the same run configuration. Do you run the Singularity image with the option `--containall`?
@sebastientourbier I did use the `--containall` option.

I should mention that during my first tests, I noticed that this error may occur differently depending on how `--openmp_nb_of_cores` and `--nipype_nb_of_cores` are set, though I did not identify how...
@sebastientourbier I confirm that the pipeline runs perfectly well on our cluster when the job parameter `#SBATCH --cpus-per-task=1` is used and the command line arguments `--openmp_nb_of_cores` and `--nipype_nb_of_cores` are not set.
@sebastientourbier @pdedumast It is working as well with:

```
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=1
..
--openmp_nb_of_cores 7 \
--nipype_nb_of_cores 1
```
@pdedumast When the command line arguments `--openmp_nb_of_cores` and `--nipype_nb_of_cores` are not set, the behavior should be to set both `openmp_nb_of_cores` and `nipype_nb_of_cores` to 1, is that correct?
@hamzake What's happening with `--openmp_nb_of_cores 1` and `--nipype_nb_of_cores 7`?

If this last test works, it confirms a conflict between OpenMP and Nipype threads.
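One common way to avoid this kind of oversubscription (a general mitigation sketch, not an existing pymialsrtk option) is to pin OpenMP to a single thread per process before the MultiProc workflow is launched:

```python
import os

# Must be set before the OpenMP runtime (and the workflow) starts, so
# each Nipype MultiProc worker uses a single OpenMP thread and the
# workers do not oversubscribe the available CPUs.
os.environ["OMP_NUM_THREADS"] = "1"

# ... then run the workflow, e.g.:
# wf.run(plugin="MultiProc", plugin_args={"n_procs": 7})
```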
To fix it, we could then think of making these two options mutually exclusive (in `pymialsrtk/parser.py`), following the example at https://docs.python.org/3/library/argparse.html#mutual-exclusion

@sebastientourbier: "What's happening with --openmp_nb_of_cores 1 and --nipype_nb_of_cores 7?"
It's working.
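For reference, the mutual-exclusion suggestion above could be sketched as follows (the option names come from this thread, but the exact placement in `pymialsrtk/parser.py` and the defaults are assumptions):

```python
import argparse

# Sketch: forbid combining the two CPU options, following the argparse
# mutual-exclusion example linked above.
parser = argparse.ArgumentParser(description="mutually exclusive CPU options (sketch)")
group = parser.add_mutually_exclusive_group()
group.add_argument("--openmp_nb_of_cores", type=int, default=1,
                   help="number of OpenMP threads (cannot be combined with --nipype_nb_of_cores)")
group.add_argument("--nipype_nb_of_cores", type=int, default=1,
                   help="number of Nipype MultiProc workers (cannot be combined with --openmp_nb_of_cores)")

args = parser.parse_args(["--openmp_nb_of_cores", "7"])
# Passing both options at once would make the parser exit with an error.
```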
The conflict is maybe linked to the SBATCH configuration, because it is also working with the following:

```
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
..
--openmp_nb_of_cores 4 \
--nipype_nb_of_cores 4
```
We indeed have to be careful with the SBATCH configuration.

What we would like to achieve is to run the `singularity run ...` command for one participant label on one node, using the 16 available CPUs. From https://slurm.schedmd.com/sbatch.html, this might correspond to the following setting:

```
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
```
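Put together, a job script for that goal might look like the sketch below. The `singularity run` invocation follows the usual BIDS-App pattern; the image path, BIDS directories, participant label, and core counts are placeholders to be adapted:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16

# One participant, one node, 16 CPUs in total; choose the OpenMP/Nipype
# split so that openmp_nb_of_cores * nipype_nb_of_cores <= cpus-per-task.
singularity run --containall \
    /path/to/mialsuperresolutiontoolkit.simg \
    /path/to/bids_dir /path/to/output_dir participant \
    --participant_label <label> \
    --openmp_nb_of_cores <N_OMP> \
    --nipype_nb_of_cores <N_NIPYPE>
```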
@hamzake In your last comment:

> The conflict is maybe linked to the SBATCH configuration because it is also working with the following:
> SBATCH --nodes=2
> SBATCH --ntasks-per-node=2
> .. --openmp_nb_of_cores 4 --nipype_nb_of_cores 4

with `--openmp_nb_of_cores 4` and `--nipype_nb_of_cores 4`, this will use 4 * 4 = 16 CPUs. I am not sure how many CPUs are then available for the task. What is the value of OMP_NUM_THREADS printed in the terminal output? Also, this will distribute a task between two different nodes.
@sebastientourbier The value of OMP_NUM_THREADS printed in the terminal output is 4, and the same value is reported for the number of cores used by the Nipype engine.

By running the pipeline with the following configuration:

```
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
```

I reproduce the same error as @pdedumast:

```
Attempt to free invalid pointer 0x14
Fatal Python error: Aborted
```
Not sure what the problem is; I observed that, on the cluster with Singularity, the pipeline fails when `Function()` (and `IdentityInterface`?) nodes are used. With the custom interfaces and `"skip_stacks_ordering": true, "do_refine_hr_mask": true`, the two problematic nodes are bypassed and the processing goes well.

Problematic nodes:
https://github.com/Medical-Image-Analysis-Laboratory/mialsuperresolutiontoolkit/blob/67835f55a5d9703c9129969ca1c5bd4f6ce9ec52/pymialsrtk/pipelines/anatomical/srr.py#L397-L398
and
https://github.com/Medical-Image-Analysis-Laboratory/mialsuperresolutiontoolkit/blob/67835f55a5d9703c9129969ca1c5bd4f6ce9ec52/pymialsrtk/pipelines/anatomical/srr.py#L306-L307