PennLINC / qsiprep

Preprocessing of diffusion MRI
http://qsiprep.readthedocs.io
BSD 3-Clause "New" or "Revised" License

synthseg still crashing/ Add synthseg csv to results #613

Open araikes opened 1 year ago

araikes commented 1 year ago

Hi @mattcieslak,

I pulled 0.19.0 and am testing it. SynthSeg is still crashing as it did in #598. A quick check of the error file suggests that SynthSeg's thread count is still being inherited from --omp-nthreads (see the command-line call in the error).

Explicitly setting --omp-nthreads 1 enables SynthSeg to finish but obviously negates the multi-threading that other processes can take advantage of.

singularity run --containall --writable-tmpfs -B $PWD/nifti:$PWD/nifti:ro \
-B $PWD/derivatives/qsiprep:$PWD/derivatives/qsiprep \
-B /home/centos/singularity_images/license.txt:/license.txt \
-B /tmp:/tmp \
$QSI \
$PWD/nifti \
$PWD/derivatives/qsiprep \
participant \
--participant-label 0024 \
--output-resolution 1 \
--denoise-method patch2self \
--unringing-method mrdegibbs \
--b1-biascorrect-stage final \
-w /tmp \
--fs-license-file /license.txt \
--nthreads 4 --omp-nthreads 2

...

230813-23:26:23,99 nipype.workflow INFO:
         Running with omp_nthreads=2, nthreads=4
230813-23:26:23,99 nipype.workflow IMPORTANT:

    Running qsiprep version 0.19.0:
      * BIDS dataset path: /home/Public/<redacted>/nifti.
      * Participant list: ['0024'].
      * Run identifier: 20230813-232622_4c64af0b-392b-49ae-8aef-e37432cbd78f.

...

230813-23:20:56,601 nipype.workflow ERROR:
         Node synthseg failed to run on host ip-172-31-13-236.us-east-2.compute.internal.
230813-23:20:56,605 nipype.workflow ERROR:
         Saving crash info to /home/Public/<redacted>/derivatives/qsiprep/qsiprep/sub-0024/log/20230813-231752_3a393681-f2b9-417c-9e88-bd462d9c4e47/crash-20230813-232056-qsiprep-synthseg-56880fd9-4d14-4532-81ff-610265b4159c.txt
Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/plugins/multiproc.py", line 67, in run_node
    result["result"] = node.run(updatehash=updatehash)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 527, in run
    result = self._run_interface(execute=True)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 645, in _run_interface
    return self._run_command(execute)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 771, in _run_command
    raise NodeExecutionError(msg)
nipype.pipeline.engine.nodes.NodeExecutionError: Exception raised while executing Node synthseg.

Cmdline:
        mri_synthseg --i /tmp/qsiprep_wf/single_subject_0024_wf/anat_preproc_wf/pad_anat_reference_wf/resample_skulled_to_reference/sub-0024_ses-11_T1w_lps_trans.nii.gz --threads 2 --post sub-0024_ses-11_T1w_lps_trans_post.nii.gz --qc sub-0024_ses-11_T1w_lps_trans_qc.csv --o sub-0024_ses-11_T1w_lps_trans_aseg.nii.gz
Stdout:

Stderr:
        Aborted
Traceback:
        Traceback (most recent call last):
          File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/interfaces/base/core.py", line 453, in aggregate_outputs
            setattr(outputs, key, val)
          File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/interfaces/base/traits_extension.py", line 330, in validate
            value = super(File, self).validate(objekt, name, value, return_pathlike=True)
          File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/interfaces/base/traits_extension.py", line 135, in validate
            self.error(objekt, name, str(value))
          File "/usr/local/miniconda/lib/python3.8/site-packages/traits/base_trait_handler.py", line 74, in error
            raise TraitError(
        traits.trait_errors.TraitError: The 'out_post' trait of a _SynthSegOutputSpec instance must be a pathlike object or string representing an existing file, but a value of '/tmp/qsiprep_wf/single_subject_0024_wf/anat_preproc_wf/synthseg_anat_wf/synthseg/sub-0024_ses-11_T1w_lps_trans_post.nii.gz' <class 'str'> was specified.

        During handling of the above exception, another exception occurred:

        Traceback (most recent call last):
          File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/interfaces/base/core.py", line 400, in run
            outputs = self.aggregate_outputs(runtime)
          File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/interfaces/base/core.py", line 460, in aggregate_outputs
            raise FileNotFoundError(msg)
        FileNotFoundError: No such file or directory '/tmp/qsiprep_wf/single_subject_0024_wf/anat_preproc_wf/synthseg_anat_wf/synthseg/sub-0024_ses-11_T1w_lps_trans_post.nii.gz' for output 'out_post' of a SynthSeg interface
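
One way to confirm which thread count actually reached mri_synthseg is to read it back from the Cmdline entry of the crash file. The following is a minimal sketch using only the Python standard library; `threads_from_cmdline` is a hypothetical helper name, and the input path in the example is shortened for readability.

```python
import shlex


def threads_from_cmdline(cmdline):
    """Return the integer passed to --threads, or None if the flag is absent."""
    tokens = shlex.split(cmdline)
    for i, tok in enumerate(tokens):
        # Handle "--threads N"
        if tok == "--threads" and i + 1 < len(tokens):
            return int(tokens[i + 1])
        # Handle "--threads=N"
        if tok.startswith("--threads="):
            return int(tok.split("=", 1)[1])
    return None


# Command line copied from the crash report above (input path shortened).
cmd = (
    "mri_synthseg --i sub-0024_ses-11_T1w_lps_trans.nii.gz --threads 2 "
    "--post sub-0024_ses-11_T1w_lps_trans_post.nii.gz "
    "--qc sub-0024_ses-11_T1w_lps_trans_qc.csv "
    "--o sub-0024_ses-11_T1w_lps_trans_aseg.nii.gz"
)
print(threads_from_cmdline(cmd))  # 2
```

A value other than 1 here means the thread setting passed on the QSIPrep command line reached SynthSeg.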

araikes commented 1 year ago

So, oddly, I'm able to run the same call on the same data on a different system and have it complete without issue.

cookpa commented 1 year ago

> So, oddly, I'm able to run the same call on the same data on a different system and have it complete without issue.

Are the CPUs different? I'm wondering if TensorFlow is optimizing on the fly for different machines and using more memory on some of them.

araikes commented 1 year ago

I think it's a couple of things (mostly the combination of 3-5):

  1. Different CPUs (AWS: AMD EPYC 7R32 with 16 cores vs. HPC: 2x AMD EPYC 7642 48-core processors)
  2. Different RAM (AWS: 30 GB total vs. HPC: 503 GB total at 5 GB/core)
  3. Threading in synthseg. Watching htop on the HPC, mri_synthseg is still inheriting multithreading instead of running single-threaded.
  4. A Singularity quirk, plus synthseg's resource usage being out of control. I spun up an HPC job with 12 cores and 5 GB RAM/core (so 60 GB should be available to me). Singularity is very clearly using all of the memory on the system rather than respecting the job allocation.
  5. QSIPrep's --mem-mb and --low-mem clearly don't work as described. On the HPC, I ran QSIPrep with --nthreads 12 --omp-nthreads 8 --mem-mb 100, and synthseg finished without any complaint; see the screenshot below for resource use in synthseg. Same result for --low-mem. --mem-mb 100 should crash everything almost immediately, yet the run completes without issue.

[Screenshot: resource usage while synthseg was running]
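
To check whether a job allocation (for example, 12 cores at 5 GB/core, so 60 GB) is actually visible to a process, one option is to compare the Linux cgroup memory limit with the total system RAM, both inside and outside the container. The sketch below makes several assumptions: it is Linux-only, it relies on the standard cgroup file names (`memory.max` for v2, `memory.limit_in_bytes` for v1), the v1 path shown is the common default rather than a guaranteed location, and all function names are hypothetical. It only inspects the process's own cgroup, so a limit set on a parent cgroup would be missed.

```python
from pathlib import Path


def parse_cgroup_limit(text):
    """Bytes allowed by a cgroup memory file, or None if it reports no limit."""
    text = text.strip()
    if text == "max":  # cgroup v2 spelling of "unlimited"
        return None
    value = int(text)
    # cgroup v1 reports "unlimited" as a very large number close to 2**63.
    return None if value >= 2**60 else value


def parse_mem_total(meminfo_text):
    """Total RAM in bytes from the contents of /proc/meminfo (MemTotal is in kB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    return None


def v2_cgroup_dir(proc_self_cgroup_text, root="/sys/fs/cgroup"):
    """Directory of this process's cgroup v2 group, or None if no v2 entry exists."""
    for line in proc_self_cgroup_text.splitlines():
        if line.startswith("0::"):
            return Path(root) / line[3:].lstrip("/")
    return None


def report():
    """Print system RAM next to the cgroup limit seen by this process."""
    total = parse_mem_total(Path("/proc/meminfo").read_text())
    group = v2_cgroup_dir(Path("/proc/self/cgroup").read_text())
    if group is not None:
        limit_file = group / "memory.max"
    else:
        # Assumed default v1 location; it may differ per scheduler.
        limit_file = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")
    try:
        limit = parse_cgroup_limit(limit_file.read_text())
    except (OSError, ValueError):
        limit = None
    print(f"system RAM: {total} bytes; cgroup limit: {limit}")
```

Calling `report()` from the job shell and again from inside the container would show whether the two see the same limit. If the limit is None in both places, the scheduler may not be enforcing memory through cgroups at all, which would be consistent with the container seeing the whole machine.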

Error log from AWS attempt:

Node: qsiprep_wf.single_subject_0024_wf.anat_preproc_wf.synthseg_anat_wf.synthseg
Working directory: /tmp/qsiprep_wf/single_subject_0024_wf/anat_preproc_wf/synthseg_anat_wf/synthseg

Node inputs:

args = <undefined>
environ = {'OMP_NUM_THREADS': '1'}
fast = False
input_image = /tmp/qsiprep_wf/single_subject_0024_wf/anat_preproc_wf/pad_anat_reference_wf/resample_skulled_to_reference/sub-0024_ses-11_T1w_lps_trans.nii.gz
num_threads = 8
out_post = <undefined>
out_qc = <undefined>
out_seg = <undefined>
robust = <undefined>
subjects_dir = <undefined>

Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/plugins/multiproc.py", line 67, in run_node
    result["result"] = node.run(updatehash=updatehash)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 527, in run
    result = self._run_interface(execute=True)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 645, in _run_interface
    return self._run_command(execute)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 771, in _run_command
    raise NodeExecutionError(msg)
nipype.pipeline.engine.nodes.NodeExecutionError: Exception raised while executing Node synthseg.

Cmdline:
        mri_synthseg --i /tmp/qsiprep_wf/single_subject_0024_wf/anat_preproc_wf/pad_anat_reference_wf/resample_skulled_to_reference/sub-0024_ses-11_T1w_lps_trans.nii.gz --threads 8 --post sub-0024_ses-11_T1w_lps_trans_post.nii.gz --qc sub-0024_ses-11_T1w_lps_trans_qc.csv --o sub-0024_ses-11_T1w_lps_trans_aseg.nii.gz
Stdout:

Stderr:
        Aborted
Traceback:
        Traceback (most recent call last):
          File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/interfaces/base/core.py", line 453, in aggregate_outputs
            setattr(outputs, key, val)
          File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/interfaces/base/traits_extension.py", line 330, in validate
            value = super(File, self).validate(objekt, name, value, return_pathlike=True)
          File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/interfaces/base/traits_extension.py", line 135, in validate
            self.error(objekt, name, str(value))
          File "/usr/local/miniconda/lib/python3.8/site-packages/traits/base_trait_handler.py", line 74, in error
            raise TraitError(
        traits.trait_errors.TraitError: The 'out_post' trait of a _SynthSegOutputSpec instance must be a pathlike object or string representing an existing file, but a value of '/tmp/qsiprep_wf/single_subject_0024_wf/anat_preproc_wf/synthseg_anat_wf/synthseg/sub-0024_ses-11_T1w_lps_trans_post.nii.gz' <class 'str'> was specified.

        During handling of the above exception, another exception occurred:

        Traceback (most recent call last):
          File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/interfaces/base/core.py", line 400, in run
            outputs = self.aggregate_outputs(runtime)
          File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/interfaces/base/core.py", line 460, in aggregate_outputs
            raise FileNotFoundError(msg)
        FileNotFoundError: No such file or directory '/tmp/qsiprep_wf/single_subject_0024_wf/anat_preproc_wf/synthseg_anat_wf/synthseg/sub-0024_ses-11_T1w_lps_trans_post.nii.gz' for output 'out_post' of a SynthSeg interface

mattcieslak commented 1 year ago

Thanks for the detailed report @araikes. I can confirm that --low-mem and --mem-gb are not respected in most of qsiprep.

The fact that synthseg is still getting --threads=2 in its command line is surprising. I hard-code the value to 1 here, but the value in the interface must be getting overwritten by the value in the node. The value is set to omp-nthreads in the node to prevent too many other tasks from running at the same time, which can also cause memory issues. Would you be up for testing a patch from the unstable tag, @araikes?
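
The distinction described above, between the thread count passed to the command and the number of processors reserved in the scheduler, can be illustrated with a toy model. This is not nipype or QSIPrep code: `ToyInterface`, `ToyNode`, and `sync_threads` are hypothetical names invented for the illustration, and the overwrite behavior is modeled only as the suspected cause.

```python
from dataclasses import dataclass


@dataclass
class ToyInterface:
    """Stands in for the command wrapper; num_threads ends up on the command line."""
    num_threads: int = 1

    def cmdline(self):
        return f"mri_synthseg --threads {self.num_threads}"


@dataclass
class ToyNode:
    """Stands in for a workflow node; n_procs is what the scheduler reserves."""
    interface: ToyInterface
    n_procs: int = 1
    sync_threads: bool = True  # models the suspected node-overwrites-interface behavior

    def __post_init__(self):
        if self.sync_threads:
            # The node's reservation silently replaces the pinned thread count.
            self.interface.num_threads = self.n_procs


# Suspected behavior: the pinned value of 1 is replaced by the node's 8.
overwritten = ToyNode(ToyInterface(num_threads=1), n_procs=8)
print(overwritten.interface.cmdline())  # mri_synthseg --threads 8

# Intended behavior: reserve 8 slots, but still run the command single-threaded.
pinned = ToyNode(ToyInterface(num_threads=1), n_procs=8, sync_threads=False)
print(pinned.interface.cmdline(), "| reserved:", pinned.n_procs)
```

In the second case the scheduler still holds back 8 processors, which keeps other memory-hungry tasks from starting alongside synthseg, while the command itself runs with one thread.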

araikes commented 1 year ago

Yeah, I can test it

araikes commented 1 year ago

It made it past synthseg on AWS using the unstable tag. However, it appears that it is still attempting to allocate 8 processors to synthseg (even if it only ended up using 1).

230816-23:33:32,692 nipype.workflow INFO:
         [MultiProc] Running 1 tasks, and 1 jobs ready. Free memory (GB): 23.24/23.44, Free processors: 4/12.
                     Currently running:
                       * qsiprep_wf.single_subject_0024_wf.anat_preproc_wf.synthseg_anat_wf.synthseg
230816-23:43:23,276 nipype.workflow INFO:
         [Node] Finished "synthseg", elapsed time 596.488982s.
230816-23:43:25,240 nipype.workflow INFO:
         [Job 21] Completed (qsiprep_wf.single_subject_0024_wf.anat_preproc_wf.synthseg_anat_wf.synthseg).

Just waiting for it to finish the normalization workflow at this point...
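
The reservation can be read directly off the MultiProc status line in the log above: total processors minus free processors gives what the running tasks hold. A minimal sketch, assuming the status line keeps the exact format shown in this log; `reserved_resources` is a hypothetical helper name.

```python
import re

# Pattern written against the status line format shown in the log above.
STATUS = re.compile(
    r"Running (?P<tasks>\d+) tasks, and (?P<ready>\d+) jobs ready\. "
    r"Free memory \(GB\): (?P<free_mem>[\d.]+)/(?P<total_mem>[\d.]+), "
    r"Free processors: (?P<free_procs>\d+)/(?P<total_procs>\d+)\."
)


def reserved_resources(line):
    """Return (processors, memory in GB) held by running tasks, or None if no match."""
    m = STATUS.search(line)
    if m is None:
        return None
    procs = int(m["total_procs"]) - int(m["free_procs"])
    mem_gb = round(float(m["total_mem"]) - float(m["free_mem"]), 2)
    return procs, mem_gb


line = (
    "[MultiProc] Running 1 tasks, and 1 jobs ready. "
    "Free memory (GB): 23.24/23.44, Free processors: 4/12."
)
print(reserved_resources(line))  # (8, 0.2)
```

With synthseg as the only running task, the 8 reserved processors match the --omp-nthreads value, even though the command itself ran with a single thread.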

mattcieslak commented 1 year ago

Yay! That's how I was hoping it would work. Early in the pipeline there are a lot of high-memory things happening. You assigned omp-nthreads to 8 here?

araikes commented 1 year ago

Yeah, I threw 8 in just to challenge things.

On our AWS server, anatomical+diffusion processing worked without issue on the unstable tag.

mattcieslak commented 1 year ago

perfect. This will be in 0.19.1

araikes commented 1 year ago

Question: What happens to the QC CSV from SynthSeg? It's getting written but I don't see it as a file anywhere in my outputs.

mattcieslak commented 1 year ago

that would be great to add to the anatomical derivatives. Sounds like another good contender for 0.19.1