PennLINC / qsiprep

Preprocessing of diffusion MRI
http://qsiprep.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Xvfb processes persist after qsiprep exits #336

Closed cookpa closed 2 years ago

cookpa commented 2 years ago

We've been having an issue where qsirecon jobs run indefinitely, even after qsiprep exits. I ran ps after calling singularity run, and it seems that an xvfb process persists after singularity exits. Example:

Xvfb :1324577801 -screen 0 800x680x24 -nolisten tcp

This happens after running a DSI Studio reconstruction with --recon_only and the following recon spec:

{
  "name": "sample_recon",
  "space": "T1w",
  "atlases": ["schaefer100x7"],
  "nodes": [
    {
      "name": "dsistudio_gqi",
      "software": "DSI Studio",
      "action": "reconstruction",
      "input": "qsiprep",
      "output_suffix": "gqi",
      "parameters": {"method": "gqi"}
    },
    {
      "name": "scalar_export",
      "software": "DSI Studio",
      "action": "export",
      "input": "dsistudio_gqi",
      "output_suffix": "gqiscalar"
    }
  ]
}

The qsiprep version is 0.14.3 and the system is a Linux HPC (@mattcieslak it's PMACS, happy to share more details and a runnable example if it would help).

Others investigating this have found that it only happens with bsub jobs, and not when the job is run interactively. So I'm thinking that changing how DISPLAY is set might avoid the issue.

Possibly related to https://github.com/PennLINC/qsiprep/issues/195

Possibly also related, it seems others have had trouble keeping track of xvfb processes https://github.com/nipy/nipype/issues/1403
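
For anyone trying to reproduce the check described above (running ps once singularity has returned), here is a minimal sketch in Python. It assumes a Linux `ps` that accepts `-eo pid,ppid,args`. The helper names and the PIDs in the sample text are made up for illustration; only the Xvfb command line is taken from this report. A parent PID of 1 is used as a rough hint that the process was orphaned, which may not hold on systems that use a different reaper process.

```python
import subprocess


def parse_xvfb(ps_output):
    """Return (pid, ppid, command) for each Xvfb row in `ps -eo pid,ppid,args` output."""
    found = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) < 3:
            continue
        pid, ppid, args = parts
        # The header row ("PID PPID COMMAND") is skipped by this numeric check.
        if not (pid.isdigit() and ppid.isdigit()):
            continue
        # Compare only the executable's base name, so /usr/bin/Xvfb also matches.
        exe = args.split()[0].rsplit("/", 1)[-1]
        if exe == "Xvfb":
            found.append((int(pid), int(ppid), args))
    return found


def list_xvfb():
    """Run ps and return the Xvfb processes currently visible to this user."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_xvfb(out)


# Illustrative ps output: the PIDs are invented, the Xvfb line is from this issue.
SAMPLE = """\
  PID  PPID COMMAND
 4242     1 Xvfb :1324577801 -screen 0 800x680x24 -nolisten tcp
 4300  4100 ps -eo pid,ppid,args
"""

if __name__ == "__main__":
    for pid, ppid, cmd in list_xvfb():
        hint = " (parent is PID 1, possibly orphaned)" if ppid == 1 else ""
        print(f"{pid}{hint}: {cmd}")
```

Running this inside the batch job right after the singularity call returns should show whether an Xvfb is still alive at that point.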

cookpa commented 2 years ago

Tagging @jeffrey-phillips and @jeffduda who are also looking into this

mattcieslak commented 2 years ago

I'm working on updating the docker image and all the python dependencies, including nipype. If that doesn't fix this, we should look into messing around with DISPLAY.

mattcieslak commented 2 years ago

@cookpa could you try this with pennbbl/qsiprep:unstable? Lots of updates, maybe this will work now

cookpa commented 2 years ago

@mattcieslak unstable gives me an error. This is on data prepped with 0.14.3 - is unstable compatible with that?

labelconvert: uncompressing image "/tmp/qsirecon_wf/sub-999999_sample_recon/recon_wf/_dwi_file_..data..preprocess..sub-999999..ses-MR1..dwi..sub-999999_ses-MR1_acq-98dir_space-T1w_desc-preproc_dwi.nii.gz/get_atlases/schaefer100x7MNI_lps_to_dwi.nii.gz"... [==================================================]
labelconvert: Verifying parcellation image... [==================================================]
labelconvert: uncompressing image "/tmp/qsirecon_wf/sub-999999_sample_recon/recon_wf/_dwi_file_..data..preprocess..sub-999999..ses-MR1..dwi..sub-999999_ses-MR1_acq-98dir_space-T1w_desc-preproc_dwi.nii.gz/get_atlases/schaefer100x7MNI_lps_to_dwi.nii.gz"... [==================================================]
QSIPrep failed: Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/plugins/multiproc.py", line 67, in run_node
    result["result"] = node.run(updatehash=updatehash)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 521, in run
    result = self._run_interface(execute=True)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 639, in _run_interface
    return self._run_command(execute)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 750, in _run_command
    raise NodeExecutionError(
nipype.pipeline.engine.nodes.NodeExecutionError: Exception raised while executing Node create_src.

RuntimeError: subprocess exited with code 127.

Traceback (most recent call last):
  File "/usr/local/miniconda/bin/qsiprep", line 8, in <module>
    sys.exit(main())
  File "/usr/local/miniconda/lib/python3.8/site-packages/qsiprep/cli/run.py", line 647, in main
    qsiprep_wf.run(**plugin_settings)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/workflows.py", line 638, in run
    runner.run(execgraph, updatehash=updatehash, config=self.config)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/plugins/base.py", line 166, in run
    self._clean_queue(jobid, graph, result=result)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/plugins/base.py", line 244, in _clean_queue
    raise RuntimeError("".join(result["traceback"]))
RuntimeError: Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/plugins/multiproc.py", line 67, in run_node
    result["result"] = node.run(updatehash=updatehash)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 521, in run
    result = self._run_interface(execute=True)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 639, in _run_interface
    return self._run_command(execute)
  File "/usr/local/miniconda/lib/python3.8/site-packages/nipype/pipeline/engine/nodes.py", line 750, in _run_command
    raise NodeExecutionError(
nipype.pipeline.engine.nodes.NodeExecutionError: Exception raised while executing Node create_src.

RuntimeError: subprocess exited with code 127.

mattcieslak commented 2 years ago

Since the xvfb part seems ok now I'm going to close this issue

cookpa commented 2 years ago

The libQT fix made the recon workflow run again, and now the undead xvfb process is back.

cookpa commented 2 years ago

Here's an interesting twist. If I replace "singularity run [options] qsiprep.sif [qsiprep args]" with "singularity exec [options] xvfb-run qsiprep [qsiprep args]", the Xvfb process still outlives the call to singularity, but does not block job termination.

I also looked at nipype a little bit. Perhaps calling this function before exiting would fix the issue?

https://github.com/nipy/nipype/blob/6c060304f380c46b2f05c5afdc7171dbbdfadc58/nipype/utils/config.py#L370-L374

mattcieslak commented 2 years ago

I'm starting to suspect that the xvfb stuff is a result of calling the rendering code within a nipype SimpleInterface (which is python) and not through a CommandLine interface. For the CommandLine interfaces, nipype should start an Xvfb for the run and kill it after the run completes.

So I'm making a stand-alone program that does the plotting, which will be run under xvfb-run by nipype. I'm also going to try to reduce the amount of memory this requires, because it's way too much right now.
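
As a rough sketch of what wrapping a stand-alone program in xvfb-run looks like, the helper below builds the argument list. The program name `plot_recon` and its argument are hypothetical placeholders, since the actual plotting program is not named in this thread. The screen geometry is copied from the Xvfb command line at the top of the issue. `-a` (pick a free server number) and `-s` (arguments passed on to Xvfb) are standard xvfb-run options. How nipype itself would launch this is not shown.

```python
import shlex


def xvfb_run_argv(program_argv, screen="800x680x24"):
    """Wrap a command so that it runs under its own temporary Xvfb server."""
    server_args = f"-screen 0 {screen} -nolisten tcp"
    # xvfb-run starts Xvfb, runs the wrapped command, then shuts Xvfb down.
    return ["xvfb-run", "-a", "-s", server_args, *program_argv]


if __name__ == "__main__":
    # "plot_recon" and its input file are placeholders, not real qsiprep names.
    argv = xvfb_run_argv(["plot_recon", "input.fib.gz"])
    print(shlex.join(argv))
    # To actually run it: subprocess.run(argv, check=True)
```

Because xvfb-run owns the Xvfb it starts, the display's lifetime is tied to the wrapped command rather than to the long-lived Python process.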