FCP-INDI / C-PAC

Configurable Pipeline for the Analysis of Connectomes
https://fcp-indi.github.io/
GNU Lesser General Public License v3.0

πŸ› Processing freezes when multiple cores are assigned #2135

Open Pneumaethylamine opened 1 month ago

Pneumaethylamine commented 1 month ago

Describe the bug

Hi, I have managed to run the default pipeline on a single subject using 1 GB of RAM and 1 core. Now I'd like to speed things up: my server has 800 GB of RAM and 38 cores, but when I try to run the pipeline with 2 or more cores, it runs for a few hours and then just stops at some point. There is no crash message and nothing in the log; it just stops doing anything.

This is how I initiate it:

apptainer run \
  --env FSLDIR=/usr/share/fsl/6.0 \
  --env FSL_DIR=/usr/share/fsl/6.0 \
  --env FSL_BIN=/usr/share/fsl/6.0/bin \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/out_minimal_fast:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  c-pac_latest.sif /bidsdata /output participant \
  --participant_label sub-56061 \
  --pipeline_file /pipeline_dir/pipeline_config_minimal_single.yml

and my yml is below.

I'm sorry I can't share my data, but the pipeline works fine when I change just the following options to all be 1:

max_cores_per_participant: 16

num_ants_threads: 8

num_OMP_threads: 8

The last few lines of the callback.log offer no clues; it seems to be well within its RAM limits, and when I check my server status it is not even close to having its limits pushed. There is a crash early in the process, something to do with not being able to lock pypeline.log (the pypeline.log.lock file), but this seems unrelated.

Occasionally it says "No resources available, potential deadlock" (see screenshot 1), but then it starts running again, presumably once other processes finish. As you can see in screenshot 2, there is no error message, but the process has been frozen for 4 hours.
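
One way to see where a stalled run is actually stuck is to dump the Python stacks of the hung process. A minimal sketch, assuming py-spy can be installed on the host; 12345 stands in for the real PID:

pip install py-spy
ps aux | grep cpac      # find the PID of the stalled process
py-spy dump --pid 12345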

Thank you very much for the help!

crash-20240722-124119-drummondmcculloch-resampled_T1w-brain-template-20e94696-d2d1-4d41-9a1a-8948c82559fd.txt callback.log pypeline.log

To reproduce

No response

Preconfig

Custom pipeline configuration

pipeline_config.docx

Run command

apptainer run \
  --env FSLDIR=/usr/share/fsl/6.0 \
  --env FSL_DIR=/usr/share/fsl/6.0 \
  --env FSL_BIN=/usr/share/fsl/6.0/bin \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/out_minimal_fast:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  c-pac_latest.sif /bidsdata /output participant \
  --participant_label sub-56061 \
  --pipeline_file /pipeline_dir/pipeline_config_minimal_single.yml

Expected behavior

The expected outputs, and for the pipeline to run without stopping.

Acceptance criteria

The pipeline runs to completion without stalling.

Screenshots

Screenshot 2024-07-22 125410 Screenshot 2024-07-22 174927

C-PAC version

v1.8.7.dev1

Container platform

No response

Docker and/or Singularity version(s)

Apptainer 1.1.6-bp155.2.18

Additional context

No response

Pneumaethylamine commented 1 month ago

Looking further into the error, it seems to occur during the nipype draw_gantt_chart function, which returns an error when two nodes share a start time, indicating that "Two processes started at exactly the same time"; this then freezes all other processes. I think this might be occurring because of the PermissionError not allowing the program to lock the pypeline log file. Not sure.

PermissionError: [Errno 13] Permission denied: '/output/log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01/pypeline.log.lock'

"https://nipype.readthedocs.io/en/latest/api/generated/nipype.utils.draw_gantt_chart.html"

tamsinrogers commented 1 month ago

Hi @Pneumaethylamine, thanks for reaching out.

We ran a similar pipeline config on our data and couldn't replicate this problem. Could you share some additional details about your data so we can continue troubleshooting? Knowing the size of the files you are working with, and what kinds of files (T1, T2, etc.), would be helpful - something like the result of tree --du -h in your terminal, if you are free to share that level of detail about the data.

Pneumaethylamine commented 1 month ago

Thanks for the fast response. I am just running a single test subject at the moment; the files look like this:

(base) drummondmcculloch@bohr:/mrdata/cpac_test/NeuroPharm2-P2/sub-56061> tree -h
.
└── [ 111]  ses-01
    ├── [4.0K]  anat
    │   ├── [1.9K]  sub-56061_ses-01_acq-mprage_rec-normdiscorr2d_T1w.json
    │   ├── [ 13M]  sub-56061_ses-01_acq-mprage_rec-normdiscorr2d_T1w.nii.gz
    │   ├── [2.0K]  sub-56061_ses-01_acq-mprage_rec-norm_T1w.json
    │   ├── [ 13M]  sub-56061_ses-01_acq-mprage_rec-norm_T1w.nii.gz
    │   ├── [1.9K]  sub-56061_ses-01_acq-space_rec-normdiscorr2d_T2w.json
    │   ├── [ 10M]  sub-56061_ses-01_acq-space_rec-normdiscorr2d_T2w.nii.gz
    │   ├── [2.0K]  sub-56061_ses-01_acq-space_rec-norm_T2w.json
    │   └── [ 11M]  sub-56061_ses-01_acq-space_rec-norm_T2w.nii.gz
    ├── [4.0K]  fmap
    │   ├── [2.3K]  sub-56061_ses-01_acq-greLR_magnitude1.json
    │   ├── [159K]  sub-56061_ses-01_acq-greLR_magnitude1.nii.gz
    │   ├── [2.3K]  sub-56061_ses-01_acq-greLR_magnitude2.json
    │   ├── [158K]  sub-56061_ses-01_acq-greLR_magnitude2.nii.gz
    │   ├── [2.3K]  sub-56061_ses-01_acq-greLR_phasediff.json
    │   └── [201K]  sub-56061_ses-01_acq-greLR_phasediff.nii.gz
    ├── [ 126]  func
    │   ├── [2.5K]  sub-56061_ses-01_task-rest_acq-ep2d_bold.json
    │   └── [ 37M]  sub-56061_ses-01_task-rest_acq-ep2d_bold.nii.gz
    ├── [ 110]  pet
    │   ├── [7.2K]  sub-56061_ses-01_trc-CIMBI36_pet.json
    │   └── [162M]  sub-56061_ses-01_trc-CIMBI36_pet.nii.gz
    └── [ 672]  sub-56061_ses-01_scans.tsv

I ran it again last night with 1 core and it went totally smoothly, though it also crashes if I try to run with the FreeSurfer options on, even with only a single core, i.e.:

surface_analysis:
  # Run freesurfer_abcd_preproc to obtain preprocessed T1w for reconall
  abcd_prefreesurfer_prep:
    run: On

  # Will run Freesurfer for surface-based analysis. Will output traditional Freesurfer derivatives.
  # If you wish to employ Freesurfer outputs for brain masking or tissue segmentation in the voxel-based pipeline,
  # select those 'Freesurfer-' labeled options further below in anatomical_preproc.
  freesurfer:
    run_reconall: On

    # Ingress freesurfer recon-all folder
    ingress_reconall: Off

    # Add extra arguments to recon-all command
    reconall_args:

  # Run ABCD-HCP post FreeSurfer and fMRISurface pipeline
  post_freesurfer:
    run: On
    subcortical_gray_labels: /opt/dcan-tools/pipeline/global/config/FreeSurferSubcorticalLabelTableLut.txt
    freesurfer_labels: /opt/dcan-tools/pipeline/global/config/FreeSurferAllLut.txt
    surf_atlas_dir: /opt/dcan-tools/pipeline/global/templates/standard_mesh_atlases
    gray_ordinates_dir: /opt/dcan-tools/pipeline/global/templates/Greyordinates
    gray_ordinates_res: 2
    high_res_mesh: 164
    low_res_mesh: 32
    fmri_res: 2
    smooth_fwhm: 2

  surface_connectivity:
    run: On
    surface_parcellation_template: /cpac_templates/Schaefer2018_200Parcels_17Networks_order.dlabel.nii

The error message is attached; perhaps it is a totally separate issue (in which case I can open another thread), but a bit of googling suggested it might be a resource issue.

crash-20240722-192830-drummondmcculloch-post_freesurfer_303.a0-af16faed-d224-4e2a-89ae-65c5feea3d3f.txt

Thank you very much!! Drummond

Pneumaethylamine commented 1 month ago

Small update: I tried with an open dataset (https://openneuro.org/datasets/ds000030/versions/00016/download) and got the same error.

tamsinrogers commented 1 month ago

Hi Drummond,

Thank you for providing the additional info. As next steps, can you try the following so we can better investigate what might be going on?

  1. Starting a new run with a clean working and output directory, using some arbitrary other number of threads.
  2. Turning off writing to log for now, avoiding the lock (a quick verification sketch follows the snippet below):

     pipeline_setup:
       log_directory:
         run_logging: Off
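
To confirm the setting took effect, you can grep the consolidated pipeline config that C-PAC writes into the log directory. A sketch; the path pattern matches the ls listing later in this thread:

grep -n -A2 'log_directory' /output/log/pipeline_*/sub-*/cpac_pipeline_config_*.yml
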
Pneumaethylamine commented 1 month ago

pipeline_config_single_base_fast_nolog.txt

apptainer run --containall \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/outputs/out_nolog:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  -B /mrdata/cpac_test/tmp:/tmp \
  c-pac_latest.sif /bidsdata /output participant \
  --participant_label sub-56061 \
  --pipeline_file /pipeline_dir/pipeline_config_single_base_fast_nolog.yml

I made a new output directory so there was nothing in the log or working dirs.

It crashed again with this yml with run_logging off.

Curiously, it still produced the log.lock errors (see below).

crash-20240729-115205-c-pac_user-resampled_T1w-template-symmetric-d777a902-1478-42c9-a59e-ac92d0f17aad.txt crash-20240729-115205-c-pac_user-resampled_T1w-brain-template-deriv-10c15264-b286-4bde-83b1-423daf02a80a.txt

And here is the output in my terminal from when it died; it had been idle for almost an hour when I hit Ctrl+C to kill it.

crash_outputs.txt

Pneumaethylamine commented 1 month ago

I am working with my server IT team to try and diagnose the issue. Could something about how MultiProc is set up on the server be causing the problem? Thanks again.
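
A bare multiprocessing smoke test inside the image might help rule that out. A minimal sketch; if this also hangs, the problem is the host's process/semaphore setup rather than C-PAC itself:

apptainer exec c-pac_latest.sif python -c "
import multiprocessing as mp
with mp.Pool(4) as pool:
    print(pool.map(abs, range(8)))
"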

Pneumaethylamine commented 1 month ago

Hi, is there anything else I can provide to help debug this? Thank you. Possibly unrelated, but when I set num_participant_to_run_at_once to >1, it still only runs one at a time.

tamsinrogers commented 1 month ago

Hi @Pneumaethylamine, could you provide the job scheduler you're using?

Pneumaethylamine commented 1 month ago

Hi Tamsin, we don't have a separate job scheduler; we just use the Linux kernel task scheduler. Thanks :)

tamsinrogers commented 1 month ago

Hi @Pneumaethylamine,

We've determined that this lock file is the root of the crash. It looks like you may not have write permission for that file. If you still have the outputs from the run you attached the crash log from, could you check the permissions in /mrdata/cpac_test/out_minimal_fast/log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01/ and see if anything looks strange?

Something else we can look into: is your system using SELinux, AppArmor, or something similar? A review of your Apptainer settings may also reveal some helpful information.
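
For reference, some quick host-side checks. A sketch; which of these commands exist varies by distro:

sudo aa-status | head   # AppArmor profile summary, if installed
sestatus                # SELinux status, if installed
getenforce              # SELinux enforcement mode, if installed
apptainer --version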

Pneumaethylamine commented 1 month ago

Hi Tamsin,

Thanks for getting to the root of it. Is there something I should be setting up regarding permissions before running apptainer? Below is my permissions overview for the log folder. I thought the permissions issue with the pypeline.log file was due to several processes trying to write to it at the same time, resulting in the log.lock error from before.

(base) drummondmcculloch@hevesy:/mrdata/cpac_test/outputs/out_minimal_fast/log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01> ls -l
total 9036
-rw-r--r-- 1 drummondmcculloch users  513438 Jul 22 13:08 callback.log
-r--r--r-- 1 drummondmcculloch users   14903 Jul 22 12:40 cpac_data_config_6c29c2e9_2024-07-22T12-40-48Z.yml
-r--r--r-- 1 drummondmcculloch users   14903 Jul 22 12:42 cpac_data_config_6c29c2e9_2024-07-22T12-42-06Z.yml
-r--r--r-- 1 drummondmcculloch users   12714 Jul 22 12:40 cpac_pipeline_config_6c29c2e9_2024-07-22T12-40-48Z_min.yml
-r--r--r-- 1 drummondmcculloch users   66311 Jul 22 12:40 cpac_pipeline_config_6c29c2e9_2024-07-22T12-40-48Z.yml
-r--r--r-- 1 drummondmcculloch users   12650 Jul 22 12:42 cpac_pipeline_config_6c29c2e9_2024-07-22T12-42-06Z_min.yml
-r--r--r-- 1 drummondmcculloch users   66312 Jul 22 12:42 cpac_pipeline_config_6c29c2e9_2024-07-22T12-42-06Z.yml
-rw-r--r-- 1 drummondmcculloch users 1206393 Jul 22 13:52 pypeline.log
-rw-r--r-- 1 drummondmcculloch users    7552 Jul 22 12:42 sub-56061_ses-01_expectedOutputs.yml
-rw-r--r-- 1 drummondmcculloch users    7552 Jul 22 12:41 sub-56061_ses-01_missingOutputs.yml
-rw-r--r-- 1 drummondmcculloch users 3650051 Jul 22 12:41 workflow_pre_2024-07-22_12-40-59_pipeline_cpac-test_custom_pipeline.json
-rw-r--r-- 1 drummondmcculloch users 3650051 Jul 22 12:42 workflow_pre_2024-07-22_12-42-17_pipeline_cpac-test_custom_pipeline.json

As for apptainer, I am using version 1.1.6-bp155.2.18 with no plugins installed and the following cloud service endpoints:

Cloud Services Endpoints

NAME           URI                  ACTIVE  GLOBAL  EXCLUSIVE  INSECURE
DefaultRemote  cloud.apptainer.org  YES     YES     NO         NO

Keyservers

URI                       GLOBAL  INSECURE  ORDER
https://keys.openpgp.org  YES     NO        1*

Are there other specific settings you would like me to check?

Thank you very much for your time helping with this, Drummond

Pneumaethylamine commented 1 month ago

We are running AppArmor and something called "lockdown". I know nothing about how these are set up, but I've messaged IT to try to learn more.

Pneumaethylamine commented 4 weeks ago

I tried disabling AppArmor and running it again, and it made no difference. We aren't running SELinux. Thanks again for your help.

tamsinrogers commented 4 weeks ago

Hi @Pneumaethylamine,

The lock is designed to "simply watch[…] the existence of the lock file" while the log is being written, to avoid multiple processes writing at the same time.

It seems that a process in the container is creating log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01/pypeline.log.lock, and the whole tree of log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01, but somehow doesn't then have write access to that file, despite you having write access to pypeline.log.
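
One way to test that directly is a write probe from inside the container with the same bind mount as the failing run. A sketch; .write_test is just a scratch file:

apptainer exec \
  -B /mrdata/cpac_test/out_minimal_fast:/output \
  c-pac_latest.sif \
  sh -c 'cd /output/log/pipeline_cpac-test_custom_pipeline/sub-56061_ses-01 && \
         touch .write_test && echo writable && rm .write_test'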

Since you've already tried disabling AppArmor, as a next step, can you try running C-PAC as root and then updating the output file permissions at the end (a sketch follows below)? Another thing you can try in the meantime is running some subjects single-threaded.
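
A sketch of that suggestion, reusing the paths from the earlier commands (out_root is a hypothetical fresh output directory):

sudo apptainer run --containall \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/outputs/out_root:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  c-pac_latest.sif /bidsdata /output participant \
  --participant_label sub-56061 \
  --pipeline_file /pipeline_dir/pipeline_config_single_base_fast_nolog.yml
# hand the outputs back to your user afterwards
sudo chown -R drummondmcculloch:users /mrdata/cpac_test/outputs/out_root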

Pneumaethylamine commented 3 weeks ago

Hi Tamsin,

I tried to run the command with "--fakeroot", i.e.:

apptainer run --fakeroot --containall \
  -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
  -B /mrdata/cpac_test/outputs/out_fakeroot:/output \
  -B /mrdata/cpac_test/configs:/pipeline_dir \
  -B /mrdata/cpac_test/tmp:/tmp \
  c-pac_latest.sif /bidsdata /output participant \
  --participant_label sub-56061 \
  --pipeline_file /pipeline_dir/pipeline_config_single_base_fast2.yml

I no longer got the crash about pypeline.log.lock, but the processing still stalled in the same place, as you can see in the pypeline.log file.

[screenshot]

I have been running subjects single-threaded, which works fine but is just very slow. I have written a shell script to open multiple instances simultaneously (sketch below), but this limits group-level analyses. I am also still unable to run surface-level analyses, and I wonder if this problem is related. Please let me know what other steps I can take to try to resolve this.
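
Roughly, the workaround script looks like this. A sketch: one single-threaded C-PAC instance per participant, run in parallel; the extra participant labels are hypothetical:

for sub in sub-56061 sub-56062 sub-56063; do
  apptainer run --containall \
    -B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
    -B "/mrdata/cpac_test/outputs/out_${sub}":/output \
    -B /mrdata/cpac_test/configs:/pipeline_dir \
    c-pac_latest.sif /bidsdata /output participant \
    --participant_label "${sub}" \
    --pipeline_file /pipeline_dir/pipeline_config_minimal_single.yml &
done
wait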

Thanks, Drummond

tamsinrogers commented 3 weeks ago

Hi @Pneumaethylamine ,

Can you try:

apptainer shell \
-B /mrdata/cpac_test/NeuroPharm2-P2:/bidsdata \
-B /mrdata/cpac_test/out_minimal_fast:/output \
-B /mrdata/cpac_test/configs:/pipeline_dir \
c-pac_latest.sif

to shell into a container? Then, in the container, you can run the command that's crashing:

bash /code/CPAC/surface/fMRISurface/run.sh \
  --post_freesurfer_folder /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/post_freesurfer_303 \
  --subject sub-56061_ses-01 \
  --fmri /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/warp_ts_to_T1template_287/_scan_rest_acq-ep2d/apply_warp_warp_ts_to_T1template_287/bandpassed_demeaned_filtered_warp_trans.nii.gz \
  --scout /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/warp_ts_to_T1template_287/_scan_rest_acq-ep2d/apply_warp_warp_ts_to_T1template_287/bandpassed_demeaned_filtered_warp_trans.nii.gz \
  --lowresmesh 32 --grayordinatesres 2 --fmrires 2 --smoothingFWHM 2

Hopefully that way, we can see more specifically what's going wrong.

Pneumaethylamine commented 3 weeks ago

Hi Tamsin,

Strange thing: after I posted that screenshot of it hanging at 12:12:35, it briefly restarted around 9 hours later, ran for 90 seconds, and then hung again, now in a different place. It has now been stuck for 36 hours, so it probably won't restart again.

[screenshot]

We have checked the server logs and there's nothing to suggest any resource limitations or that any process was stopped/started at this time.

When I run multi-threaded with surface analyses on, it hangs at a different point:

[screenshot]

Finally, when I run surface analyses single-threaded, it dies in the same place highlighted above, and when I then shell in and run the appropriate command, I get the attached error.

surface_single_error.txt

There seem to be two problems:

1: The atlas file and the functional data have different voxel dimensions, as you can see in this image, which explains the error; also, the FoV of the functional image doesn't extend down to the brainstem.

While running:

wb_command -cifti-create-dense-timeseries \
  /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/post_freesurfer_282/MNINonLinear/Results/task-rest01/task-rest01_temp_subject.dtseries.nii \
  -volume /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/post_freesurfer_282/MNINonLinear/Results/task-rest01/task-rest01.nii.gz \
  /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/post_freesurfer_282/MNINonLinear/ROIs/Atlas_ROIs.2.nii.gz

ERROR: label volume has a different volume space than data volume

[screenshot]

2: The file "task-rest01_AtlasSubcortical_s2.nii.gz" doesn't exist, presumably because it wasn't created in the previous step.
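
A quick way to confirm the space mismatch in problem 1 is to compare the two volumes' headers. A sketch using fslhd from FSL (wb_command -file-information shows the same information):

cd /output/working/pipeline_cpac-test_custom_pipeline/cpac_sub-56061_ses-01/post_freesurfer_282/MNINonLinear
fslhd Results/task-rest01/task-rest01.nii.gz | grep -E '^(dim[1-3]|pixdim[1-3]|sto_xyz)'
fslhd ROIs/Atlas_ROIs.2.nii.gz | grep -E '^(dim[1-3]|pixdim[1-3]|sto_xyz)'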

I have attached the .yml I am using for the single-threaded surface analyses: surface_full_keep_working_single_txt.txt

I hope this is somehow clarifying! Thanks again for the help.

Pneumaethylamine commented 3 weeks ago

When I kill the hanging processes with Ctrl+C, they output the following text, in case that is useful. Thanks again.

ctrlC_base_multi.txt ctrlC_surface_multi.txt

Pneumaethylamine commented 1 week ago

Hi, any update on this? Thanks.

tamsinrogers commented 1 week ago

Hi @Pneumaethylamine, thank you for the additional information! We are replicating on our end and will hopefully have some more clarification for you soon.

Not sure how helpful this will be as you are mid-run, but for the future: using pre-computed FreeSurfer directories, as opposed to running FreeSurfer in C-PAC, will usually result in fewer debugging issues.

Will get back to you as soon as possible!