PennLINC / xcp_d

Post-processing of fMRIPrep, NiBabies, and HCP outputs
https://xcp-d.readthedocs.io
BSD 3-Clause "New" or "Revised" License

XCP-D v0.10.0rc1 and running into crashes on the HPC but not my local linux box #1288

Closed: dkp closed this issue 1 month ago

dkp commented 1 month ago

Summary

I am trying XCP-D v0.10.0rc1 and running into crashes on the HPC but not on my local Linux box. The v0.8.3 image (xcp-d_v0.8.3.sif) runs without issue in both environments on this same fMRIPrep dataset (though 0.8.3 requires --file-format cifti --warp-surfaces-native2std).

See the attached Slurm log and crash report.

Additional details

Input data:

fMRIPrep 24.1.1, run as follows:

apptainer run --cleanenv \
  --bind ${MRIS}/data:/data:ro \
  --bind ${APP_DERIV_DIR}:/outputs \
  --bind ${WORK_DIR}:/work \
  ${APP} /data /outputs participant \
  --participant_label ${Subject} \
  --fs-license-file ${HOME}/license.txt \
  -w /work \
  --stop-on-first-crash \
  --ignore slicetiming \
  --cifti-output 91k \
  --output-spaces fsLR fsnative fsaverage MNI152NLin6Asym:res-2

What were you trying to do?

XCP-D command (same on both systems):

# Minimal linc run for XCPD-10.1

apptainer run --cleanenv \
  --bind ${FMRIPREP_DERIV_DIR}:/fmriprep:ro \
  --bind ${WORK_DIR}:/work \
  --bind ${XCPD_DERIV_DIR}:/out \
  ${APP} /fmriprep /out participant \
  --participant_label ${Subject} \
  --fs-license-file ${HOME}/license.txt \
  --mode linc \
  --stop-on-first-crash \
  --head_radius 50 \
  -w /work

What did you expect to happen?

I expected the 0.10.0 pipeline to run in both environments, just like the 0.8.3 pipeline before it.

What actually happened?

The 0.10.0 pipeline ran correctly on the local Linux box but failed on the HPC with the same call and the same data.

Reproducing the bug

This seems to be specific to some interaction with the HPC that changed between XCP-D version 0.8.3 and version 0.10.0. I have not tested intermediate versions.

crash-20241013-113252-dkp-surface_sphere_project_unproject-636e1e22-8be1-435f-b46f-0e610f0d122a.txt

slurm-xcpdfail.txt

mattcieslak commented 1 month ago

Hi @dkp!

I ran into this exact same issue with QSIPrep a while back. It's the ABI tags in the libQt5 library. They produce a very tricky error message saying the library isn't there when it is; the host system just can't load it because of those tags.
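For anyone hitting the same error, here is a minimal sketch of how one might check this, assuming a glibc-based host with GNU binutils' `readelf` and GNU `sort -V` available. The `.note.ABI-tag` ELF note records the oldest Linux kernel an object was built for, and a container shares the host's kernel, so a library whose tag asks for a newer kernel than the HPC node runs may fail to load even though the file exists. The thread does not report either machine's kernel version, and `abi_tag_version` and `kernel_satisfies` are hypothetical helpers, not part of XCP-D or Apptainer.

```shell
#!/usr/bin/env bash
# Sketch: does a shared library's .note.ABI-tag demand a newer kernel than the host runs?
# Assumes GNU binutils (readelf) and GNU coreutils (sort -V). Helper names are hypothetical.

# Read `readelf -n` output on stdin and print the minimum kernel version recorded
# in the NT_GNU_ABI_TAG note (e.g. "3.17.0"); prints nothing if there is no such note.
abi_tag_version() {
  sed -n 's/.*OS: Linux, ABI: \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Succeed (status 0) if kernel release $2 is at least the required version $1.
kernel_satisfies() {
  local required=$1 kernel=${2%%-*}   # drop distro suffixes such as "-1160.el7.x86_64"
  # sort -V orders version strings; if the smaller one is $required, the kernel is new enough.
  [ "$(printf '%s\n' "$required" "$kernel" | sort -V | head -n 1)" = "$required" ]
}

# Usage (the library path is a placeholder):
# lib=/path/to/libQt5Core.so.5
# required=$(readelf -n "$lib" | abi_tag_version)
# if [ -n "$required" ] && ! kernel_satisfies "$required" "$(uname -r)"; then
#   echo "$lib asks for kernel >= $required, but the host runs $(uname -r)"
# fi
```

Running the check both inside the container on the HPC node and on the local box would show whether the two kernels fall on opposite sides of the library's requirement.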

@tsalo, here is where the tags get stripped out. Does this happen in the XCP-D build?
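The stripping step itself is not shown in this thread, so the following is only a hedged sketch of one common way to do it, not necessarily the exact command used in the QSIPrep or XCP-D builds: GNU binutils' `strip --remove-section=.note.ABI-tag` deletes the note from a library, for example during an image build, after which the loader has no kernel requirement to compare against. `strip_abi_tag` is a hypothetical helper, and the path in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: delete the .note.ABI-tag section from a shared library.
# Assumes GNU binutils (strip, readelf). strip_abi_tag is a hypothetical helper.

strip_abi_tag() {
  local lib=$1
  # Rewrites $lib in place, removing the named section.
  strip --remove-section=.note.ABI-tag "$lib" || return 1
  # Confirm the section is gone: readelf -SW lists section headers without truncating names.
  if readelf -SW "$lib" | grep -q '\.note\.ABI-tag'; then
    echo "ABI tag still present in $lib" >&2
    return 1
  fi
}

# Usage, e.g. as a container build step (placeholder path):
# strip_abi_tag /path/to/libQt5Core.so.5
```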

tsalo commented 1 month ago

It does not, but I can add it. Thanks!

tsalo commented 1 month ago

I just merged #1293, which will hopefully fix the problem. @dkp, once pennlinc/xcp_d:unstable updates on Docker Hub (which should happen in ~2 hours), would you be willing to try it out on your HPC?
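On an HPC that runs containers through Apptainer, one way to try the updated `unstable` tag is to build a fresh local SIF from Docker Hub and point the existing `${APP}` variable at it. This is a minimal sketch; `pull_xcpd_unstable` and its default output filename are hypothetical choices.

```shell
#!/usr/bin/env bash
# Sketch: build a local SIF from the pennlinc/xcp_d:unstable Docker Hub image.
# pull_xcpd_unstable and the default filename are hypothetical.

pull_xcpd_unstable() {
  local sif=${1:-xcp_d-unstable.sif}
  # --force overwrites an existing SIF of the same name, so re-pulling after the
  # Docker Hub tag updates replaces a stale image instead of failing.
  apptainer pull --force "$sif" docker://pennlinc/xcp_d:unstable
}

# Usage: pull, then reuse the same run command with the new image.
# pull_xcpd_unstable xcp_d-unstable.sif && APP=$PWD/xcp_d-unstable.sif
```

Using an empty work directory for the retry avoids reusing intermediate results cached from the failed run.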

dkp commented 1 month ago

Thank you! I will try it ASAP (hopefully today) and will let you know as soon as I have results.

dkp commented 1 month ago

I ran it a couple of ways; the most recent was a clean run with no work directory and no previous derivatives.
It worked! From a quick skim, the output looks appropriate and complete, and Slurm reports success. Yay!! Thank you.