very slow start time when parallelizing

Remi-Gau commented 9 months ago

Testing it on our large test nodes, the commands seem to work quite well for a single subject, and I would like to parallelize them to process my entire study. Participants each have around 30 sessions. Attempting to parallelize across subjects on our GPU cluster fails: the jobs keep getting killed for running out of memory. bidsMReye also takes an extremely long time to start, on the order of several hours before processing begins.

#!/bin/bash -l

#SBATCH --job-name=[bidsmreye]
#SBATCH -o log/bidsmreye_%a.txt
#SBATCH -e log/bidsmreye_%a.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=8G
#SBATCH --account=DBIC
#SBATCH --partition=gpuq
#SBATCH --gres=gpu:2
#SBATCH --time=7-01:00:00
#SBATCH --mail-type=FAIL,END
#SBATCH --requeue
#SBATCH --array=0-11

# Output and error log directories
output_log_dir="log"
error_log_dir="log"

# Create the directories if they don't exist
mkdir -p "$output_log_dir"
mkdir -p "$error_log_dir"

# Must run on a GPU node
module load cuda
module load TensorRT
nvidia-smi
echo $CUDA_VISIBLE_DEVICES
hostname

# bidsmreye requires input fmridata (fmriprep outputs) to be at least realigned
# Filenames and structure that conforms to a BIDS derivative dataset

# Had to add these lines to initialize conda
conda init bash
source ~/.bashrc
conda activate deepmreye

# Check if SLURM_ARRAY_TASK_ID is not set or is empty
if [ -z "$SLURM_ARRAY_TASK_ID" ]; then
    # Set SLURM_ARRAY_TASK_ID to a default value (0, i.e. the first subject)
    SLURM_ARRAY_TASK_ID=0
fi

bids_dir="/dartfs-hpc/rc/lab/C/CANlab/labdata/data/WASABI/derivatives/fmriprep-try2"
output_dir="/dartfs-hpc/rc/lab/C/CANlab/labdata/data/WASABI/derivatives/deepmreye"
SUBJECTS=(SID000002 SID000743 SID001567 SID001651 SID001804 SID001907 SID001641 SID001684 SID001852 SID002035 SID002263 SID002328)
SUBJ=${SUBJECTS[$SLURM_ARRAY_TASK_ID]}
echo "processing bidsmreye for ${SUBJ}..."

# Prepare the data, then compute the eye movements (action prepare; action generalize)
# Prepare: registers the data to MNI if this is not the case already, registers the data to the deepmreye template, extracts data from the eyes mask
bidsmreye --action all \
    ${bids_dir} \
    ${output_dir} \
    participant --participant_label ${SUBJ} 

# Group Level Summary
bidsmreye --action qc \
    ${bids_dir} \
    ${output_dir} \
    participant --participant_label ${SUBJ} 

echo "processing complete"
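
For reference when reading the out-of-memory kills, the memory limit each array task runs under follows from the header above: `--mem-per-cpu` is multiplied by the number of allocated CPUs. The sketch below only redoes that arithmetic with the values from the script. It assumes Slurm allocates exactly the requested 8 CPUs per task and says nothing about how much memory bidsmreye actually needs for a subject with around 30 sessions.

```shell
#!/bin/bash
# Values copied from the #SBATCH header above.
cpus_per_task=8      # --cpus-per-task=8
mem_per_cpu_gb=8     # --mem-per-cpu=8G
array_tasks=12       # --array=0-11 (one task per subject)

# Each array task is its own job, so its limit is CPUs x memory per CPU.
mem_per_task_gb=$(( cpus_per_task * mem_per_cpu_gb ))

# Upper bound on the total request if all array tasks run at the same time.
mem_all_tasks_gb=$(( mem_per_task_gb * array_tasks ))

echo "memory limit per array task: ${mem_per_task_gb}G"
echo "memory requested if all tasks run at once: ${mem_all_tasks_gb}G"
```

A task that is killed under this limit is using more than 64G for a single subject, which is the figure to compare against the memory use observed in the single-subject runs that worked.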

Michael-Sun commented 9 months ago

To further clarify this issue, this occurs when using the conda environment installed bidsmreye. The following messages appear before processing begins:

2023-09-18 12:41:18.717612: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-18 12:41:25.070354: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
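
In TensorFlow's log format, the single letter after the timestamp is the severity (I for info, W for warning, E for error, F for fatal), so the two messages above are one info line and one warning, not errors. A minimal sketch for pulling the severity letters out of such a log follows. The sample text is copied from the messages above with the first message shortened, and the `sed` pattern is an assumption based on the timestamp layout seen here.

```shell
#!/bin/bash
# Two sample lines taken from the messages above (first message text shortened).
log='2023-09-18 12:41:18.717612: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions
2023-09-18 12:41:25.070354: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT'

# Print the severity letter that follows "<date> <time>: " on each matching line.
severities=$(printf '%s\n' "$log" | sed -E -n 's/^[0-9-]+ [0-9:.]+: ([IWEF]) .*/\1/p')

echo "$severities"
```

Running the same `sed` command over `log/bidsmreye_*.err` would show whether any E or F lines appear before the job is killed.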

cpp-lln-lab / bidsMReye

very slow start time when parallelizing #167