JaneliaSciComp / multifish

EASI-FISH analysis pipeline for spatial transcriptomics
BSD 3-Clause "New" or "Revised" License

LHA3_R5_tiny Stitching cannot find .sessionID on proprietary cluster #30

Closed: Alex-de-Lecea closed this issue 1 year ago

Alex-de-Lecea commented 1 year ago

Bug report

Description of the problem

I am running the pipeline with the sample data provided in the paper (LHA3_R5_tiny) on my institution's Klone HYAK cluster, and I am encountering issues during the stitching portion of the pipeline. I would appreciate any feedback on this issue.

Log file(s)

Error message

[- ] process > assign_spots - ERROR ~ Error executing process > 'stitching:stitch:run_czi2n5:spark_start_app (1)'

Caused by: Process stitching:stitch:run_czi2n5:spark_start_app (1) terminated with an error exit status (1)

Command executed:

echo "Starting the spark driver"

SESSION_FILE="/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/.sessionId" echo "Checking for $SESSION_FILE" SLEEP_SECS=10 MAX_WAIT_SECS=1000 SECONDS=0

while ! test -e "$SESSION_FILE"; do sleep ${SLEEP_SECS} if (( ${SECONDS} < ${MAX_WAIT_SECS} )); then echo "Waiting for $SESSION_FILE" SECONDS=$(( ${SECONDS} + ${SLEEP_SECS} )) else echo "-------------------------------------------------------------------------------" echo "ERROR: Timed out after ${SECONDS} seconds while waiting for $SESSION_FILE " echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster " echo "-------------------------------------------------------------------------------" exit 1 fi done

if ! grep -F -x -q "5f44f17a-5b05-47a7-b1a7-c6aeea9a7b53" $SESSION_FILE then echo "------------------------------------------------------------------------------" echo "ERROR: session id in $SESSION_FILE does not match current session " echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster" echo "and that you are not running multiple pipelines with the same --spark_work_dir" echo "------------------------------------------------------------------------------" exit 1 fi

export SPARK_ENV_LOADED= export SPARK_HOME=/spark export PYSPARK_PYTHONPATH_SET= export PYTHONPATH="/spark/python" export SPARK_LOG_DIR="/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny"

. "/spark/sbin/spark-config.sh" . "/spark/bin/load-spark-env.sh"

SPARK_LOCAL_IP=hostname -i | rev | cut -d' ' -f1 | rev echo "Use Spark IP: $SPARK_LOCAL_IP"

echo " /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/spark-defaults.conf --conf spark.driver.host=${SPARK_LOCAL_IP} --conf spark.driver.bindAddress=${SPARK_LOCAL_IP} --master spark://10.64.77.44:7077 --class org.janelia.stitching.ConvertCZITilesToN5Spark --conf spark.executor.cores=8 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=32 --executor-memory 32g --conf spark.driver.cores=1 --driver-memory 2g /app/app.jar -i /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/tiles.json --blockSize '128,128,64' "

/spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/spark-defaults.conf --conf spark.driver.host=${SPARK_LOCAL_IP} --conf spark.driver.bindAddress=${SPARK_LOCAL_IP} --master spark://10.64.77.44:7077 --class org.janelia.stitching.ConvertCZITilesToN5Spark --conf spark.executor.cores=8 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=32 --executor-memory 32g --conf spark.driver.cores=1 --driver-memory 2g /app/app.jar -i /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/tiles.json --blockSize '128,128,64' &> /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/czi2n5.log

Command exit status: 1

Command output:

Starting the spark driver
Checking for /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/.sessionId

executor > local (15)
[77/6cb27a] process > download (1) [100%] 1 of 1 ✔
[54/44824e] process > stitching:prepare_stitching_data (1) [100%] 1 of 1 ✔
[a1/f188e2] process > stitching:stitch:spark_cluster:prepare_spark_work_dir (1) [100%] 1 of 1 ✔
[- ] process > stitching:stitch:spark_cluster:spark_master (1) -
[b0/b9370f] process > stitching:stitch:spark_cluster:wait_for_master (1) [100%] 1 of 1 ✔
[- ] process > stitching:stitch:spark_cluster:spark_worker (2) -
[42/c32a32] process > stitching:stitch:spark_cluster:wait_for_worker (1) [100%] 4 of 4 ✔
[6f/999881] process > stitching:stitch:run_parse_czi_tiles:spark_start_app (1) [100%] 1 of 1 ✔
[ea/79bc41] process > stitching:stitch:run_czi2n5:spark_start_app (1) [100%] 1 of 1, failed: 1 ✘
[- ] process > stitching:stitch:run_flatfield_correction:spark_start_app -
[- ] process > stitching:stitch:run_retile:spark_start_app -
[- ] process > stitching:stitch:run_stitching:spark_start_app -
[- ] process > stitching:stitch:run_fuse:spark_start_app -
[- ] process > stitching:stitch:terminate_stitching -
[- ] process > spot_extraction:airlocalize:cut_tiles -
[- ] process > spot_extraction:airlocalize:run_airlocalize -
[- ] process > spot_extraction:airlocalize:merge_points -
[- ] process > segmentation:predict -
[- ] process > registration:cut_tiles -
[- ] process > registration:fixed_coarse_spots -
[- ] process > registration:moving_coarse_spots -
[- ] process > registration:coarse_ransac -
[- ] process > registration:apply_transform_at_aff_scale -
[- ] process > registration:apply_transform_at_def_scale -
[- ] process > registration:fixed_spots -
[- ] process > registration:moving_spots -
[- ] process > registration:ransac_for_tile -
[- ] process > registration:interpolate_affines -
[- ] process > registration:deform -
[- ] process > registration:stitch -
[- ] process > registration:final_transform -
[- ] process > collect_merge_points:collect_merged_points_files -
[- ] process > warp_spots:apply_transform -
[- ] process > measure_intensities -
[- ] process > assign_spots -
Pulling Singularity image docker://public.ecr.aws/janeliascicomp/multifish/segmentation:1.0.0 [cache /mmfs1/home/alexidl/.singularity_cache/public.ecr.aws-janeliascicomp-multifish-segmentation-1.0.0.img]
Pulling Singularity image docker://public.ecr.aws/janeliascicomp/multifish/spot_extraction:1.1.0 [cache /mmfs1/home/alexidl/.singularity_cache/public.ecr.aws-janeliascicomp-multifish-spot_extraction-1.1.0.img]
ERROR ~ Error executing process > 'stitching:stitch:run_czi2n5:spark_start_app (1)'

Caused by: Process stitching:stitch:run_czi2n5:spark_start_app (1) terminated with an error exit status (1)

Command executed:

echo "Starting the spark driver"

SESSION_FILE="/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/.sessionId" echo "Checking for $SESSION_FILE" SLEEP_SECS=10 MAX_WAIT_SECS=1000 SECONDS=0

while ! test -e "$SESSION_FILE"; do sleep ${SLEEP_SECS} if (( ${SECONDS} < ${MAX_WAIT_SECS} )); then echo "Waiting for $SESSION_FILE" SECONDS=$(( ${SECONDS} + ${SLEEP_SECS} )) else echo "-------------------------------------------------------------------------------" echo "ERROR: Timed out after ${SECONDS} seconds while waiting for $SESSION_FILE " echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster " echo "-------------------------------------------------------------------------------" exit 1 fi done

if ! grep -F -x -q "5f44f17a-5b05-47a7-b1a7-c6aeea9a7b53" $SESSION_FILE then echo "------------------------------------------------------------------------------" echo "ERROR: session id in $SESSION_FILE does not match current session " echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster" echo "and that you are not running multiple pipelines with the same --spark_work_dir" echo "------------------------------------------------------------------------------" exit 1 fi

export SPARK_ENV_LOADED= export SPARK_HOME=/spark export PYSPARK_PYTHONPATH_SET= export PYTHONPATH="/spark/python" export SPARK_LOG_DIR="/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny"

. "/spark/sbin/spark-config.sh" . "/spark/bin/load-spark-env.sh"

SPARK_LOCAL_IP=hostname -i | rev | cut -d' ' -f1 | rev echo "Use Spark IP: $SPARK_LOCAL_IP"

echo " /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/spark-defaults.conf --conf spark.driver.host=${SPARK_LOCAL_IP} --conf spark.driver.bindAddress=${SPARK_LOCAL_IP} --master spark://10.64.77.44:7077 --class org.janelia.stitching.ConvertCZITilesToN5Spark --conf spark.executor.cores=8 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=32 --executor-memory 32g --conf spark.driver.cores=1 --driver-memory 2g /app/app.jar -i /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/tiles.json --blockSize '128,128,64' "

/spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/spark-defaults.conf --conf spark.driver.host=${SPARK_LOCAL_IP} --conf spark.driver.bindAddress=${SPARK_LOCAL_IP} --master spark://10.64.77.44:7077 --class org.janelia.stitching.ConvertCZITilesToN5Spark --conf spark.executor.cores=8 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=32 --executor-memory 32g --conf spark.driver.cores=1 --driver-memory 2g /app/app.jar -i /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/tiles.json --blockSize '128,128,64' &> /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/czi2n5.log

Command exit status: 1

Command output:

Starting the spark driver
Checking for /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/.sessionId
Use Spark IP: 10.64.77.44
/spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/spark-defaults.conf --conf spark.driver.host=10.64.77.44 --conf spark.driver.bindAddress=10.64.77.44 --master spark://10.64.77.44:7077 --class org.janelia.stitching.ConvertCZITilesToN5Spark --conf spark.executor.cores=8 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=32 --executor-memory 32g --conf spark.driver.cores=1 --driver-memory 2g /app/app.jar -i /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/tiles.json --blockSize '128,128,64'

Command error:

INFO: Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (429) bind mounts
Starting the spark driver
Checking for /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/.sessionId
Use Spark IP: 10.64.77.44
/spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/spark-defaults.conf --conf spark.driver.host=10.64.77.44 --conf spark.driver.bindAddress=10.64.77.44 --master spark://10.64.77.44:7077 --class org.janelia.stitching.ConvertCZITilesToN5Spark --conf spark.executor.cores=8 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=32 --executor-memory 32g --conf spark.driver.cores=1 --driver-memory 2g /app/app.jar -i /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/tiles.json --blockSize '128,128,64'

Work dir: /mmfs1/gscratch/scrubbed/alexidl/multifish/work/ea/79bc41d1f1b43abfa1b9690085a181

Tip: view the complete command output by changing to the process work dir and entering the command cat .command.out

-- Check '.nextflow.log' file for details

czi2n5.log

23/04/13 18:33:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Converting tiles to N5... Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 23/04/13 18:33:35 INFO SparkContext: Running Spark version 3.0.1 23/04/13 18:33:35 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN). 23/04/13 18:33:35 INFO ResourceUtils: ============================================================== 23/04/13 18:33:35 INFO ResourceUtils: Resources for spark.driver:

23/04/13 18:33:35 INFO ResourceUtils: ============================================================== 23/04/13 18:33:35 INFO SparkContext: Submitted application: ConvertCZITilesToN5Spark 23/04/13 18:33:35 INFO SecurityManager: Changing view acls to: alexidl 23/04/13 18:33:36 INFO SecurityManager: Changing modify acls to: alexidl 23/04/13 18:33:36 INFO SecurityManager: Changing view acls groups to: 23/04/13 18:33:36 INFO SecurityManager: Changing modify acls groups to: 23/04/13 18:33:36 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(alexidl); groups with view permissions: Set(); users with modify permissions: Set(alexidl); groups with modify permissions: Set() 23/04/13 18:33:36 INFO Utils: Successfully started service 'sparkDriver' on port 40731. 23/04/13 18:33:36 INFO SparkEnv: Registering MapOutputTracker 23/04/13 18:33:36 INFO SparkEnv: Registering BlockManagerMaster 23/04/13 18:33:36 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 23/04/13 18:33:36 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 23/04/13 18:33:36 INFO SparkEnv: Registering BlockManagerMasterHeartbeat 23/04/13 18:33:36 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-e7c9f7a7-6d19-4059-9fcc-4f15dc408364 23/04/13 18:33:36 INFO MemoryStore: MemoryStore started with capacity 997.8 MiB 23/04/13 18:33:36 INFO SparkEnv: Registering OutputCommitCoordinator 23/04/13 18:33:36 INFO Utils: Successfully started service 'SparkUI' on port 4040. 23/04/13 18:33:36 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.64.77.44:4040 23/04/13 18:33:36 INFO SparkContext: Added JAR file:/app/app.jar at spark://10.64.77.44:40731/jars/app.jar with timestamp 1681410816673 23/04/13 18:33:36 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://10.64.77.44:7077... 
23/04/13 18:33:36 INFO TransportClientFactory: Successfully created connection to /10.64.77.44:7077 after 26 ms (0 ms spent in bootstraps) 23/04/13 18:33:36 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20230413183336-0001 23/04/13 18:33:36 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20230413183336-0001/0 on worker-20230413183309-10.64.77.44-40827 (10.64.77.44:40827) with 8 core(s) 23/04/13 18:33:36 INFO StandaloneSchedulerBackend: Granted executor ID app-20230413183336-0001/0 on hostPort 10.64.77.44:40827 with 8 core(s), 32.0 GiB RAM 23/04/13 18:33:36 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20230413183336-0001/1 on worker-20230413183309-10.64.77.44-39913 (10.64.77.44:39913) with 8 core(s) 23/04/13 18:33:36 INFO StandaloneSchedulerBackend: Granted executor ID app-20230413183336-0001/1 on hostPort 10.64.77.44:39913 with 8 core(s), 32.0 GiB RAM 23/04/13 18:33:36 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20230413183336-0001/2 on worker-20230413183309-10.64.77.44-43963 (10.64.77.44:43963) with 8 core(s) 23/04/13 18:33:36 INFO StandaloneSchedulerBackend: Granted executor ID app-20230413183336-0001/2 on hostPort 10.64.77.44:43963 with 8 core(s), 32.0 GiB RAM 23/04/13 18:33:36 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20230413183336-0001/3 on worker-20230413183309-10.64.77.44-42821 (10.64.77.44:42821) with 8 core(s) 23/04/13 18:33:36 INFO StandaloneSchedulerBackend: Granted executor ID app-20230413183336-0001/3 on hostPort 10.64.77.44:42821 with 8 core(s), 32.0 GiB RAM 23/04/13 18:33:37 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34801. 23/04/13 18:33:37 INFO NettyBlockTransferService: Server created on 10.64.77.44:34801 23/04/13 18:33:37 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 23/04/13 18:33:37 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.64.77.44, 34801, None) 23/04/13 18:33:37 INFO BlockManagerMasterEndpoint: Registering block manager 10.64.77.44:34801 with 997.8 MiB RAM, BlockManagerId(driver, 10.64.77.44, 34801, None) 23/04/13 18:33:37 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.64.77.44, 34801, None) 23/04/13 18:33:37 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.64.77.44, 34801, None) 23/04/13 18:33:37 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20230413183336-0001/2 is now RUNNING 23/04/13 18:33:37 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20230413183336-0001/3 is now RUNNING 23/04/13 18:33:37 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20230413183336-0001/0 is now RUNNING 23/04/13 18:33:37 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20230413183336-0001/1 is now RUNNING 23/04/13 18:33:37 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 Detected 2 channels 23/04/13 18:33:37 INFO SparkContext: Starting job: foreach at ConvertCZITilesToN5Spark.java:210 23/04/13 18:33:37 INFO DAGScheduler: Got job 0 (foreach at ConvertCZITilesToN5Spark.java:210) with 12 output partitions 23/04/13 18:33:37 INFO DAGScheduler: Final stage: ResultStage 0 (foreach at ConvertCZITilesToN5Spark.java:210) 23/04/13 18:33:37 INFO DAGScheduler: Parents of final stage: List() 23/04/13 18:33:37 INFO DAGScheduler: Missing parents: List() 23/04/13 
18:33:37 INFO DAGScheduler: Submitting ResultStage 0 (ParallelCollectionRDD[0] at parallelize at ConvertCZITilesToN5Spark.java:209), which has no missing parents 23/04/13 18:33:37 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.2 KiB, free 997.8 MiB) 23/04/13 18:33:38 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1943.0 B, free 997.8 MiB) 23/04/13 18:33:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.64.77.44:34801 (size: 1943.0 B, free: 997.8 MiB) 23/04/13 18:33:38 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1223 23/04/13 18:33:38 INFO DAGScheduler: Submitting 12 missing tasks from ResultStage 0 (ParallelCollectionRDD[0] at parallelize at ConvertCZITilesToN5Spark.java:209) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)) 23/04/13 18:33:38 INFO TaskSchedulerImpl: Adding task set 0.0 with 12 tasks 23/04/13 18:33:38 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 8, script: , vendor: , memory -> name: memory, amount: 32768, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) 23/04/13 18:33:38 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.64.77.44:60916) with ID 2 23/04/13 18:33:38 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.64.77.44:60914) with ID 3 23/04/13 18:33:38 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.64.77.44:60920) with ID 0 23/04/13 18:33:38 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.64.77.44:60918) with ID 1 23/04/13 18:33:38 INFO BlockManagerMasterEndpoint: Registering block manager 10.64.77.44:44029 with 16.9 GiB RAM, BlockManagerId(2, 10.64.77.44, 44029, None) 23/04/13 18:33:38 INFO BlockManagerMasterEndpoint: Registering block manager 10.64.77.44:35243 with 16.9 GiB RAM, BlockManagerId(3, 10.64.77.44, 35243, None) 23/04/13 18:33:38 INFO BlockManagerMasterEndpoint: Registering block manager 10.64.77.44:42331 with 16.9 GiB RAM, BlockManagerId(1, 10.64.77.44, 42331, None) 23/04/13 18:33:38 INFO BlockManagerMasterEndpoint: Registering block manager 10.64.77.44:40415 with 16.9 GiB RAM, BlockManagerId(0, 10.64.77.44, 40415, None) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.64.77.44, executor 2, partition 0, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.64.77.44, executor 2, partition 1, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.64.77.44, executor 2, partition 2, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 10.64.77.44, executor 2, partition 3, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 10.64.77.44, executor 2, partition 4, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 10.64.77.44, executor 2, partition 5, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, 10.64.77.44, executor 2, partition 6, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:39 INFO TaskSetManager: 
Starting task 7.0 in stage 0.0 (TID 7, 10.64.77.44, executor 2, partition 7, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, 10.64.77.44, executor 1, partition 8, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, 10.64.77.44, executor 1, partition 9, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 10.0 in stage 0.0 (TID 10, 10.64.77.44, executor 1, partition 10, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 11.0 in stage 0.0 (TID 11, 10.64.77.44, executor 1, partition 11, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.64.77.44:44029 (size: 1943.0 B, free: 16.9 GiB) 23/04/13 18:33:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.64.77.44:42331 (size: 1943.0 B, free: 16.9 GiB) 23/04/13 18:33:43 WARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8, 10.64.77.44, executor 1): java.lang.RuntimeException: Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1 at org.janelia.stitching.ConvertCZITilesToN5Spark.lambda$convertTilesToN5$78a89b37$1(ConvertCZITilesToN5Spark.java:229) at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1(JavaRDDLike.scala:351) at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1$adapted(JavaRDDLike.scala:351) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:986) at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:986) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2139) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)

23/04/13 18:33:43 INFO TaskSetManager: Lost task 11.0 in stage 0.0 (TID 11) on 10.64.77.44, executor 1: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 1] 23/04/13 18:33:43 INFO TaskSetManager: Lost task 9.0 in stage 0.0 (TID 9) on 10.64.77.44, executor 1: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 2] 23/04/13 18:33:43 INFO TaskSetManager: Lost task 10.0 in stage 0.0 (TID 10) on 10.64.77.44, executor 1: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 3] 23/04/13 18:33:43 INFO TaskSetManager: Starting task 10.1 in stage 0.0 (TID 12, 10.64.77.44, executor 0, partition 10, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Starting task 9.1 in stage 0.0 (TID 13, 10.64.77.44, executor 3, partition 9, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Starting task 11.1 in stage 0.0 (TID 14, 10.64.77.44, executor 1, partition 11, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Starting task 8.1 in stage 0.0 (TID 15, 10.64.77.44, executor 0, partition 8, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Lost task 11.1 in stage 0.0 (TID 14) on 10.64.77.44, executor 1: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 4] 23/04/13 18:33:43 INFO TaskSetManager: Starting task 11.2 in stage 0.0 (TID 16, 10.64.77.44, executor 3, partition 11, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Lost task 7.0 in stage 0.0 (TID 7) on 10.64.77.44, executor 2: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 5] 23/04/13 18:33:43 INFO TaskSetManager: Starting task 7.1 in stage 0.0 (TID 17, 10.64.77.44, executor 2, partition 7, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Lost task 6.0 in stage 0.0 (TID 6) on 10.64.77.44, executor 2: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 6] 23/04/13 18:33:43 INFO TaskSetManager: Starting task 6.1 in stage 0.0 (TID 18, 10.64.77.44, executor 1, partition 6, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Lost task 6.1 in stage 0.0 (TID 18) on 10.64.77.44, executor 1: 
java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 7] 23/04/13 18:33:43 INFO TaskSetManager: Starting task 6.2 in stage 0.0 (TID 19, 10.64.77.44, executor 1, partition 6, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Lost task 7.1 in stage 0.0 (TID 17) on 10.64.77.44, executor 2: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 8] 23/04/13 18:33:43 INFO TaskSetManager: Starting task 7.2 in stage 0.0 (TID 20, 10.64.77.44, executor 3, partition 7, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Lost task 6.2 in stage 0.0 (TID 19) on 10.64.77.44, executor 1: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 9] 23/04/13 18:33:43 INFO TaskSetManager: Starting task 6.3 in stage 0.0 (TID 21, 10.64.77.44, executor 3, partition 6, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:44 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.64.77.44:35243 (size: 1943.0 B, free: 16.9 GiB) 23/04/13 18:33:44 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.64.77.44:40415 (size: 1943.0 B, free: 16.9 GiB) 23/04/13 18:33:45 INFO TaskSetManager: Lost task 8.1 in stage 0.0 (TID 15) on 10.64.77.44, executor 0: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 10] 23/04/13 18:33:45 INFO TaskSetManager: Starting task 8.2 in stage 0.0 (TID 22, 10.64.77.44, executor 2, partition 8, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:45 INFO TaskSetManager: Lost task 10.1 in stage 0.0 (TID 12) on 10.64.77.44, executor 0: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 11] 23/04/13 18:33:45 INFO TaskSetManager: Starting task 10.2 in stage 0.0 (TID 23, 10.64.77.44, executor 2, partition 10, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:46 INFO TaskSetManager: Lost task 9.1 in stage 0.0 (TID 13) on 10.64.77.44, executor 3: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 12] 23/04/13 18:33:46 INFO TaskSetManager: Lost task 11.2 in stage 0.0 (TID 16) on 10.64.77.44, executor 3: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the 
loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 13] 23/04/13 18:33:46 INFO TaskSetManager: Starting task 11.3 in stage 0.0 (TID 24, 10.64.77.44, executor 3, partition 11, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:46 INFO TaskSetManager: Starting task 9.2 in stage 0.0 (TID 25, 10.64.77.44, executor 3, partition 9, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:46 INFO TaskSetManager: Lost task 6.3 in stage 0.0 (TID 21) on 10.64.77.44, executor 3: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 14] 23/04/13 18:33:46 ERROR TaskSetManager: Task 6 in stage 0.0 failed 4 times; aborting job 23/04/13 18:33:46 INFO TaskSetManager: Lost task 7.2 in stage 0.0 (TID 20) on 10.64.77.44, executor 3: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 15] 23/04/13 18:33:46 INFO TaskSchedulerImpl: Cancelling stage 0 23/04/13 18:33:46 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled 23/04/13 18:33:46 INFO TaskSchedulerImpl: Stage 0 was cancelled 23/04/13 18:33:46 INFO DAGScheduler: ResultStage 0 (foreach at ConvertCZITilesToN5Spark.java:210) failed in 8.164 s due to Job aborted due to stage failure: Task 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID 21, 10.64.77.44, executor 3): java.lang.RuntimeException: Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1 at org.janelia.stitching.ConvertCZITilesToN5Spark.lambda$convertTilesToN5$78a89b37$1(ConvertCZITilesToN5Spark.java:229) at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1(JavaRDDLike.scala:351) at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1$adapted(JavaRDDLike.scala:351) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:986) at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:986) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2139) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)

Driver stacktrace: 23/04/13 18:33:46 INFO DAGScheduler: Job 0 failed: foreach at ConvertCZITilesToN5Spark.java:210, took 8.200617 s 23/04/13 18:33:46 INFO SparkUI: Stopped Spark web UI at http://10.64.77.44:4040 23/04/13 18:33:46 INFO StandaloneSchedulerBackend: Shutting down all executors 23/04/13 18:33:46 WARN TaskSetManager: Lost task 8.2 in stage 0.0 (TID 22, 10.64.77.44, executor 2): TaskKilled (Stage cancelled) 23/04/13 18:33:46 WARN TaskSetManager: Lost task 10.2 in stage 0.0 (TID 23, 10.64.77.44, executor 2): TaskKilled (Stage cancelled) 23/04/13 18:33:46 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down 23/04/13 18:33:46 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 23/04/13 18:33:46 INFO MemoryStore: MemoryStore cleared 23/04/13 18:33:46 INFO BlockManager: BlockManager stopped 23/04/13 18:33:46 INFO BlockManagerMaster: BlockManagerMaster stopped 23/04/13 18:33:46 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 23/04/13 18:33:46 INFO SparkContext: Successfully stopped SparkContext Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID 21, 10.64.77.44, executor 3): java.lang.RuntimeException: Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1 at org.janelia.stitching.ConvertCZITilesToN5Spark.lambda$convertTilesToN5$78a89b37$1(ConvertCZITilesToN5Spark.java:229) at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1(JavaRDDLike.scala:351) at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1$adapted(JavaRDDLike.scala:351) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:986) at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:986) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2139) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)

Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2139) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2164) at org.apache.spark.rdd.RDD.$anonfun$foreach$1(RDD.scala:986) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.RDD.foreach(RDD.scala:984) at org.apache.spark.api.java.JavaRDDLike.foreach(JavaRDDLike.scala:351) at org.apache.spark.api.java.JavaRDDLike.foreach$(JavaRDDLike.scala:350) at org.apache.spark.api.java.AbstractJavaRDDLike.foreach(JavaRDDLike.scala:45) at org.janelia.stitching.ConvertCZITilesToN5Spark.convertTilesToN5(ConvertCZITilesToN5Spark.java:210) at org.janelia.stitching.ConvertCZITilesToN5Spark.run(ConvertCZITilesToN5Spark.java:128) at org.janelia.stitching.ConvertCZITilesToN5Spark.main(ConvertCZITilesToN5Spark.java:105) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.RuntimeException: Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series 
(file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1 at org.janelia.stitching.ConvertCZITilesToN5Spark.lambda$convertTilesToN5$78a89b37$1(ConvertCZITilesToN5Spark.java:229) at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1(JavaRDDLike.scala:351) at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1$adapted(JavaRDDLike.scala:351) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:986) at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:986) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2139) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 23/04/13 18:33:46 INFO ShutdownHookManager: Shutdown hook called 23/04/13 18:33:46 INFO ShutdownHookManager: Deleting directory /tmp/spark-1f378436-de2b-41f6-b38f-0369c47da6b1 23/04/13 18:33:46 INFO ShutdownHookManager: Deleting directory /tmp/spark-115d5ddb-7118-4709-871c-2adeba0ad9c1

Environment

Additional context

I have manually adjusted some of the Spark compute allocation to get the pipeline to run. I had already gotten errors because my data files (LHA3_R5_tiny.csv, LHA3_R5_tiny.mvl) were corrupted in /shared_work_dir/outputs/LHA3_R5_tiny/stitching, which I worked around by manually placing those files in that location. This makes me believe there is an issue with file management, either on my cluster or in the pipeline itself. This is supported by the error that /shared_work_dir/spark/LHA3_R5_tiny/.sessionId could not be found. However, the czi2n5.log file seems to give a different explanation.

This is the command I used to run the pipeline:

nextflow run main.nf --shared_work_dir /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir --acq_names LHA3_R5_tiny --ref_acq LHA3_R5_tiny --dapi_channel --runtime_opts "--nv -B /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/inputs -B /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs" --channels c0, c1 --dapi_channel c1

I am allocating the following computing resources: 40 cores, 140 GB of memory, and 2 A40 GPUs, which provide 96 GB of GPU memory. In order to get Spark working on my system I changed a few parameters in /multifish/param_utils.nf, listed below (any higher allocation fails to assign all workers):

Workers: 4
Gb_per_core: 4

cgoina commented 1 year ago

I don't know if this is the real cause, but make sure that the channels parameter is enclosed in double quotes. From what I see in the command you pasted, there's a space between c0, c1, so c1 is ignored; use --channels "c0,c1" instead. We have never tested this pipeline with SLURM here; we have only tested it with LSF and AWS Batch, and from my googling the Hyak cluster uses the SLURM scheduler. Nextflow supports SLURM, so this should run, but we have never tested it. Can you check whether your Spark master and workers are actually running on your Hyak cluster? I would create a profile similar to the lsf profile in nextflow.config and pass in all cluster-specific parameters there. On a side note, you don't need to change the param_utils.nf source file to change the values of those parameters; they can all be passed on the command line, e.g. '--workers 4 --gb_per_core 4'. If you find a SLURM profile that works, please send it back to us and we'll update our configuration.
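For reference, a minimal sketch of what the corrected invocation might look like, based on the command pasted in this report: the channels list is quoted with no space after the comma, the stray valueless --dapi_channel is dropped, and the Spark sizing parameters are passed on the command line instead of editing param_utils.nf. The paths are the ones from this report; adjust them for your site.

nextflow run main.nf \
    --shared_work_dir /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir \
    --acq_names LHA3_R5_tiny \
    --ref_acq LHA3_R5_tiny \
    --channels "c0,c1" \
    --dapi_channel c1 \
    --workers 4 \
    --gb_per_core 4 \
    --runtime_opts "--nv -B /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/inputs -B /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs"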

JienaL commented 1 year ago

Hello, have you solved this problem? I am running into the same error.

krokicki commented 1 year ago

@Alex-de-Lecea A few thoughts:

1) What kind of file system is /mmfs1/gscratch/? You need to use a file system that is accessible to all of the nodes in the cluster. Note that /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi will be a symbolic link to your input data. Make sure that this link is actually working. There may be issues if your data is on another file system.

2) Are you mounting all your paths into the containers? You can do that using --runtime_opts "-B /path/to/shared/work/dir". You will need to mount the input and output directories, and any other directories that are used by the pipeline.

3) We're not using Apptainer yet, but we have experimented with it, and it has some major differences from Singularity, mainly to do with setuid. If your cluster admins have not installed apptainer-suid, you may need to add --writable-tmpfs to your Apptainer invocations via --runtime_opts. A combined sketch of these checks follows below.
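Putting the three points above together, a rough sketch using the paths from this report; the raw-data bind path is a placeholder, and --writable-tmpfs is only needed on non-setuid Apptainer installs:

# 1) From a compute node (e.g. inside an interactive job), check that the .czi symlink
#    created in the stitching output directory actually resolves:
readlink -f /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi
ls -lL /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi

# 2) and 3) Bind the shared work dir and any file system holding the raw data into the
#           containers, and add --writable-tmpfs if Apptainer is installed without setuid:
nextflow run main.nf \
    --shared_work_dir /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir \
    --acq_names LHA3_R5_tiny \
    --ref_acq LHA3_R5_tiny \
    --channels "c0,c1" \
    --dapi_channel c1 \
    --runtime_opts "--nv --writable-tmpfs -B /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir -B /path/to/raw/input/data"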