JaneliaSciComp / multifish

EASI-FISH analysis pipeline for spatial transcriptomics
BSD 3-Clause "New" or "Revised" License

LHA3_R5_tiny Stitching cannot find .sessionID on proprietary cluster #30

Closed: Alex-de-Lecea closed this issue 1 year ago

Alex-de-Lecea commented 1 year ago

Bug report

Description of the problem

I am running the pipeline with the sample data provided in the paper (LHA3_R5_tiny) on my institution's Klone HYAK cluster, and I am encountering issues during the stitching portion of the pipeline. I would appreciate any feedback on this issue.

Log file(s)

Error message

[- ] process > assign_spots - ERROR ~ Error executing process > 'stitching:stitch:run_czi2n5:spark_start_app (1)'

Caused by: Process stitching:stitch:run_czi2n5:spark_start_app (1) terminated with an error exit status (1)

Command executed:

echo "Starting the spark driver"

SESSION_FILE="/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/.sessionId" echo "Checking for $SESSION_FILE" SLEEP_SECS=10 MAX_WAIT_SECS=1000 SECONDS=0

while ! test -e "$SESSION_FILE"; do sleep ${SLEEP_SECS} if (( ${SECONDS} < ${MAX_WAIT_SECS} )); then echo "Waiting for $SESSION_FILE" SECONDS=$(( ${SECONDS} + ${SLEEP_SECS} )) else echo "-------------------------------------------------------------------------------" echo "ERROR: Timed out after ${SECONDS} seconds while waiting for $SESSION_FILE " echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster " echo "-------------------------------------------------------------------------------" exit 1 fi done

if ! grep -F -x -q "5f44f17a-5b05-47a7-b1a7-c6aeea9a7b53" $SESSION_FILE then echo "------------------------------------------------------------------------------" echo "ERROR: session id in $SESSION_FILE does not match current session " echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster" echo "and that you are not running multiple pipelines with the same --spark_work_dir" echo "------------------------------------------------------------------------------" exit 1 fi

export SPARK_ENV_LOADED= export SPARK_HOME=/spark export PYSPARK_PYTHONPATH_SET= export PYTHONPATH="/spark/python" export SPARK_LOG_DIR="/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny"

. "/spark/sbin/spark-config.sh" . "/spark/bin/load-spark-env.sh"

SPARK_LOCAL_IP=hostname -i | rev | cut -d' ' -f1 | rev echo "Use Spark IP: $SPARK_LOCAL_IP"

echo " /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/spark-defaults.conf --conf spark.driver.host=${SPARK_LOCAL_IP} --conf spark.driver.bindAddress=${SPARK_LOCAL_IP} --master spark://10.64.77.44:7077 --class org.janelia.stitching.ConvertCZITilesToN5Spark --conf spark.executor.cores=8 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=32 --executor-memory 32g --conf spark.driver.cores=1 --driver-memory 2g /app/app.jar -i /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/tiles.json --blockSize '128,128,64' "

/spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/spark-defaults.conf --conf spark.driver.host=${SPARK_LOCAL_IP} --conf spark.driver.bindAddress=${SPARK_LOCAL_IP} --master spark://10.64.77.44:7077 --class org.janelia.stitching.ConvertCZITilesToN5Spark --conf spark.executor.cores=8 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=32 --executor-memory 32g --conf spark.driver.cores=1 --driver-memory 2g /app/app.jar -i /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/tiles.json --blockSize '128,128,64' &> /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/czi2n5.log

Command exit status: 1

Command output:

Starting the spark driver
Checking for /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/.sessionId

executor > local (15)
[77/6cb27a] process > download (1) [100%] 1 of 1 ✔
[54/44824e] process > stitching:prepare_stitching_data (1) [100%] 1 of 1 ✔
[a1/f188e2] process > stitching:stitch:spark_cluster:prepare_spark_work_dir (1) [100%] 1 of 1 ✔
[- ] process > stitching:stitch:spark_cluster:spark_master (1) -
[b0/b9370f] process > stitching:stitch:spark_cluster:wait_for_master (1) [100%] 1 of 1 ✔
[- ] process > stitching:stitch:spark_cluster:spark_worker (2) -
[42/c32a32] process > stitching:stitch:spark_cluster:wait_for_worker (1) [100%] 4 of 4 ✔
[6f/999881] process > stitching:stitch:run_parse_czi_tiles:spark_start_app (1) [100%] 1 of 1 ✔
[ea/79bc41] process > stitching:stitch:run_czi2n5:spark_start_app (1) [100%] 1 of 1, failed: 1 ✘
[- ] process > stitching:stitch:run_flatfield_correction:spark_start_app -
[- ] process > stitching:stitch:run_retile:spark_start_app -
[- ] process > stitching:stitch:run_stitching:spark_start_app -
[- ] process > stitching:stitch:run_fuse:spark_start_app -
[- ] process > stitching:stitch:terminate_stitching -
[- ] process > spot_extraction:airlocalize:cut_tiles -
[- ] process > spot_extraction:airlocalize:run_airlocalize -
[- ] process > spot_extraction:airlocalize:merge_points -
[- ] process > segmentation:predict -
[- ] process > registration:cut_tiles -
[- ] process > registration:fixed_coarse_spots -
[- ] process > registration:moving_coarse_spots -
[- ] process > registration:coarse_ransac -
[- ] process > registration:apply_transform_at_aff_scale -
[- ] process > registration:apply_transform_at_def_scale -
[- ] process > registration:fixed_spots -
[- ] process > registration:moving_spots -
[- ] process > registration:ransac_for_tile -
[- ] process > registration:interpolate_affines -
[- ] process > registration:deform -
[- ] process > registration:stitch -
[- ] process > registration:final_transform -
[- ] process > collect_merge_points:collect_merged_points_files -
[- ] process > warp_spots:apply_transform -
[- ] process > measure_intensities -
[- ] process > assign_spots -
Pulling Singularity image docker://public.ecr.aws/janeliascicomp/multifish/segmentation:1.0.0 [cache /mmfs1/home/alexidl/.singularity_cache/public.ecr.aws-janeliascicomp-multifish-segmentation-1.0.0.img]
Pulling Singularity image docker://public.ecr.aws/janeliascicomp/multifish/spot_extraction:1.1.0 [cache /mmfs1/home/alexidl/.singularity_cache/public.ecr.aws-janeliascicomp-multifish-spot_extraction-1.1.0.img]
ERROR ~ Error executing process > 'stitching:stitch:run_czi2n5:spark_start_app (1)'

Caused by: Process stitching:stitch:run_czi2n5:spark_start_app (1) terminated with an error exit status (1)

Command executed:

echo "Starting the spark driver"

SESSION_FILE="/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/.sessionId" echo "Checking for $SESSION_FILE" SLEEP_SECS=10 MAX_WAIT_SECS=1000 SECONDS=0

while ! test -e "$SESSION_FILE"; do sleep ${SLEEP_SECS} if (( ${SECONDS} < ${MAX_WAIT_SECS} )); then echo "Waiting for $SESSION_FILE" SECONDS=$(( ${SECONDS} + ${SLEEP_SECS} )) else echo "-------------------------------------------------------------------------------" echo "ERROR: Timed out after ${SECONDS} seconds while waiting for $SESSION_FILE " echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster " echo "-------------------------------------------------------------------------------" exit 1 fi done

if ! grep -F -x -q "5f44f17a-5b05-47a7-b1a7-c6aeea9a7b53" $SESSION_FILE then echo "------------------------------------------------------------------------------" echo "ERROR: session id in $SESSION_FILE does not match current session " echo "Make sure that your --spark_work_dir is accessible to all nodes in the cluster" echo "and that you are not running multiple pipelines with the same --spark_work_dir" echo "------------------------------------------------------------------------------" exit 1 fi

export SPARK_ENV_LOADED= export SPARK_HOME=/spark export PYSPARK_PYTHONPATH_SET= export PYTHONPATH="/spark/python" export SPARK_LOG_DIR="/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny"

. "/spark/sbin/spark-config.sh" . "/spark/bin/load-spark-env.sh"

SPARK_LOCAL_IP=hostname -i | rev | cut -d' ' -f1 | rev echo "Use Spark IP: $SPARK_LOCAL_IP"

echo " /spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/spark-defaults.conf --conf spark.driver.host=${SPARK_LOCAL_IP} --conf spark.driver.bindAddress=${SPARK_LOCAL_IP} --master spark://10.64.77.44:7077 --class org.janelia.stitching.ConvertCZITilesToN5Spark --conf spark.executor.cores=8 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=32 --executor-memory 32g --conf spark.driver.cores=1 --driver-memory 2g /app/app.jar -i /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/tiles.json --blockSize '128,128,64' "

/spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/spark-defaults.conf --conf spark.driver.host=${SPARK_LOCAL_IP} --conf spark.driver.bindAddress=${SPARK_LOCAL_IP} --master spark://10.64.77.44:7077 --class org.janelia.stitching.ConvertCZITilesToN5Spark --conf spark.executor.cores=8 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=32 --executor-memory 32g --conf spark.driver.cores=1 --driver-memory 2g /app/app.jar -i /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/tiles.json --blockSize '128,128,64' &> /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/czi2n5.log

Command exit status: 1

Command output:

Starting the spark driver
Checking for /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/.sessionId
Use Spark IP: 10.64.77.44
/spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/spark-defaults.conf --conf spark.driver.host=10.64.77.44 --conf spark.driver.bindAddress=10.64.77.44 --master spark://10.64.77.44:7077 --class org.janelia.stitching.ConvertCZITilesToN5Spark --conf spark.executor.cores=8 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=32 --executor-memory 32g --conf spark.driver.cores=1 --driver-memory 2g /app/app.jar -i /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/tiles.json --blockSize '128,128,64'

Command error:

INFO: Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (429) bind mounts
Starting the spark driver
Checking for /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/.sessionId
Use Spark IP: 10.64.77.44
/spark/bin/spark-class org.apache.spark.deploy.SparkSubmit --properties-file /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/spark/LHA3_R5_tiny/spark-defaults.conf --conf spark.driver.host=10.64.77.44 --conf spark.driver.bindAddress=10.64.77.44 --master spark://10.64.77.44:7077 --class org.janelia.stitching.ConvertCZITilesToN5Spark --conf spark.executor.cores=8 --conf spark.files.openCostInBytes=0 --conf spark.default.parallelism=32 --executor-memory 32g --conf spark.driver.cores=1 --driver-memory 2g /app/app.jar -i /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/tiles.json --blockSize '128,128,64'

Work dir: /mmfs1/gscratch/scrubbed/alexidl/multifish/work/ea/79bc41d1f1b43abfa1b9690085a181

Tip: view the complete command output by changing to the process work dir and entering the command cat .command.out

-- Check '.nextflow.log' file for details

czi2n5.log

23/04/13 18:33:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Converting tiles to N5... Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 23/04/13 18:33:35 INFO SparkContext: Running Spark version 3.0.1 23/04/13 18:33:35 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN). 23/04/13 18:33:35 INFO ResourceUtils: ============================================================== 23/04/13 18:33:35 INFO ResourceUtils: Resources for spark.driver:

23/04/13 18:33:35 INFO ResourceUtils: ============================================================== 23/04/13 18:33:35 INFO SparkContext: Submitted application: ConvertCZITilesToN5Spark 23/04/13 18:33:35 INFO SecurityManager: Changing view acls to: alexidl 23/04/13 18:33:36 INFO SecurityManager: Changing modify acls to: alexidl 23/04/13 18:33:36 INFO SecurityManager: Changing view acls groups to: 23/04/13 18:33:36 INFO SecurityManager: Changing modify acls groups to: 23/04/13 18:33:36 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(alexidl); groups with view permissions: Set(); users with modify permissions: Set(alexidl); groups with modify permissions: Set() 23/04/13 18:33:36 INFO Utils: Successfully started service 'sparkDriver' on port 40731. 23/04/13 18:33:36 INFO SparkEnv: Registering MapOutputTracker 23/04/13 18:33:36 INFO SparkEnv: Registering BlockManagerMaster 23/04/13 18:33:36 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 23/04/13 18:33:36 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 23/04/13 18:33:36 INFO SparkEnv: Registering BlockManagerMasterHeartbeat 23/04/13 18:33:36 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-e7c9f7a7-6d19-4059-9fcc-4f15dc408364 23/04/13 18:33:36 INFO MemoryStore: MemoryStore started with capacity 997.8 MiB 23/04/13 18:33:36 INFO SparkEnv: Registering OutputCommitCoordinator 23/04/13 18:33:36 INFO Utils: Successfully started service 'SparkUI' on port 4040. 23/04/13 18:33:36 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.64.77.44:4040 23/04/13 18:33:36 INFO SparkContext: Added JAR file:/app/app.jar at spark://10.64.77.44:40731/jars/app.jar with timestamp 1681410816673 23/04/13 18:33:36 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://10.64.77.44:7077... 
23/04/13 18:33:36 INFO TransportClientFactory: Successfully created connection to /10.64.77.44:7077 after 26 ms (0 ms spent in bootstraps) 23/04/13 18:33:36 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20230413183336-0001 23/04/13 18:33:36 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20230413183336-0001/0 on worker-20230413183309-10.64.77.44-40827 (10.64.77.44:40827) with 8 core(s) 23/04/13 18:33:36 INFO StandaloneSchedulerBackend: Granted executor ID app-20230413183336-0001/0 on hostPort 10.64.77.44:40827 with 8 core(s), 32.0 GiB RAM 23/04/13 18:33:36 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20230413183336-0001/1 on worker-20230413183309-10.64.77.44-39913 (10.64.77.44:39913) with 8 core(s) 23/04/13 18:33:36 INFO StandaloneSchedulerBackend: Granted executor ID app-20230413183336-0001/1 on hostPort 10.64.77.44:39913 with 8 core(s), 32.0 GiB RAM 23/04/13 18:33:36 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20230413183336-0001/2 on worker-20230413183309-10.64.77.44-43963 (10.64.77.44:43963) with 8 core(s) 23/04/13 18:33:36 INFO StandaloneSchedulerBackend: Granted executor ID app-20230413183336-0001/2 on hostPort 10.64.77.44:43963 with 8 core(s), 32.0 GiB RAM 23/04/13 18:33:36 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20230413183336-0001/3 on worker-20230413183309-10.64.77.44-42821 (10.64.77.44:42821) with 8 core(s) 23/04/13 18:33:36 INFO StandaloneSchedulerBackend: Granted executor ID app-20230413183336-0001/3 on hostPort 10.64.77.44:42821 with 8 core(s), 32.0 GiB RAM 23/04/13 18:33:37 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34801. 23/04/13 18:33:37 INFO NettyBlockTransferService: Server created on 10.64.77.44:34801 23/04/13 18:33:37 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 23/04/13 18:33:37 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.64.77.44, 34801, None) 23/04/13 18:33:37 INFO BlockManagerMasterEndpoint: Registering block manager 10.64.77.44:34801 with 997.8 MiB RAM, BlockManagerId(driver, 10.64.77.44, 34801, None) 23/04/13 18:33:37 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.64.77.44, 34801, None) 23/04/13 18:33:37 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.64.77.44, 34801, None) 23/04/13 18:33:37 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20230413183336-0001/2 is now RUNNING 23/04/13 18:33:37 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20230413183336-0001/3 is now RUNNING 23/04/13 18:33:37 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20230413183336-0001/0 is now RUNNING 23/04/13 18:33:37 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20230413183336-0001/1 is now RUNNING 23/04/13 18:33:37 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 Detected 2 channels 23/04/13 18:33:37 INFO SparkContext: Starting job: foreach at ConvertCZITilesToN5Spark.java:210 23/04/13 18:33:37 INFO DAGScheduler: Got job 0 (foreach at ConvertCZITilesToN5Spark.java:210) with 12 output partitions 23/04/13 18:33:37 INFO DAGScheduler: Final stage: ResultStage 0 (foreach at ConvertCZITilesToN5Spark.java:210) 23/04/13 18:33:37 INFO DAGScheduler: Parents of final stage: List() 23/04/13 18:33:37 INFO DAGScheduler: Missing parents: List() 23/04/13 
18:33:37 INFO DAGScheduler: Submitting ResultStage 0 (ParallelCollectionRDD[0] at parallelize at ConvertCZITilesToN5Spark.java:209), which has no missing parents 23/04/13 18:33:37 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.2 KiB, free 997.8 MiB) 23/04/13 18:33:38 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1943.0 B, free 997.8 MiB) 23/04/13 18:33:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.64.77.44:34801 (size: 1943.0 B, free: 997.8 MiB) 23/04/13 18:33:38 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1223 23/04/13 18:33:38 INFO DAGScheduler: Submitting 12 missing tasks from ResultStage 0 (ParallelCollectionRDD[0] at parallelize at ConvertCZITilesToN5Spark.java:209) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)) 23/04/13 18:33:38 INFO TaskSchedulerImpl: Adding task set 0.0 with 12 tasks 23/04/13 18:33:38 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 8, script: , vendor: , memory -> name: memory, amount: 32768, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0) 23/04/13 18:33:38 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.64.77.44:60916) with ID 2 23/04/13 18:33:38 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.64.77.44:60914) with ID 3 23/04/13 18:33:38 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.64.77.44:60920) with ID 0 23/04/13 18:33:38 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.64.77.44:60918) with ID 1 23/04/13 18:33:38 INFO BlockManagerMasterEndpoint: Registering block manager 10.64.77.44:44029 with 16.9 GiB RAM, BlockManagerId(2, 10.64.77.44, 44029, None) 23/04/13 18:33:38 INFO BlockManagerMasterEndpoint: Registering block manager 10.64.77.44:35243 with 16.9 GiB RAM, BlockManagerId(3, 10.64.77.44, 35243, None) 23/04/13 18:33:38 INFO BlockManagerMasterEndpoint: Registering block manager 10.64.77.44:42331 with 16.9 GiB RAM, BlockManagerId(1, 10.64.77.44, 42331, None) 23/04/13 18:33:38 INFO BlockManagerMasterEndpoint: Registering block manager 10.64.77.44:40415 with 16.9 GiB RAM, BlockManagerId(0, 10.64.77.44, 40415, None) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.64.77.44, executor 2, partition 0, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.64.77.44, executor 2, partition 1, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.64.77.44, executor 2, partition 2, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 10.64.77.44, executor 2, partition 3, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 10.64.77.44, executor 2, partition 4, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 10.64.77.44, executor 2, partition 5, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, 10.64.77.44, executor 2, partition 6, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:39 INFO TaskSetManager: 
Starting task 7.0 in stage 0.0 (TID 7, 10.64.77.44, executor 2, partition 7, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, 10.64.77.44, executor 1, partition 8, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9, 10.64.77.44, executor 1, partition 9, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 10.0 in stage 0.0 (TID 10, 10.64.77.44, executor 1, partition 10, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:39 INFO TaskSetManager: Starting task 11.0 in stage 0.0 (TID 11, 10.64.77.44, executor 1, partition 11, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.64.77.44:44029 (size: 1943.0 B, free: 16.9 GiB) 23/04/13 18:33:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.64.77.44:42331 (size: 1943.0 B, free: 16.9 GiB) 23/04/13 18:33:43 WARN TaskSetManager: Lost task 8.0 in stage 0.0 (TID 8, 10.64.77.44, executor 1): java.lang.RuntimeException: Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1 at org.janelia.stitching.ConvertCZITilesToN5Spark.lambda$convertTilesToN5$78a89b37$1(ConvertCZITilesToN5Spark.java:229) at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1(JavaRDDLike.scala:351) at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1$adapted(JavaRDDLike.scala:351) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:986) at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:986) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2139) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)

23/04/13 18:33:43 INFO TaskSetManager: Lost task 11.0 in stage 0.0 (TID 11) on 10.64.77.44, executor 1: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 1] 23/04/13 18:33:43 INFO TaskSetManager: Lost task 9.0 in stage 0.0 (TID 9) on 10.64.77.44, executor 1: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 2] 23/04/13 18:33:43 INFO TaskSetManager: Lost task 10.0 in stage 0.0 (TID 10) on 10.64.77.44, executor 1: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 3] 23/04/13 18:33:43 INFO TaskSetManager: Starting task 10.1 in stage 0.0 (TID 12, 10.64.77.44, executor 0, partition 10, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Starting task 9.1 in stage 0.0 (TID 13, 10.64.77.44, executor 3, partition 9, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Starting task 11.1 in stage 0.0 (TID 14, 10.64.77.44, executor 1, partition 11, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Starting task 8.1 in stage 0.0 (TID 15, 10.64.77.44, executor 0, partition 8, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Lost task 11.1 in stage 0.0 (TID 14) on 10.64.77.44, executor 1: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 4] 23/04/13 18:33:43 INFO TaskSetManager: Starting task 11.2 in stage 0.0 (TID 16, 10.64.77.44, executor 3, partition 11, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Lost task 7.0 in stage 0.0 (TID 7) on 10.64.77.44, executor 2: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 5] 23/04/13 18:33:43 INFO TaskSetManager: Starting task 7.1 in stage 0.0 (TID 17, 10.64.77.44, executor 2, partition 7, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Lost task 6.0 in stage 0.0 (TID 6) on 10.64.77.44, executor 2: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 6] 23/04/13 18:33:43 INFO TaskSetManager: Starting task 6.1 in stage 0.0 (TID 18, 10.64.77.44, executor 1, partition 6, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Lost task 6.1 in stage 0.0 (TID 18) on 10.64.77.44, executor 1: 
java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 7] 23/04/13 18:33:43 INFO TaskSetManager: Starting task 6.2 in stage 0.0 (TID 19, 10.64.77.44, executor 1, partition 6, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Lost task 7.1 in stage 0.0 (TID 17) on 10.64.77.44, executor 2: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 8] 23/04/13 18:33:43 INFO TaskSetManager: Starting task 7.2 in stage 0.0 (TID 20, 10.64.77.44, executor 3, partition 7, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:43 INFO TaskSetManager: Lost task 6.2 in stage 0.0 (TID 19) on 10.64.77.44, executor 1: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 9] 23/04/13 18:33:43 INFO TaskSetManager: Starting task 6.3 in stage 0.0 (TID 21, 10.64.77.44, executor 3, partition 6, PROCESS_LOCAL, 7450 bytes) 23/04/13 18:33:44 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.64.77.44:35243 (size: 1943.0 B, free: 16.9 GiB) 23/04/13 18:33:44 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.64.77.44:40415 (size: 1943.0 B, free: 16.9 GiB) 23/04/13 18:33:45 INFO TaskSetManager: Lost task 8.1 in stage 0.0 (TID 15) on 10.64.77.44, executor 0: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 10] 23/04/13 18:33:45 INFO TaskSetManager: Starting task 8.2 in stage 0.0 (TID 22, 10.64.77.44, executor 2, partition 8, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:45 INFO TaskSetManager: Lost task 10.1 in stage 0.0 (TID 12) on 10.64.77.44, executor 0: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 11] 23/04/13 18:33:45 INFO TaskSetManager: Starting task 10.2 in stage 0.0 (TID 23, 10.64.77.44, executor 2, partition 10, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:46 INFO TaskSetManager: Lost task 9.1 in stage 0.0 (TID 13) on 10.64.77.44, executor 3: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 12] 23/04/13 18:33:46 INFO TaskSetManager: Lost task 11.2 in stage 0.0 (TID 16) on 10.64.77.44, executor 3: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the 
loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 13] 23/04/13 18:33:46 INFO TaskSetManager: Starting task 11.3 in stage 0.0 (TID 24, 10.64.77.44, executor 3, partition 11, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:46 INFO TaskSetManager: Starting task 9.2 in stage 0.0 (TID 25, 10.64.77.44, executor 3, partition 9, PROCESS_LOCAL, 7452 bytes) 23/04/13 18:33:46 INFO TaskSetManager: Lost task 6.3 in stage 0.0 (TID 21) on 10.64.77.44, executor 3: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 14] 23/04/13 18:33:46 ERROR TaskSetManager: Task 6 in stage 0.0 failed 4 times; aborting job 23/04/13 18:33:46 INFO TaskSetManager: Lost task 7.2 in stage 0.0 (TID 20) on 10.64.77.44, executor 3: java.lang.RuntimeException (Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1) [duplicate 15] 23/04/13 18:33:46 INFO TaskSchedulerImpl: Cancelling stage 0 23/04/13 18:33:46 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled 23/04/13 18:33:46 INFO TaskSchedulerImpl: Stage 0 was cancelled 23/04/13 18:33:46 INFO DAGScheduler: ResultStage 0 (foreach at ConvertCZITilesToN5Spark.java:210) failed in 8.164 s due to Job aborted due to stage failure: Task 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID 21, 10.64.77.44, executor 3): java.lang.RuntimeException: Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1 at org.janelia.stitching.ConvertCZITilesToN5Spark.lambda$convertTilesToN5$78a89b37$1(ConvertCZITilesToN5Spark.java:229) at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1(JavaRDDLike.scala:351) at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1$adapted(JavaRDDLike.scala:351) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:986) at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:986) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2139) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)

Driver stacktrace: 23/04/13 18:33:46 INFO DAGScheduler: Job 0 failed: foreach at ConvertCZITilesToN5Spark.java:210, took 8.200617 s 23/04/13 18:33:46 INFO SparkUI: Stopped Spark web UI at http://10.64.77.44:4040 23/04/13 18:33:46 INFO StandaloneSchedulerBackend: Shutting down all executors 23/04/13 18:33:46 WARN TaskSetManager: Lost task 8.2 in stage 0.0 (TID 22, 10.64.77.44, executor 2): TaskKilled (Stage cancelled) 23/04/13 18:33:46 WARN TaskSetManager: Lost task 10.2 in stage 0.0 (TID 23, 10.64.77.44, executor 2): TaskKilled (Stage cancelled) 23/04/13 18:33:46 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down 23/04/13 18:33:46 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 23/04/13 18:33:46 INFO MemoryStore: MemoryStore cleared 23/04/13 18:33:46 INFO BlockManager: BlockManager stopped 23/04/13 18:33:46 INFO BlockManagerMaster: BlockManagerMaster stopped 23/04/13 18:33:46 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 23/04/13 18:33:46 INFO SparkContext: Successfully stopped SparkContext Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID 21, 10.64.77.44, executor 3): java.lang.RuntimeException: Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series (file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1 at org.janelia.stitching.ConvertCZITilesToN5Spark.lambda$convertTilesToN5$78a89b37$1(ConvertCZITilesToN5Spark.java:229) at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1(JavaRDDLike.scala:351) at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1$adapted(JavaRDDLike.scala:351) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:986) at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:986) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2139) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)

Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973) at scala.Option.foreach(Option.scala:407) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2139) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2164) at org.apache.spark.rdd.RDD.$anonfun$foreach$1(RDD.scala:986) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.RDD.foreach(RDD.scala:984) at org.apache.spark.api.java.JavaRDDLike.foreach(JavaRDDLike.scala:351) at org.apache.spark.api.java.JavaRDDLike.foreach$(JavaRDDLike.scala:350) at org.apache.spark.api.java.AbstractJavaRDDLike.foreach(JavaRDDLike.scala:45) at org.janelia.stitching.ConvertCZITilesToN5Spark.convertTilesToN5(ConvertCZITilesToN5Spark.java:210) at org.janelia.stitching.ConvertCZITilesToN5Spark.run(ConvertCZITilesToN5Spark.java:128) at org.janelia.stitching.ConvertCZITilesToN5Spark.main(ConvertCZITilesToN5Spark.java:105) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.RuntimeException: Identified that all tile images are stored in a single .czi container, but there are not enough images in the loaded image series 
(file=/mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi, numImages=1, tileIndex=1 at org.janelia.stitching.ConvertCZITilesToN5Spark.lambda$convertTilesToN5$78a89b37$1(ConvertCZITilesToN5Spark.java:229) at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1(JavaRDDLike.scala:351) at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1$adapted(JavaRDDLike.scala:351) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:986) at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:986) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2139) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 23/04/13 18:33:46 INFO ShutdownHookManager: Shutdown hook called 23/04/13 18:33:46 INFO ShutdownHookManager: Deleting directory /tmp/spark-1f378436-de2b-41f6-b38f-0369c47da6b1 23/04/13 18:33:46 INFO ShutdownHookManager: Deleting directory /tmp/spark-115d5ddb-7118-4709-871c-2adeba0ad9c1

Environment

Additional context

I have manually adjusted some of the Spark compute allocation to get the pipeline to run. I had already gotten errors because my data files (LHA3_R5_tiny.csv, LHA3_R5_tiny.mvl) were corrupted in /shared_work_dir/outputs/LHA3_R5_tiny/stitching, which I worked around by manually placing those files in that location. This makes me believe there is an issue with file management, either on my cluster or in the pipeline itself. This is supported by the error that /shared_work_dir/spark/LHA3_R5_tiny/.sessionId could not be found. However, the czi2n5.log file seems to give a different explanation.

This is the command I used to run the pipeline:

nextflow run main.nf --shared_work_dir /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir --acq_names LHA3_R5_tiny --ref_acq LHA3_R5_tiny --dapi_channel --runtime_opts "--nv -B /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/inputs -B /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs" --channels c0, c1 --dapi_channel c1

I am allocating the following computing resources: 40 cores, 140 GB of memory, and 2 A40 GPUs, which provide 96 GB of GPU memory. In order to get Spark working on my system I changed a few parameters in /multifish/param_utils.nf, listed below (any higher allocation fails to assign all workers):

Workers: 4
Gb_per_core: 4

cgoina commented 1 year ago

I don't know if this is the real cause, but make sure that the channels parameter is enclosed in double quotes. From what I see in the command you pasted, there's a space between c0, c1, so c1 is ignored; use --channels "c0,c1" instead. We have never tested this pipeline with SLURM here; we have only tested it with LSF and AWS Batch, and from my googling the Hyak cluster uses the SLURM scheduler. Nextflow supports SLURM, so this should run, but we have never tested it. Can you check whether your Spark master and workers are actually running on your Hyak cluster? I would create a profile similar to the lsf profile in nextflow.config and pass in all cluster-specific parameters there. On a side note, you don't need to change the param_utils.nf source file to change the values of those parameters; they can all be passed on the command line, e.g. '--workers 4 --gb_per_core 4'. If you find a SLURM profile that works, please send it back to us and we'll update our configuration.
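For reference, a minimal sketch of what the corrected invocation might look like, based on the command pasted in this report: the channels list is quoted with no space after the comma, the stray valueless --dapi_channel is dropped, and the Spark sizing parameters are passed on the command line instead of editing param_utils.nf. The paths are the ones from this report; adjust them for your site.

nextflow run main.nf \
    --shared_work_dir /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir \
    --acq_names LHA3_R5_tiny \
    --ref_acq LHA3_R5_tiny \
    --channels "c0,c1" \
    --dapi_channel c1 \
    --workers 4 \
    --gb_per_core 4 \
    --runtime_opts "--nv -B /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/inputs -B /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs"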

JienaL commented 1 year ago

Hello, have you solved this problem? I am running into the same error.

krokicki commented 1 year ago

@Alex-de-Lecea A few thoughts:

1) What kind of file system is /mmfs1/gscratch/? You need to use a file system that is accessible to all of the nodes in the cluster. Note that /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi will be a symbolic link to your input data. Make sure that this link is actually working. There may be issues if your data is on another file system.

2) Are you mounting all your paths into the containers? You can do that using --runtime_opts "-B /path/to/shared/work/dir". You will need to mount the input and output directories, and any other directories that are used by the pipeline.

3) We're not using Apptainer yet, but we have experimented with it, and it has some major differences from Singularity, mainly to do with setuid. If your cluster admins have not installed apptainer-suid, you may need to add --writable-tmpfs to your Apptainer invocations via --runtime_opts. A combined sketch of these checks follows below.
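Putting the three points above together, a rough sketch using the paths from this report; the raw-data bind path is a placeholder, and --writable-tmpfs is only needed on non-setuid Apptainer installs:

# 1) From a compute node (e.g. inside an interactive job), check that the .czi symlink
#    created in the stitching output directory actually resolves:
readlink -f /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi
ls -lL /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir/outputs/LHA3_R5_tiny/stitching/LHA3_R5_tiny.czi

# 2) and 3) Bind the shared work dir and any file system holding the raw data into the
#           containers, and add --writable-tmpfs if Apptainer is installed without setuid:
nextflow run main.nf \
    --shared_work_dir /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir \
    --acq_names LHA3_R5_tiny \
    --ref_acq LHA3_R5_tiny \
    --channels "c0,c1" \
    --dapi_channel c1 \
    --runtime_opts "--nv --writable-tmpfs -B /mmfs1/gscratch/scrubbed/alexidl/shared_work_dir -B /path/to/raw/input/data"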