JaneliaSciComp / BigStitcher-Spark

Running compute-intense parts of BigStitcher distributed
BSD 2-Clause "Simplified" License

Fusion fails on local Spark instance #5

Closed · dpshepherd closed this issue 2 years ago

dpshepherd commented 2 years ago

Hi @trautmane,

As requested, here are the details on what we are running into trying to fuse a BDV file using BigStitcher-Spark. The plugin was built with the code changes on main, but not fix_bdv_n5 as I got a conflict when I tried to merge the two branches. I've attached the XML as well.

It wasn't totally clear whether the extraJavaOptions should be passed to the driver or to the executors when running in local mode, so we tried both; the same error message pasted below pops up either way. We also tried allocating more RAM to the executors, with the same result.
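For completeness, our understanding (an assumption on our side, which may be wrong) is that in `--master local[...]` mode the executor tasks run inside the driver JVM, so `spark.executor.extraJavaOptions` may never reach a separate process; the driver-side variant we tried looked roughly like this:

```bash
# Sketch (our assumption): in local mode, JVM flags would need to go to the driver JVM,
# e.g. via --driver-java-options, since executors share the driver process.
spark-submit --master local[32,8] \
  --conf spark.driver.memory=100G \
  --driver-java-options "-XX:ActiveProcessorCount=1" \
  --class net.preibisch.bigstitcher.spark.AffineFusion \
  ~/Documents/github/BigStitcher-Spark/target/BigStitcher-Spark-0.0.1-SNAPSHOT.jar \
  -x /mnt/opm2/20210924b/deskew_flatfield_output/bdv/AMC_cy7_test_bdv.xml \
  -o /mnt/opm2/20210924b/n5/output.n5 \
  -d /DAPI/s0 --channelId 0 --UINT16 --minIntensity 0 --maxIntensity 65535
```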

The error usually occurs once >6,000 files within the N5 have been written. In this particular case, ~7,200 files were written.

Please let me know what other information I can provide.

Thanks! Doug

Linux version

```
Linux qi2labserver 5.4.0-74-generic #83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
```
Maven and Java details

```
Apache Maven 3.6.0
Maven home: /usr/share/maven
Java version: 1.8.0_312, vendor: Private Build, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "5.4.0-74-generic", arch: "amd64", family: "unix"
```
Spark details

```
22/01/04 11:54:33 WARN Utils: Your hostname, qi2labserver resolves to a loopback address: 127.0.1.1; using 10.206.25.77 instead (on interface enp5s0f0)
22/01/04 11:54:33 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.0
      /_/

Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 1.8.0_312
Branch HEAD
Compiled by user ubuntu on 2021-10-06T12:46:30Z
Revision 5d45a415f3a29898d92380380cfd82bfc7f579ea
Url https://github.com/apache/spark
Type --help for more information.
```
Call to Spark

```bash
spark-submit --master local[32,8] \
  --conf spark.driver.memory=100G \
  --conf "spark.executor.extraJavaOptions=-XX:ActiveProcessorCount=1" \
  --class net.preibisch.bigstitcher.spark.AffineFusion \
  ~/Documents/github/BigStitcher-Spark/target/BigStitcher-Spark-0.0.1-SNAPSHOT.jar \
  -x /mnt/opm2/20210924b/deskew_flatfield_output/bdv/AMC_cy7_test_bdv.xml \
  -o /mnt/opm2/20210924b/n5/output.n5 \
  -d /DAPI/s0 --channelId 0 --UINT16 --minIntensity 0 --maxIntensity 65535
```
Error message

```
(Tue Jan 04 11:21:37 MST 2022): Requesting Img from ImgLoader (tp=0, setup=7), using level=0, [1.0 x 1.0 x 1.0]
22/01/04 11:21:37 ERROR Executor: Exception in task 28.0 in stage 0.0 (TID 28)
java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:717)
	at net.imglib2.cache.queue.FetcherThreads.<init>(FetcherThreads.java:92)
	at net.imglib2.cache.queue.FetcherThreads.<init>(FetcherThreads.java:70)
	at bdv.img.hdf5.Hdf5ImageLoader.open(Hdf5ImageLoader.java:209)
	at bdv.img.hdf5.Hdf5ImageLoader.<init>(Hdf5ImageLoader.java:158)
	at bdv.img.hdf5.Hdf5ImageLoader.<init>(Hdf5ImageLoader.java:144)
	at bdv.img.hdf5.Hdf5ImageLoader.<init>(Hdf5ImageLoader.java:139)
	at bdv.img.hdf5.XmlIoHdf5ImageLoader.fromXml(XmlIoHdf5ImageLoader.java:70)
	at bdv.img.hdf5.XmlIoHdf5ImageLoader.fromXml(XmlIoHdf5ImageLoader.java:49)
	at mpicbg.spim.data.generic.sequence.XmlIoAbstractSequenceDescription.fromXml(XmlIoAbstractSequenceDescription.java:111)
	at mpicbg.spim.data.generic.XmlIoAbstractSpimData.fromXml(XmlIoAbstractSpimData.java:153)
	at net.preibisch.mvrecon.fiji.spimdata.XmlIoSpimData2.fromXml(XmlIoSpimData2.java:164)
	at net.preibisch.mvrecon.fiji.spimdata.XmlIoSpimData2.fromXml(XmlIoSpimData2.java:52)
	at mpicbg.spim.data.generic.XmlIoAbstractSpimData.load(XmlIoAbstractSpimData.java:95)
	at net.preibisch.bigstitcher.spark.AffineFusion.lambda$call$c48314ca$1(AffineFusion.java:208)
	at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1(JavaRDDLike.scala:352)
	at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1$adapted(JavaRDDLike.scala:352)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1012)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1012)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
22/01/04 11:21:37 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 28.0 in stage 0.0 (TID 28),5,main]
java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:717)
	at net.imglib2.cache.queue.FetcherThreads.<init>(FetcherThreads.java:92)
	at net.imglib2.cache.queue.FetcherThreads.<init>(FetcherThreads.java:70)
	at bdv.img.hdf5.Hdf5ImageLoader.open(Hdf5ImageLoader.java:209)
	at bdv.img.hdf5.Hdf5ImageLoader.<init>(Hdf5ImageLoader.java:158)
	at bdv.img.hdf5.Hdf5ImageLoader.<init>(Hdf5ImageLoader.java:144)
	at bdv.img.hdf5.Hdf5ImageLoader.<init>(Hdf5ImageLoader.java:139)
	at bdv.img.hdf5.XmlIoHdf5ImageLoader.fromXml(XmlIoHdf5ImageLoader.java:70)
	at bdv.img.hdf5.XmlIoHdf5ImageLoader.fromXml(XmlIoHdf5ImageLoader.java:49)
	at mpicbg.spim.data.generic.sequence.XmlIoAbstractSequenceDescription.fromXml(XmlIoAbstractSequenceDescription.java:111)
	at mpicbg.spim.data.generic.XmlIoAbstractSpimData.fromXml(XmlIoAbstractSpimData.java:153)
	at net.preibisch.mvrecon.fiji.spimdata.XmlIoSpimData2.fromXml(XmlIoSpimData2.java:164)
	at net.preibisch.mvrecon.fiji.spimdata.XmlIoSpimData2.fromXml(XmlIoSpimData2.java:52)
	at mpicbg.spim.data.generic.XmlIoAbstractSpimData.load(XmlIoAbstractSpimData.java:95)
	at net.preibisch.bigstitcher.spark.AffineFusion.lambda$call$c48314ca$1(AffineFusion.java:208)
	at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1(JavaRDDLike.scala:352)
	at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1$adapted(JavaRDDLike.scala:352)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1012)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1012)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
22/01/04 11:21:37 ERROR Inbox: An error happened while processing message in the inbox for LocalSchedulerBackendEndpoint
java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:717)
	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
	at org.apache.spark.scheduler.TaskResultGetter.enqueueFailedTask(TaskResultGetter.scala:137)
	at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:817)
	at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:791)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:71)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Exception in thread "dispatcher-event-loop-30" java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:717)
	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
	at java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1025)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```
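As an aside (a general diagnostic, not something we have confirmed is the root cause here): `java.lang.OutOfMemoryError: unable to create new native thread` usually points at an OS-level thread/process limit rather than heap exhaustion, so the per-user limits can be worth checking, e.g.:

```bash
# General diagnostics; the assumption is that a thread/process limit, not the heap, is being hit.
ulimit -u                          # max user processes/threads allowed for this shell
cat /proc/sys/kernel/threads-max   # system-wide thread limit
ps -eLf | wc -l                    # rough count of threads currently alive on the machine
```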

XML file: AMC_cy7_test_bdv.zip

StephanPreibisch commented 2 years ago

Hi, I think it's because you load from HDF5, which we haven't fixed yet ... soon it'll work :)

dpshepherd commented 2 years ago

No problem, just giving the information that Eric asked for. Happy to try again when appropriate.

Thanks!

trautmane commented 2 years ago

Hi @dpshepherd - I pulled in changes from @tpietzsch that hopefully fix this problem for you. Can you build a 0.0.2-SNAPSHOT fatjar (-P fatjar) from the main branch, remove the -XX:ActiveProcessorCount=1 hack, and re-run? Let us know how it goes. Thanks!
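Roughly something like this (paths and jar name are assumptions based on your earlier commands; adjust as needed):

```bash
# Sketch of the rebuild + re-run; the -P fatjar profile builds the self-contained jar.
cd ~/Documents/github/BigStitcher-Spark
git pull origin main
mvn clean package -P fatjar

spark-submit --master local[32,8] \
  --conf spark.driver.memory=100G \
  --class net.preibisch.bigstitcher.spark.AffineFusion \
  target/BigStitcher-Spark-0.0.2-SNAPSHOT.jar \
  -x /mnt/opm2/20210924b/deskew_flatfield_output/bdv/AMC_cy7_test_bdv.xml \
  -o /mnt/opm2/20210924b/n5/output.n5 \
  -d /DAPI/s0 --channelId 0 --UINT16 --minIntensity 0 --maxIntensity 65535
```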

dpshepherd commented 2 years ago

Hi @trautmane - we are still getting errors.

Call to BigStitcher-Spark 0.0.2

```bash
(base) dps@qi2labserver:~/Documents/github/BigStitcher-Spark$ spark-submit --master local[32,8] \
  --conf spark.driver.memory=100G \
  --class net.preibisch.bigstitcher.spark.AffineFusion \
  ~/Documents/github/BigStitcher-Spark/target/BigStitcher-Spark-0.0.2-SNAPSHOT.jar \
  -x /mnt/opm2/20210924b/deskew_flatfield_output/bdv/AMC_cy7_test_bdv.xml \
  -o /mnt/opm2/20210924b/n5/output2.n5 \
  -d /DAPI/s0 --channelId 0 --UINT16 --minIntensity 0 --maxIntensity 65535
```
Startup and initial errors (I ctrl+c'ed to grab this error; it appears to repeat for every task, which I don't want to paste here).

```
22/01/13 10:44:59 WARN Utils: Your hostname, qi2labserver resolves to a loopback address: 127.0.1.1; using 10.206.25.77 instead (on interface enp5s0f0)
22/01/13 10:44:59 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
[-x, /mnt/opm2/20210924b/deskew_flatfield_output/bdv/AMC_cy7_test_bdv.xml, -o, /mnt/opm2/20210924b/n5/output2.n5, -d, /DAPI/s0, --channelId, 0, --UINT16, --minIntensity, 0, --maxIntensity, 65535]
log4j:WARN No appenders could be found for logger (net.preibisch.bigstitcher.spark.util.Spark).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Parsing selected angle ids ... all
Parsing selected channel ids ... [0]
Parsing selected illumination ids ... all
Parsing selected tile ids ... all
Parsing selected timepoint ids ... all
Following ViewIds will be fused: [0,0] [0,1] [0,2] [0,3] [0,4] [0,5] [0,6] [0,7] [0,8] [0,9] [0,10] [0,11] [0,12] [0,13] [0,14] [0,15] [0,16] [0,17] [0,18] [0,19] [0,20] [0,21] [0,22] [0,23] [0,24] [0,25] [0,26] [0,27] [0,28] [0,29] [0,30] [0,31] [0,32] [0,33] [0,34] [0,35] [0,36] [0,37] [0,38] [0,39] [0,40] [0,41] [0,42] [0,43] [0,44] [0,45] [0,46] [0,47] [0,48] [0,49] [0,50] [0,51] [0,52] [0,53] [0,54] [0,55] [0,56] [0,57] [0,58] [0,59] [0,60] [0,61] [0,62] [0,63] [0,64] [0,65] [0,66] [0,67] [0,68] [0,69] [0,70] [0,71] [0,72] [0,73] [0,74] [0,75] [0,76] [0,77] [0,78] [0,79] [0,80] [0,81] [0,82] [0,83] [0,84] [0,85]
Fusing: All Views: [-1428, -75163, -110621] -> [63284, -58029, -109544], dimensions (64713, 17135, 1078) with blocksize (128, 128, 128)
Fusing to UINT16, min intensity = 0.0, max intensity = 65535.0
numBlocks = 610236
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/01/13 10:45:02 INFO SparkContext: Running Spark version 3.2.0
22/01/13 10:45:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/01/13 10:45:02 INFO ResourceUtils: ==============================================================
22/01/13 10:45:02 INFO ResourceUtils: No custom resources configured for spark.driver.
22/01/13 10:45:02 INFO ResourceUtils: ==============================================================
22/01/13 10:45:02 INFO SparkContext: Submitted application: AffineFusion
22/01/13 10:45:02 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
22/01/13 10:45:02 INFO ResourceProfile: Limiting resource is cpu
22/01/13 10:45:02 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/01/13 10:45:02 INFO SecurityManager: Changing view acls to: dps
22/01/13 10:45:02 INFO SecurityManager: Changing modify acls to: dps
22/01/13 10:45:02 INFO SecurityManager: Changing view acls groups to:
22/01/13 10:45:02 INFO SecurityManager: Changing modify acls groups to:
22/01/13 10:45:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(dps); groups with view permissions: Set(); users with modify permissions: Set(dps); groups with modify permissions: Set()
22/01/13 10:45:02 INFO Utils: Successfully started service 'sparkDriver' on port 41523.
22/01/13 10:45:02 INFO SparkEnv: Registering MapOutputTracker
22/01/13 10:45:02 INFO SparkEnv: Registering BlockManagerMaster
22/01/13 10:45:02 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/01/13 10:45:02 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/01/13 10:45:02 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/01/13 10:45:03 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-15bbea53-b9af-45ff-a7ca-3a0ac11a942d
22/01/13 10:45:03 INFO MemoryStore: MemoryStore started with capacity 53.2 GiB
22/01/13 10:45:03 INFO SparkEnv: Registering OutputCommitCoordinator
22/01/13 10:45:03 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/01/13 10:45:03 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://qi2labserver.dhcp.asu.edu:4040
22/01/13 10:45:03 INFO SparkContext: Added JAR file:/home/dps/Documents/github/BigStitcher-Spark/target/BigStitcher-Spark-0.0.2-SNAPSHOT.jar at spark://qi2labserver.dhcp.asu.edu:41523/jars/BigStitcher-Spark-0.0.2-SNAPSHOT.jar with timestamp 1642095902077
22/01/13 10:45:03 INFO Executor: Starting executor ID driver on host qi2labserver.dhcp.asu.edu
22/01/13 10:45:03 INFO Executor: Fetching spark://qi2labserver.dhcp.asu.edu:41523/jars/BigStitcher-Spark-0.0.2-SNAPSHOT.jar with timestamp 1642095902077
22/01/13 10:45:03 INFO TransportClientFactory: Successfully created connection to qi2labserver.dhcp.asu.edu/10.206.25.77:41523 after 53 ms (0 ms spent in bootstraps)
22/01/13 10:45:03 INFO Utils: Fetching spark://qi2labserver.dhcp.asu.edu:41523/jars/BigStitcher-Spark-0.0.2-SNAPSHOT.jar to /tmp/spark-b0cb6f58-4ebc-437f-b08d-a960bc184a05/userFiles-c78f3c8d-d749-47b5-8dfa-870cafe413d3/fetchFileTemp1734594664953306082.tmp
22/01/13 10:45:04 INFO Executor: Adding file:/tmp/spark-b0cb6f58-4ebc-437f-b08d-a960bc184a05/userFiles-c78f3c8d-d749-47b5-8dfa-870cafe413d3/BigStitcher-Spark-0.0.2-SNAPSHOT.jar to class loader
22/01/13 10:45:04 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42821.
22/01/13 10:45:04 INFO NettyBlockTransferService: Server created on qi2labserver.dhcp.asu.edu:42821
22/01/13 10:45:04 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/01/13 10:45:04 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, qi2labserver.dhcp.asu.edu, 42821, None)
22/01/13 10:45:04 INFO BlockManagerMasterEndpoint: Registering block manager qi2labserver.dhcp.asu.edu:42821 with 53.2 GiB RAM, BlockManagerId(driver, qi2labserver.dhcp.asu.edu, 42821, None)
22/01/13 10:45:04 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, qi2labserver.dhcp.asu.edu, 42821, None)
22/01/13 10:45:04 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, qi2labserver.dhcp.asu.edu, 42821, None)
22/01/13 10:45:08 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 8)
java.lang.NullPointerException
	at net.preibisch.bigstitcher.spark.util.ViewUtil.getTransformedBoundingBox(ViewUtil.java:32)
	at net.preibisch.bigstitcher.spark.AffineFusion.lambda$call$c48314ca$1(AffineFusion.java:224)
	at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1(JavaRDDLike.scala:352)
	at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1$adapted(JavaRDDLike.scala:352)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1012)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1012)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
22/01/13 10:45:08 ERROR Executor: Exception in task 18.0 in stage 0.0 (TID 18)
java.lang.NullPointerException
	at net.preibisch.bigstitcher.spark.util.ViewUtil.getTransformedBoundingBox(ViewUtil.java:32)
	at net.preibisch.bigstitcher.spark.AffineFusion.lambda$call$c48314ca$1(AffineFusion.java:224)
	at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1(JavaRDDLike.scala:352)
	at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1$adapted(JavaRDDLike.scala:352)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1012)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1012)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

The fusion then starts, runs for a while, throws another error, and then quietly exits.

Final error and quiet exit.

```
22/01/13 10:22:35 ERROR Inbox: Ignoring error
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.executor.Executor$TaskRunner@1ddf4f79 rejected from java.util.concurrent.ThreadPoolExecutor@1ad52726[Shutting down, pool size = 32, active threads = 32, queued tasks = 0, completed tasks = 39]
	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
	at org.apache.spark.executor.Executor.launchTask(Executor.scala:270)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
	at org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1(DAGScheduler.scala:1115)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1$adapted(DAGScheduler.scala:1113)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
	at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:1113)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2615)
	at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
	at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2515)
	at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:2086)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1442)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:2086)
	at org.apache.spark.SparkContext.$anonfun$new$38(SparkContext.scala:667)
	at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
	at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
	at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
	at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:898)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2254)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2279)
	at org.apache.spark.rdd.RDD.$anonfun$foreach$1(RDD.scala:1012)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.foreach(RDD.scala:1010)
	at org.apache.spark.api.java.JavaRDDLike.foreach(JavaRDDLike.scala:352)
	at org.apache.spark.api.java.JavaRDDLike.foreach$(JavaRDDLike.scala:351)
	at org.apache.spark.api.java.AbstractJavaRDDLike.foreach(JavaRDDLike.scala:45)
	at net.preibisch.bigstitcher.spark.AffineFusion.call(AffineFusion.java:204)
	at net.preibisch.bigstitcher.spark.AffineFusion.call(AffineFusion.java:39)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1853)
	at picocli.CommandLine.access$1100(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2255)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2249)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2213)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2080)
	at picocli.CommandLine.execute(CommandLine.java:1978)
	at net.preibisch.bigstitcher.spark.AffineFusion.main(AffineFusion.java:283)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
(1.0, 1.0, 1.0)
testing scale: 3d-affine: (1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0) (1.0, 1.0, 1.0) valid: true bestScaling: 1.0
testing scale: 3d-affine: (8.0, 0.0, 0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0) (8.0, 8.0, 4.0) valid: false bestScaling: 1.0
testing scale: 3d-affine: (16.0, 0.0, 0.0, 0.0, 0.0, 16.0, 0.0, 0.0, 0.0, 0.0, 8.0, 0.0) (16.0, 16.0, 8.0) valid: false bestScaling: 1.0
Choosing resolution level: 1.0 x 1.0 x 1.0
View tpId=0 setupId=18 is currently scaled by: (1, 1, 1)
(Thu Jan 13 10:22:35 MST 2022): Requesting Img from ImgLoader (tp=0, setup=18), using level=0, [1.0 x 1.0 x 1.0]
(1.0, 1.0, 1.0)
testing scale: 3d-affine: (1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0) (1.0, 1.0, 1.0) valid: true bestScaling: 1.0
testing scale: 3d-affine: (8.0, 0.0, 0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0) (8.0, 8.0, 4.0) valid: false bestScaling: 1.0
testing scale: 3d-affine: (16.0, 0.0, 0.0, 0.0, 0.0, 16.0, 0.0, 0.0, 0.0, 0.0, 8.0, 0.0) (16.0, 16.0, 8.0) valid: false bestScaling: 1.0
Choosing resolution level: 1.0 x 1.0 x 1.0
View tpId=0 setupId=19 is currently scaled by: (1, 1, 1)
(Thu Jan 13 10:22:35 MST 2022): Requesting Img from ImgLoader (tp=0, setup=19), using level=0, [1.0 x 1.0 x 1.0]
```
trautmane commented 2 years ago

Thanks @dpshepherd - can you post the log from the worker that ran task 8 (it should contain the NPE you see in the driver log)? The worker logs are usually stdout or stderr files, depending on your Spark setup.

Oops - just noticed you are running locally. I'm interested in `getSparkJobSpimData2: returning ...` in the log(s). Do you see that anywhere?
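Something like this should find it if it's there (the log file name below is just a placeholder for wherever you captured the console output; the standalone-worker path is also an assumption about your setup):

```bash
# For a local-mode run, everything usually lands in the console output you redirected to a file:
grep -n "getSparkJobSpimData2: returning" fusion.log

# For a standalone cluster, the per-task logs typically live under the worker's work directory:
grep -rn "getSparkJobSpimData2: returning" "$SPARK_HOME"/work/*/*/stdout "$SPARK_HOME"/work/*/*/stderr
```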

trautmane commented 2 years ago

I just added some friendlier error checking in the area where you are seeing NPEs: https://github.com/PreibischLab/BigStitcher-Spark/commit/3e0e6753b037a6eefe16d18634efd1850499d63e

Can you rebuild with the latest code, recreate the issue, and post/share the logs? Thanks!

StephanPreibisch commented 2 years ago

Hi @dpshepherd, could you please send us the XML you are using? I suspect this might be happening because of missing views ...

dpshepherd commented 2 years ago

It’s uploaded in the original post for this issue above.

I started the fusion again with logging and it hasn’t crashed yet. It did give the same errors in the beginning as I posted before.

StephanPreibisch commented 2 years ago

Thanks! Ok, then we can rule out the missing-views issue ... will check with @trautmane in more detail what's going on ...

dpshepherd commented 2 years ago

Here is the log for the fusion, which completed. There do appear to be errors in it.

A spot check of the output suggests that it worked. Since the result is so large, I started creating a downsampled pyramid for viewing using the n5-spark tools. local-1642170932286.zip

trautmane commented 2 years ago

Thanks @dpshepherd - your latest log shows that for many (maybe all?) of your 32 tasks, the first two attempts to load a setupImgLoader for setupId 0, timePointId 0 fail.

[Screenshot (Screen Shot 2022-01-16 at 4 45 29 PM): log excerpt showing the two failed setupImgLoader load attempts]

The third attempt then succeeds and processing continues.

[Screenshot (Screen Shot 2022-01-16 at 4 47 47 PM): log excerpt showing the third attempt succeeding]

I'll talk with @StephanPreibisch to get a better idea of what might be causing this. Thanks for your patience ...

dpshepherd commented 2 years ago

@trautmane - no problem!

Let me know if there is anything we can do on our side. The drive the data is stored on is network-mounted over a fiber connection - I wonder if this has anything to do with it? We've never seen a problem like this before on that drive, either with parallel processing in Python on Zarr stores or with BigStitcher (both through the Fiji GUI and headless).

I unfortunately have no experience developing in Java and this is the first time we have used Spark.

dpshepherd commented 2 years ago

Ok, I can verify there are missing chunks (black boxes):

[Screenshot: fused output with missing (black) blocks]

trautmane commented 2 years ago

Hi @dpshepherd - I was able to recreate the latest setupImgLoader issue you encountered, which allowed me to experiment with a couple of fix options. This commit fixed the issue for me. Can you rebuild with the latest code and let us know how it goes?

Thanks!

dpshepherd commented 2 years ago

I rebuilt and started a run with logging. Peeking at the log, I don't see the large number of errors at initialization anymore. Will update once it finishes.

dpshepherd commented 2 years ago

Hi @trautmane - this run finished without any errors in 18 hours! This is a massive speed-up over BigStitcher: the last time we tried to stitch this data on the same machine, it was going to take ~336 hours (14 days). One observation is that RAM usage climbed to a maximum of ~250 GB. Is this expected? It is not a problem for us, because our server has a lot of RAM, but it may be important for others.
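If the peak simply scales with the 32 concurrent local tasks each holding its own blocks and cached image data (our guess, not something we've verified), I assume we could trade some speed for memory by running with fewer local cores, roughly like this:

```bash
# Hedged sketch (our assumption): fewer concurrent local tasks -> lower peak RAM, at some cost in runtime.
spark-submit --master local[16,8] \
  --conf spark.driver.memory=60G \
  --class net.preibisch.bigstitcher.spark.AffineFusion \
  target/BigStitcher-Spark-0.0.2-SNAPSHOT.jar \
  -x /mnt/opm2/20210924b/deskew_flatfield_output/bdv/AMC_cy7_test_bdv.xml \
  -o /mnt/opm2/20210924b/n5/output2.n5 \
  -d /DAPI/s0 --channelId 0 --UINT16 --minIntensity 0 --maxIntensity 65535
```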

I am creating the downsampled pyramid for viewing and will post a screenshot to verify there are no missing blocks then.

Thanks!!!

StephanPreibisch commented 2 years ago

Awesome, thanks so much @trautmane and @dpshepherd for pushing through :)

dpshepherd commented 2 years ago

hi @StephanPreibisch and @trautmane

The fusion appears successful after downsampling with the n5-spark n5-scale-pyramid.py script. We do get some file I/O errors when running the n5-spark code, but can't find any missing blocks. This is one channel (nuclei, DAPI) of a roughly 100 um thick brain biopsy that is >1 cm long on its long axis, imaged with our Snouty-OPM. We'll start fusing the other three channels now.

Exciting and thanks again!!

[Screenshot: downsampled view of the fused DAPI channel, with no missing blocks visible]