JaneliaSciComp / BigStitcher-Spark

Running compute-intense parts of BigStitcher distributed
BSD 2-Clause "Simplified" License

could not find XmlIoBasicImgLoader implementation for format bdv.n5 #2

Closed boazmohar closed 2 years ago

boazmohar commented 2 years ago

Hi @StephanPreibisch,

Tried running this on the Janelia cluster and got this error.

mpicbg.spim.data.SpimDataInstantiationException: could not find XmlIoBasicImgLoader implementation for format bdv.n5

I am using the latest spark-janelia from here

This is how I built the repo and ran it:

```bash
[login1 - moharb@e05u15]~>git clone https://github.com/PreibischLab/BigStitcher-Spark.git
Cloning into 'BigStitcher-Spark'...
remote: Enumerating objects: 181, done.
remote: Counting objects: 100% (181/181), done.
remote: Compressing objects: 100% (104/104), done.
remote: Total 181 (delta 69), reused 108 (delta 22), pack-reused 0
Receiving objects: 100% (181/181), 35.31 KiB | 2.35 MiB/s, done.
Resolving deltas: 100% (69/69), done.
[login1 - moharb@e05u15]~>cd BigStitcher-Spark/
[login1 - moharb@e05u15]~/BigStitcher-Spark>~/apache-maven-3.8.4/bin/mvn clean package
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------< net.preibisch:BigStitcher-Spark >-------------------
[INFO] Building BigStitcher Spark 0.0.1-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- maven-clean-plugin:3.1.0:clean (default-clean) @ BigStitcher-Spark ---
[INFO]
[INFO] --- maven-enforcer-plugin:3.0.0-M3:enforce (enforce-rules) @ BigStitcher-Spark ---
[INFO] Adding ignore: module-info
[INFO] Adding ignore: META-INF/versions/*/module-info
[INFO] Adding ignore: com.esotericsoftware.kryo.*
[INFO] Adding ignore: com.esotericsoftware.minlog.*
[INFO] Adding ignore: com.esotericsoftware.reflectasm.*
[INFO] Adding ignore: com.google.inject.*
[INFO] Adding ignore: jnr.ffi.*
[INFO] Adding ignore: org.apache.hadoop.yarn.*.package-info
[INFO] Adding ignore: org.apache.spark.unused.UnusedStubClass
[INFO] Adding ignore: org.hibernate.stat.ConcurrentStatisticsImpl
[INFO] Adding ignore: org.jetbrains.kotlin.daemon.common.*
[INFO] Adding ignore: org.junit.runner.Runner
[INFO] Adding ignore: module-info
[INFO] Adding ignore: module-info
[INFO]
[INFO] --- build-helper-maven-plugin:3.0.0:regex-property (sanitize-version) @ BigStitcher-Spark ---
[INFO]
[INFO] --- build-helper-maven-plugin:3.0.0:regex-property (guess-package) @ BigStitcher-Spark ---
[INFO]
[INFO] --- buildnumber-maven-plugin:1.4:create (default) @ BigStitcher-Spark ---
[INFO] Executing: /bin/sh -c cd '/groups/spruston/home/moharb/BigStitcher-Spark' && 'git' 'rev-parse' '--verify' 'HEAD'
[INFO] Working directory: /groups/spruston/home/moharb/BigStitcher-Spark
[INFO] Storing buildNumber: e2b676364526588195f16931e998a7a756ca778b at timestamp: 1640470297294
[INFO] Storing buildScmBranch: main
[INFO]
[INFO] --- scijava-maven-plugin:2.0.0:set-rootdir (set-rootdir) @ BigStitcher-Spark ---
[INFO] Setting rootdir: /groups/spruston/home/moharb/BigStitcher-Spark
[INFO]
[INFO] --- jacoco-maven-plugin:0.8.6:prepare-agent (jacoco-initialize) @ BigStitcher-Spark ---
[WARNING] The artifact xml-apis:xml-apis:jar:2.0.2 has been relocated to xml-apis:xml-apis:jar:1.0.b2
[INFO] argLine set to -javaagent:/groups/spruston/home/moharb/.m2/repository/org/jacoco/org.jacoco.agent/0.8.6/org.jacoco.agent-0.8.6-runtime.jar=destfile=/groups/spruston/home/moharb/BigStitcher-Spark/target/jacoco.exec
[INFO]
[INFO] --- maven-resources-plugin:3.1.0:resources (default-resources) @ BigStitcher-Spark ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /groups/spruston/home/moharb/BigStitcher-Spark/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.8.1:compile (default-compile) @ BigStitcher-Spark ---
[INFO] Compiling 6 source files to /groups/spruston/home/moharb/BigStitcher-Spark/target/classes
[INFO]
[INFO] --- maven-resources-plugin:3.1.0:testResources (default-testResources) @ BigStitcher-Spark ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /groups/spruston/home/moharb/BigStitcher-Spark/src/test/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.8.1:testCompile (default-testCompile) @ BigStitcher-Spark ---
[INFO] No sources to compile
[INFO]
[INFO] --- maven-surefire-plugin:2.22.2:test (default-test) @ BigStitcher-Spark ---
[INFO] No tests to run.
[INFO]
[INFO] --- maven-jar-plugin:3.2.0:jar (default-jar) @ BigStitcher-Spark ---
[INFO] Building jar: /groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark.jar
[INFO]
[INFO] >>> maven-source-plugin:3.2.1:jar (attach-sources-jar) > generate-sources @ BigStitcher-Spark >>>
[INFO]
[INFO] --- maven-enforcer-plugin:3.0.0-M3:enforce (enforce-rules) @ BigStitcher-Spark ---
[INFO] Adding ignore: module-info
[INFO] Adding ignore: META-INF/versions/*/module-info
[INFO] Adding ignore: com.esotericsoftware.kryo.*
[INFO] Adding ignore: com.esotericsoftware.minlog.*
[INFO] Adding ignore: com.esotericsoftware.reflectasm.*
[INFO] Adding ignore: com.google.inject.*
[INFO] Adding ignore: jnr.ffi.*
[INFO] Adding ignore: org.apache.hadoop.yarn.*.package-info
[INFO] Adding ignore: org.apache.spark.unused.UnusedStubClass
[INFO] Adding ignore: org.hibernate.stat.ConcurrentStatisticsImpl
[INFO] Adding ignore: org.jetbrains.kotlin.daemon.common.*
[INFO] Adding ignore: org.junit.runner.Runner
[INFO] Adding ignore: module-info
[INFO] Adding ignore: module-info
[INFO]
[INFO] --- build-helper-maven-plugin:3.0.0:regex-property (sanitize-version) @ BigStitcher-Spark ---
[INFO]
[INFO] --- build-helper-maven-plugin:3.0.0:regex-property (guess-package) @ BigStitcher-Spark ---
[INFO]
[INFO] --- buildnumber-maven-plugin:1.4:create (default) @ BigStitcher-Spark ---
[INFO]
[INFO] --- scijava-maven-plugin:2.0.0:set-rootdir (set-rootdir) @ BigStitcher-Spark ---
[INFO]
[INFO] --- jacoco-maven-plugin:0.8.6:prepare-agent (jacoco-initialize) @ BigStitcher-Spark ---
[INFO] argLine set to -javaagent:/groups/spruston/home/moharb/.m2/repository/org/jacoco/org.jacoco.agent/0.8.6/org.jacoco.agent-0.8.6-runtime.jar=destfile=/groups/spruston/home/moharb/BigStitcher-Spark/target/jacoco.exec
[INFO]
[INFO] <<< maven-source-plugin:3.2.1:jar (attach-sources-jar) < generate-sources @ BigStitcher-Spark <<<
[INFO]
[INFO]
[INFO] --- maven-source-plugin:3.2.1:jar (attach-sources-jar) @ BigStitcher-Spark ---
[INFO] Building jar: /groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark-sources.jar
[INFO]
[INFO] --- jacoco-maven-plugin:0.8.6:report (jacoco-site) @ BigStitcher-Spark ---
[INFO] Skipping JaCoCo execution due to missing execution data file.
[INFO]
[INFO] --- maven-assembly-plugin:3.1.1:single (make-assembly) @ BigStitcher-Spark ---
[INFO] Building jar: /groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark-jar-with-dependencies.jar
[INFO]
[INFO] --- maven-jar-plugin:3.2.0:test-jar (default) @ BigStitcher-Spark ---
[INFO] Skipping packaging of the test-jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 32.978 s
[INFO] Finished at: 2021-12-25T17:12:02-05:00
[INFO] ------------------------------------------------------------------------
[login1 - moharb@e05u15]~/BigStitcher-Spark>TERMINATE=1 RUNTIME=8:00 TMPDIR=~/tmp ~/spark-janelia/flintstone.sh 3 \
> ~/BigStitcher-Spark/target/BigStitcher-Spark-jar-with-dependencies.jar \
> net.preibisch.bigstitcher.spark.AffineFusion \
> -x '/nrs/svoboda/moharb/dataset.xml' \
> -o '/nrs/svoboda/moharb/output3.n5' \
> -d '/GFP/s0' \
> --channelId 0 \
> --UINT8 \
> --minIntensity 1 \
> --maxIntensity 254
On e05u15.int.janelia.org with Python 3.6.10 :: Anaconda, Inc., running:
/groups/spruston/home/moharb/spark-janelia/spark-janelia --consolidate_logs --nnodes=3 --gb_per_slot=15 --driverslots=32 --worker_slots=32 --minworkers=1 --hard_runtime=8:00 --submitargs=" --verbose --conf spark.default.parallelism=270 --conf spark.executor.instances=6 --conf spark.executor.cores=5 --conf spark.executor.memory=75g --class net.preibisch.bigstitcher.spark.AffineFusion /groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark-jar-with-dependencies.jar -x /nrs/svoboda/moharb/dataset.xml -o /nrs/svoboda/moharb/output3.n5 -d /GFP/s0 --channelId 0 --UINT8 --minIntensity 1 --maxIntensity 254" generate-and-launch-run
Created:
  /groups/spruston/home/moharb/.spark/20211225_171249/conf
  /groups/spruston/home/moharb/.spark/20211225_171249/logs
  /groups/spruston/home/moharb/.spark/20211225_171249/scripts
Running:
  /groups/spruston/home/moharb/.spark/20211225_171249/scripts/00-queue-lsf-jobs.sh
Sat Dec 25 17:12:49 EST 2021 [e05u15.int.janelia.org] submitting jobs to scheduler
This job will be billed to svoboda
Job <114095184> is submitted to default queue .
This job will be billed to svoboda
Job <114095185> is submitted to default queue .
This job will be billed to svoboda
Job <114095186> is submitted to default queue .
This job will be billed to svoboda
Job <114095189> is submitted to default queue .
This job will be billed to svoboda
Job <114095190> is submitted to default queue .
Queued jobs are:
JOBID     USER    STAT  QUEUE  FROM_HOST  EXEC_HOST  JOB_NAME    SUBMIT_TIME
114095184 moharb  PEND  local  e05u15                *171249_ma  Dec 25 17:12
114095185 moharb  PEND  short  e05u15                *171249_ur  Dec 25 17:12
114095186 moharb  PEND  local  e05u15                *249_wo[1]  Dec 25 17:12
114095186 moharb  PEND  local  e05u15                *249_wo[2]  Dec 25 17:12
114095186 moharb  PEND  local  e05u15                *249_wo[3]  Dec 25 17:12
114095189 moharb  PEND  local  e05u15                *171249_dr  Dec 25 17:12
114095190 moharb  PEND  short  e05u15                *171249_sd  Dec 25 17:12
To get web user interface URL after master has started, run:
grep "Bound MasterWebUI to" /groups/spruston/home/moharb/.spark/20211225_171249/logs/01-master.log
```
These are the driver logs with the error:

```bash
Sat Dec 25 17:13:34 EST 2021 [h07u13] running /misc/local/spark-3.0.1/bin/spark-submit --deploy-mode client --master spark://10.36.107.39:7077 --verbose --conf spark.default.parallelism=270 --conf spark.executor.instances=6 --conf spark.executor.cores=5 --conf spark.executor.memory=75g --class net.preibisch.bigstitcher.spark.AffineFusion /groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark-jar-with-dependencies.jar -x /nrs/svoboda/moharb/dataset.xml -o /nrs/svoboda/moharb/output3.n5 -d /GFP/s0 --channelId 0 --UINT8 --minIntensity 1 --maxIntensity 254
Using properties file: /groups/spruston/home/moharb/.spark/20211225_171249/conf/spark-defaults.conf
Adding default property: spark.storage.blockManagerHeartBeatMs=30000
Adding default property: spark.driver.maxResultSize=0
Adding default property: spark.kryoserializer.buffer.max=1024m
Adding default property: spark.rpc.askTimeout=300s
Adding default property: spark.driver.memory=479g
Adding default property: spark.submit.deployMode=cluster
Adding default property: spark.rpc.retry.wait=30s
Adding default property: spark.core.connection.ack.wait.timeout=600s
Parsed arguments:
  master                  spark://10.36.107.39:7077
  deployMode              client
  executorMemory          75g
  executorCores           5
  totalExecutorCores      null
  propertiesFile          /groups/spruston/home/moharb/.spark/20211225_171249/conf/spark-defaults.conf
  driverMemory            479g
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            6
  files                   null
  pyFiles                 null
  archives                null
  mainClass               net.preibisch.bigstitcher.spark.AffineFusion
  primaryResource         file:/groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark-jar-with-dependencies.jar
  name                    net.preibisch.bigstitcher.spark.AffineFusion
  childArgs               [-x /nrs/svoboda/moharb/dataset.xml -o /nrs/svoboda/moharb/output3.n5 -d /GFP/s0 --channelId 0 --UINT8 --minIntensity 1 --maxIntensity 254]
  jars                    null
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true

Spark properties used, including those specified through --conf and those from the properties file /groups/spruston/home/moharb/.spark/20211225_171249/conf/spark-defaults.conf:
  (spark.default.parallelism,270)
  (spark.driver.memory,479g)
  (spark.executor.instances,6)
  (spark.executor.memory,75g)
  (spark.rpc.askTimeout,300s)
  (spark.storage.blockManagerHeartBeatMs,30000)
  (spark.kryoserializer.buffer.max,1024m)
  (spark.submit.deployMode,cluster)
  (spark.core.connection.ack.wait.timeout,600s)
  (spark.driver.maxResultSize,0)
  (spark.rpc.retry.wait,30s)
  (spark.executor.cores,5)

2021-12-25 17:13:36,473 [main] WARN [NativeCodeLoader]: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Main class:
net.preibisch.bigstitcher.spark.AffineFusion
Arguments:
-x /nrs/svoboda/moharb/dataset.xml -o /nrs/svoboda/moharb/output3.n5 -d /GFP/s0 --channelId 0 --UINT8 --minIntensity 1 --maxIntensity 254
Spark config:
(spark.storage.blockManagerHeartBeatMs,30000)
(spark.driver.maxResultSize,0)
(spark.jars,file:/groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark-jar-with-dependencies.jar)
(spark.kryoserializer.buffer.max,1024m)
(spark.app.name,net.preibisch.bigstitcher.spark.AffineFusion)
(spark.rpc.askTimeout,300s)
(spark.driver.memory,479g)
(spark.executor.instances,6)
(spark.submit.pyFiles,)
(spark.default.parallelism,270)
(spark.submit.deployMode,client)
(spark.master,spark://10.36.107.39:7077)
(spark.rpc.retry.wait,30s)
(spark.executor.memory,75g)
(spark.executor.cores,5)
(spark.core.connection.ack.wait.timeout,600s)
Classpath elements:
file:/groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark-jar-with-dependencies.jar

[-x, /nrs/svoboda/moharb/dataset.xml, -o, /nrs/svoboda/moharb/output3.n5, -d, /GFP/s0, --channelId, 0, --UINT8, --minIntensity, 1, --maxIntensity, 254]
mpicbg.spim.data.SpimDataInstantiationException: could not find XmlIoBasicImgLoader implementation for format bdv.n5
	at mpicbg.spim.data.generic.sequence.ImgLoaders.createXmlIoForFormat(ImgLoaders.java:72)
	at mpicbg.spim.data.generic.sequence.XmlIoAbstractSequenceDescription.fromXml(XmlIoAbstractSequenceDescription.java:110)
	at mpicbg.spim.data.generic.XmlIoAbstractSpimData.fromXml(XmlIoAbstractSpimData.java:153)
	at net.preibisch.mvrecon.fiji.spimdata.XmlIoSpimData2.fromXml(XmlIoSpimData2.java:164)
	at net.preibisch.mvrecon.fiji.spimdata.XmlIoSpimData2.fromXml(XmlIoSpimData2.java:52)
	at mpicbg.spim.data.generic.XmlIoAbstractSpimData.load(XmlIoAbstractSpimData.java:95)
	at net.preibisch.bigstitcher.spark.AffineFusion.call(AffineFusion.java:94)
	at net.preibisch.bigstitcher.spark.AffineFusion.call(AffineFusion.java:39)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1853)
	at picocli.CommandLine.access$1100(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2255)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2249)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2213)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2080)
	at picocli.CommandLine.execute(CommandLine.java:1978)
	at net.preibisch.bigstitcher.spark.AffineFusion.main(AffineFusion.java:283)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2021-12-25 17:13:37,007 [Thread-1] INFO [ShutdownHookManager]: Shutdown hook called
2021-12-25 17:13:37,007 [Thread-1] INFO [ShutdownHookManager]: Deleting directory /tmp/spark-fcf876a8-e1e9-4de3-bffe-d05ee3fb0a4a
------------------------------------------------------------
Sender: LSF System
Subject: Job 114095189: in cluster Exited

Job was submitted from host by user in cluster at Sat Dec 25 17:12:49 2021
Job was executed on host(s) <32*h07u13>, in queue , as user in cluster at Sat Dec 25 17:13:31 2021
 was used as the home directory.
 was used as the working directory.
Started at Sat Dec 25 17:13:31 2021
Terminated at Sat Dec 25 17:13:37 2021
Results reported at Sat Dec 25 17:13:37 2021

Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
/groups/spruston/home/moharb/.spark/20211225_171249/scripts/04-launch-driver.sh
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:
    CPU time : 3.96 sec.
    Max Memory : 11 MB
    Average Memory : 11.00 MB
    Total Requested Memory : 491520.00 MB
    Delta Memory : 491509.00 MB
    Max Swap : -
    Max Processes : 4
    Max Threads : 5
    Run time : 6 sec.
    Turnaround time : 48 sec.

The output (if any) is above this job summary.
```

The same happens for nonRigid; am I doing something wrong?

Thanks! Boaz

StephanPreibisch commented 2 years ago

Thanks @boazmohar ... @tpietzsch we have seen this behavior before, do you remember what the reason was? Thanks :)

StephanPreibisch commented 2 years ago

not your fault @boazmohar ... this is about the annotations that are used to find ImgLoader implementations at runtime ...
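For readers outside the cluster context, the failure mode Stephan describes can be sketched in plain Java. This is a hypothetical stand-in (class and method names are made up; only the error message is taken from the log above): the real mechanism discovers annotated XmlIoBasicImgLoader implementations at runtime via an index resource packaged inside the jar, so if a fat-jar build drops or overwrites that index, the format registry stays empty and every lookup fails.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of format-string -> loader discovery. Implementations
// are registered by scanning metadata shipped inside the jar; if that
// metadata is lost (e.g. clobbered while assembling an uber-jar), the
// registry stays empty and lookup throws -- the exact symptom in this issue.
public class LoaderRegistrySketch {
    static final Map<String, Class<?>> REGISTRY = new HashMap<>();

    // Stand-in for annotation-driven discovery; the boolean simulates whether
    // the index resource survived the fat-jar packaging step.
    static void discover(boolean indexSurvivedPackaging) {
        if (indexSurvivedPackaging) {
            REGISTRY.put("bdv.n5", Object.class); // placeholder loader class
        }
    }

    static Class<?> createXmlIoForFormat(String format) {
        Class<?> c = REGISTRY.get(format);
        if (c == null) {
            throw new IllegalStateException(
                "could not find XmlIoBasicImgLoader implementation for format " + format);
        }
        return c;
    }

    public static void main(String[] args) {
        discover(false); // index lost during packaging
        try {
            createXmlIoForFormat("bdv.n5");
        } catch (IllegalStateException e) {
            System.out.println("lookup failed: " + e.getMessage());
        }
        discover(true); // index intact
        System.out.println("lookup ok: " + createXmlIoForFormat("bdv.n5").getSimpleName());
    }
}
```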

StephanPreibisch commented 2 years ago

Hi @boazmohar, I found how we solved the same problem before ... I updated the pom.xml.

Can you please build with

```
mvn clean package -P fatjar
```

and use the BigStitcher-Spark-0.0.1-SNAPSHOT.jar that is created? It should work ...

boazmohar commented 2 years ago

@StephanPreibisch it changed the location of the error; it is now on the workers running the tasks:

```
2021-12-26 10:06:29,223 [task-result-getter-2] ERROR [TaskSetManager]: Task 35 in stage 0.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 35 in stage 0.0 failed 4 times, most recent failure: Lost task 35.3 in stage 0.0 (TID 129, 10.36.107.23, executor 6): mpicbg.spim.data.SpimDataInstantiationException: could not find XmlIoBasicImgLoader implementation for format bdv.n5
	at mpicbg.spim.data.generic.sequence.ImgLoaders.createXmlIoForFormat(ImgLoaders.java:72)
	at mpicbg.spim.data.generic.sequence.XmlIoAbstractSequenceDescription.fromXml(XmlIoAbstractSequenceDescription.java:110)
```

boazmohar commented 2 years ago

My bad, there was an issue with access to my home folder, putting the .jar on nrs solved it!

StephanPreibisch commented 2 years ago

Cool, did you already try it on a bigger dataset?

boazmohar commented 2 years ago

It is failing again, same error. I submitted a help desk ticket and asked Tim Wang to try running it.

StephanPreibisch commented 2 years ago

Which same error? I thought it worked? If the same error persists we need to solve it …

boazmohar commented 2 years ago

Well, it worked on the example dataset ONCE:

```
Saved, e.g. view with './n5-view -i /nrs/svoboda/moharb/output3.n5 -d /GFP/s0'
done, took: 8220 ms.
```

Full log: 04-driver.log

But the worker stdout still has this error: stdout.txt

Here is one where the dataset is bigger and it failed. Maybe it is running out of memory and the error is unrelated?

stdout_failed.txt

StephanPreibisch commented 2 years ago

Hi, there is still a bug when creating the FATJAR ... I will need to examine what is going on
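For context, a sketch of the kind of fix such fat-jar bugs usually need (this is hypothetical and not necessarily the change made to this repo's pom.xml): when assembling an uber-jar, discovery metadata from each dependency has to be merged rather than overwritten, e.g. with maven-shade-plugin transformers. The `META-INF/json/...` resource path below is an assumption based on SciJava's annotation-index convention.

```xml
<!-- Hypothetical shade configuration: merge service registrations and append
     the annotation index used to discover XmlIoBasicImgLoader implementations,
     so the last dependency copied into the uber-jar does not clobber it. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <transformers>
      <!-- merge java.util.ServiceLoader registrations across dependencies -->
      <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
      <!-- append the annotation index file (path is an assumption) -->
      <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
        <resource>META-INF/json/mpicbg.spim.data.generic.sequence.ImgLoaderIo</resource>
      </transformer>
    </transformers>
  </configuration>
</plugin>
```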

StephanPreibisch commented 2 years ago

Hi @boazmohar, can you please pull the newest version, call `mvn clean package -P fatjar`, and try:

```
java -cp target/BigStitcher-Spark-0.0.1-SNAPSHOT.jar net.preibisch.bigstitcher.spark.TestDependencies -x '/Users/spreibi/Documents/Microscopy/Stitching/Truman/standard/dataset.xml'
```

It should load and display one of the stacks ... it works for me, meaning it can find the N5 ImgLoader.

StephanPreibisch commented 2 years ago

of course adjust the XML path and maybe try it locally on your machine ... and then on an interactive session on the cluster. If that all works, so should the Spark resaving (I think).

boazmohar commented 2 years ago

It is working locally and from NoMachine on the cluster. It is still not working on Spark. One thing I noticed is that it launches the tasks before loading the jar (at least in the log file):

```
2021-12-27 11:34:41,447 [dispatcher-Executor] INFO [ResourceUtils]: ==============================================================
2021-12-27 11:34:41,467 [dispatcher-Executor] INFO [CoarseGrainedExecutorBackend]: Successfully registered with driver
2021-12-27 11:34:41,469 [dispatcher-Executor] INFO [Executor]: Starting executor ID 0 on host 10.36.107.13
2021-12-27 11:34:41,557 [dispatcher-Executor] INFO [Utils]: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 17913.
2021-12-27 11:34:41,558 [dispatcher-Executor] INFO [NettyBlockTransferService]: Server created on 10.36.107.13:17913
2021-12-27 11:34:41,559 [dispatcher-Executor] INFO [BlockManager]: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2021-12-27 11:34:41,565 [dispatcher-Executor] INFO [BlockManagerMaster]: Registering BlockManager BlockManagerId(0, 10.36.107.13, 17913, None)
2021-12-27 11:34:41,572 [dispatcher-Executor] INFO [BlockManagerMaster]: Registered BlockManager BlockManagerId(0, 10.36.107.13, 17913, None)
2021-12-27 11:34:41,573 [dispatcher-Executor] INFO [BlockManager]: Initialized BlockManager: BlockManagerId(0, 10.36.107.13, 17913, None)
2021-12-27 11:34:41,648 [dispatcher-Executor] INFO [CoarseGrainedExecutorBackend]: Got assigned task 20
2021-12-27 11:34:41,651 [dispatcher-Executor] INFO [CoarseGrainedExecutorBackend]: Got assigned task 21
2021-12-27 11:34:41,652 [dispatcher-Executor] INFO [CoarseGrainedExecutorBackend]: Got assigned task 22
2021-12-27 11:34:41,652 [dispatcher-Executor] INFO [CoarseGrainedExecutorBackend]: Got assigned task 23
2021-12-27 11:34:41,653 [dispatcher-Executor] INFO [CoarseGrainedExecutorBackend]: Got assigned task 24
2021-12-27 11:34:41,655 [Executor task launch worker for task 21] INFO [Executor]: Running task 21.0 in stage 0.0 (TID 21)
2021-12-27 11:34:41,655 [Executor task launch worker for task 22] INFO [Executor]: Running task 22.0 in stage 0.0 (TID 22)
2021-12-27 11:34:41,655 [Executor task launch worker for task 24] INFO [Executor]: Running task 24.0 in stage 0.0 (TID 24)
2021-12-27 11:34:41,655 [Executor task launch worker for task 23] INFO [Executor]: Running task 23.0 in stage 0.0 (TID 23)
2021-12-27 11:34:41,655 [Executor task launch worker for task 20] INFO [Executor]: Running task 20.0 in stage 0.0 (TID 20)
2021-12-27 11:34:41,675 [Executor task launch worker for task 23] INFO [Executor]: Fetching spark://h07u21:35021/jars/BigStitcher-Spark-0.0.1-SNAPSHOT.jar with timestamp 1640622879317
2021-12-27 11:34:41,698 [Executor task launch worker for task 23] INFO [TransportClientFactory]: Successfully created connection to h07u21/10.36.107.31:35021 after 1 ms (0 ms spent in bootstraps)
2021-12-27 11:34:41,699 [Executor task launch worker for task 23] INFO [Utils]: Fetching spark://h07u21:35021/jars/BigStitcher-Spark-0.0.1-SNAPSHOT.jar to /scratch/moharb/spark-114599123-1/spark-bb5a2dd5-fed1-4caa-83df-fff14cc90256/executor-160b31f0-6981-45c2-b399-7b8db7e3d19a/spark-8e944ff4-3dcf-44f5-84cf-c9773c878a38/fetchFileTemp2825303330217301914.tmp
2021-12-27 11:34:42,182 [Executor task launch worker for task 23] INFO [Utils]: Copying /scratch/moharb/spark-114599123-1/spark-bb5a2dd5-fed1-4caa-83df-fff14cc90256/executor-160b31f0-6981-45c2-b399-7b8db7e3d19a/spark-8e944ff4-3dcf-44f5-84cf-c9773c878a38/-3846531471640622879317_cache to /groups/spruston/home/moharb/.spark/20211227_113327/logs/worker-1-dir/app-20211227113439-0000/0/./BigStitcher-Spark-0.0.1-SNAPSHOT.jar
2021-12-27 11:34:42,597 [Executor task launch worker for task 23] INFO [Executor]: Adding file:/groups/spruston/home/moharb/.spark/20211227_113327/logs/worker-1-dir/app-20211227113439-0000/0/./BigStitcher-Spark-0.0.1-SNAPSHOT.jar to class loader
2021-12-27 11:34:42,773 [Executor task launch worker for task 23] INFO [TorrentBroadcast]: Started reading broadcast variable 0 with 1 pieces (estimated total size 4.0 MiB)
2021-12-27 11:34:42,806 [Executor task launch worker for task 23] INFO [TransportClientFactory]: Successfully created connection to h07u21/10.36.107.31:36127 after 1 ms (0 ms spent in bootstraps)
2021-12-27 11:34:42,825 [Executor task launch worker for task 23] INFO [MemoryStore]: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.3 KiB, free 39.8 GiB)
2021-12-27 11:34:42,831 [Executor task launch worker for task 23] INFO [TorrentBroadcast]: Reading broadcast variable 0 took 57 ms
2021-12-27 11:34:42,904 [Executor task launch worker for task 23] INFO [MemoryStore]: Block broadcast_0 stored as values in memory (estimated size 4.3 KiB, free 39.8 GiB)
2021-12-27 11:34:43,551 [Executor task launch worker for task 21] ERROR [Executor]: Exception in task 21.0 in stage 0.0 (TID 21)
mpicbg.spim.data.SpimDataInstantiationException: could not find XmlIoBasicImgLoader implementation for format bdv.n5
	at mpicbg.spim.data.generic.sequence.ImgLoaders.createXmlIoForFormat(ImgLoaders.java:72)
	at mpicbg.spim.data.generic.sequence.XmlIoAbstractSequenceDescription.fromXml(XmlIoAbstractSequenceDescription.java:110)
	at mpicbg.spim.data.generic.XmlIoAbstractSpimData.fromXml(XmlIoAbstractSpimData.java:153)
	at net.preibisch.mvrecon.fiji.spimdata.XmlIoSpimData2.fromXml(XmlIoSpimData2.java:164)
	at net.preibisch.mvrecon.fiji.spimdata.XmlIoSpimData2.fromXml(XmlIoSpimData2.java:52)
```

Maybe it is related to some race condition? Maybe some of this discussion is relevant.

StephanPreibisch commented 2 years ago

@trautmane do you have any ideas? Thanks 😊

mzouink commented 2 years ago

I think it can be solved by adding this line: `ImgLoaders.registerManually(XmlIoN5ImageLoader.class);`, but it is a hack. The XmlIoN5ImageLoader should be registered automatically with the BDV loader registry.

StephanPreibisch commented 2 years ago

Could you give this a try @mzouink? That would be awesome ...

mzouink commented 2 years ago

Done. @boazmohar, I added the code to a new branch, fix_bdv_n5. Can you please try again with this branch: https://github.com/PreibischLab/BigStitcher-Spark/tree/fix_bdv_n5? If it works, then we will know the reason for the bug.

boazmohar commented 2 years ago

Sorry, it is still happening, but not in all workers: some of them are working and I see the right output, while others have the same class-loading issue as before.

[screenshots]

I think that for the smaller dataset there are enough retries to get through.

mzouink commented 2 years ago

We are making progress: now the problem is inside the jobs. Can you please pull and try again? I added a line: https://github.com/PreibischLab/BigStitcher-Spark/tree/fix_bdv_n5

boazmohar commented 2 years ago

Still failing in some workers:

[screenshot]

I also found a new error:

[screenshot]

Not sure if it is a separate issue or not. This is a big dataset.

trautmane commented 2 years ago

Hi @StephanPreibisch and @boazmohar - I'm looking at all of this for the first time now under the assumption that the problem still exists. Please let me know if that assumption is incorrect.

boazmohar commented 2 years ago

Yes, still an issue, thanks for looking into it!

trautmane commented 2 years ago

@boazmohar - I just left a question for you on slack ...

boazmohar commented 2 years ago

I can confirm this solved the issue! THANKS @trautmane. Here is the result of the non-rigid fusion on a large dataset:

[screenshots]

I still had some memory issues, but I used 5 cores per task and N_TASKS_PER_EXECUTOR_CORE=1000. BTW, it took 3.5 h with 2011 cores while allocating only 202 tasks at a time. I would be happy to hear if there is a more principled way to do this (I tried 10, 30, 100, 300, and 1000 until it stopped failing more than 4 times per task).
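For what it's worth, one rough rule of thumb (an assumption, not specific to BigStitcher-Spark; the numbers below mirror the submission settings shown earlier in this thread) is to size partitioning at a small multiple of the number of concurrently running tasks, rather than sweeping N_TASKS_PER_EXECUTOR_CORE by trial and error:

```java
// Hypothetical back-of-envelope sizing for a Spark run; all constants are
// assumptions taken from the submission settings quoted in this thread.
public class ParallelismSketch {
    public static void main(String[] args) {
        int nodes = 3;            // worker nodes
        int coresPerWorker = 30;  // executor cores available per worker
        int coresPerTask = 5;     // matches spark.executor.cores above

        // Tasks that can actually run at the same time across the cluster.
        int tasksInFlight = nodes * coresPerWorker / coresPerTask;

        // Common guidance is roughly 2-4 partitions per concurrent task,
        // so each retry/straggler costs less and memory per task shrinks.
        int parallelism = tasksInFlight * 3;

        System.out.println("concurrent tasks: " + tasksInFlight
                + ", suggested spark.default.parallelism: " + parallelism);
    }
}
```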

StephanPreibisch commented 2 years ago

Awesome, thanks @trautmane! Could you please merge this into master?

trautmane commented 2 years ago

The fix is a change to the spimdata repo and I don't have access to that, so I submitted this pull request. Someone with write access to spimdata (@tpietzsch, @StephanPreibisch, or ???) needs to review and approve my changes. Once a new version of spimdata is deployed, we can update this repo to use it.

> I still do not understand why it worked locally with Spark though, do you know?

I did not play with a local Spark version, so I can't definitively answer this - but my guess is that with the local Spark you are just more likely to get lucky and avoid concurrent access to ImgLoaders.build(). When I was testing/debugging, I needed to use multiple worker nodes to regularly encounter the issue. Some initial tests I ran with just one worker node on a small data set failed to reproduce the problem.

thanks again!!
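The check-then-act hazard described above (concurrent access to a lazily built registry like ImgLoaders.build()) and its standard fix can be illustrated with a self-contained sketch. Names here are hypothetical; this is not the actual spim_data code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// With a plain HashMap and "if (map.isEmpty()) build();", two threads can
// both see "not built yet", and one can read a half-populated map -- the
// intermittent per-worker failures reported in this thread. Wrapping the
// one-time build in ConcurrentHashMap.computeIfAbsent makes it atomic.
public class SafeLazyRegistry {
    private static final ConcurrentHashMap<String, Map<String, String>> CACHE =
            new ConcurrentHashMap<>();
    static final AtomicInteger buildCount = new AtomicInteger();

    static Map<String, String> registry() {
        // build() runs at most once, no matter how many threads race here.
        return CACHE.computeIfAbsent("registry", k -> build());
    }

    private static Map<String, String> build() {
        buildCount.incrementAndGet();
        return Map.of("bdv.n5", "XmlIoN5ImageLoader"); // placeholder entries
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 100; i++) {
            pool.submit(() -> registry().get("bdv.n5"));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("build() ran " + buildCount.get() + " time(s)");
    }
}
```

This also matches the observation that a single local worker rarely triggers the bug: with few threads, the unsafe check-then-act simply tends to win the race.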

StephanPreibisch commented 2 years ago

You should post the image and statistics on Twitter @boazmohar 😊 Thanks so much for pushing this!!

Now only the cloud version is missing @mzouink 😊

StephanPreibisch commented 2 years ago

Thanks @trautmane, makes sense ... I didn't realize on my phone that it was in another repo. @tpietzsch has to approve it and cut a new release so we can distribute it; I don't want to do that without him. It always creates this ripple-effect through many Fiji plugins ...

trautmane commented 2 years ago

Yeah - that was a concern of mine too. @tpietzsch left a TODO comment in January 2021 about thread safety but didn't fix it then. Hopefully, he just didn't have time to fix it, but it is also possible he ran into some other ripple-effect problem.

tpietzsch commented 2 years ago

> Yeah - that was a concern of mine too. @tpietzsch left a TODO comment in January 2021 about thread safety but didn't fix it then. Hopefully, he just didn't have time to fix it but it is also possible he ran into some other ripple-effect problem.

No, no ripple-effect expected... I just didn't have time to think it through and fix it...

tpietzsch commented 2 years ago

This is now merged and released in spim_data-2.2.7

trautmane commented 2 years ago

This is fixed by https://github.com/PreibischLab/BigStitcher-Spark/commit/01103177028b43b05b8e59dcf9a3f24ed9629591 .