JaneliaSciComp / BigStitcher-Spark

Running compute-intense parts of BigStitcher distributed
BSD 2-Clause "Simplified" License

Error Could not initialize class ch.systemsx.cisd.hdf5.CharacterEncoding on AffineExport on h5 file #8

Closed: boazmohar closed this issue 2 years ago

boazmohar commented 2 years ago

Hi @StephanPreibisch,

I am trying to run an AffineFusion export with Spark:

~/spark-janelia/flintstone.sh 4 \
/groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark-0.0.2-SNAPSHOT.jar \ 
net.preibisch.bigstitcher.spark.AffineFusion \
-x '/groups/mousebrainmicro/mousebrainmicro/data/Lightsheet/20210812_AG/ML_Rendering-test/aligned_data.xml' \
-o  '/nrs/svoboda/moharb/test_ML.n5' -d '/s0' 

And get this error:

2022-04-21 15:45:37,731 [task-result-getter-0] ERROR [TaskSetManager]: Task 1 in stage 0.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 78, 10.36.107.42, executor 0): java.lang.NoClassDefFoundError: Could not initialize class ch.systemsx.cisd.hdf5.CharacterEncoding
    at ch.systemsx.cisd.hdf5.HDF5BaseReader.<init>(HDF5BaseReader.java:143)
    at ch.systemsx.cisd.hdf5.HDF5BaseReader.<init>(HDF5BaseReader.java:126)
    at ch.systemsx.cisd.hdf5.HDF5ReaderConfigurator.reader(HDF5ReaderConfigurator.java:86)
    at ch.systemsx.cisd.hdf5.HDF5FactoryProvider$HDF5Factory.openForReading(HDF5FactoryProvider.java:54)
    at ch.systemsx.cisd.hdf5.HDF5Factory.openForReading(HDF5Factory.java:55)
    at bdv.img.hdf5.Hdf5ImageLoader.open(Hdf5ImageLoader.java:183)
    at bdv.img.hdf5.Hdf5ImageLoader.getSetupImgLoader(Hdf5ImageLoader.java:381)
    at bdv.img.hdf5.Hdf5ImageLoader.getSetupImgLoader(Hdf5ImageLoader.java:79)
    at net.preibisch.bigstitcher.spark.util.ViewUtil.getTransformedBoundingBox(ViewUtil.java:32)
    at net.preibisch.bigstitcher.spark.AffineFusion.lambda$call$7b7a6284$1(AffineFusion.java:268)
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1(JavaRDDLike.scala:351)
    at org.apache.spark.api.java.JavaRDDLike.$anonfun$foreach$1$adapted(JavaRDDLike.scala:351)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
    at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:986)
    at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:986)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2139)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:127)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I can open the file in Fiji and look at the data with BigStitcher without an issue. The XML is at /groups/mousebrainmicro/mousebrainmicro/data/Lightsheet/20210812_AG/ML_Rendering-test/aligned_data.xml. Any idea what to do? I found this, which might be related.

Thanks, Boaz

carshadi commented 2 years ago

Hi @boazmohar and @StephanPreibisch, I'm getting the same error on a SLURM cluster running a standalone Spark cluster.

java info:

openjdk version "1.8.0_332"
OpenJDK Runtime Environment (Zulu 8.62.0.19-CA-linux64) (build 1.8.0_332-b09)
OpenJDK 64-Bit Server VM (Zulu 8.62.0.19-CA-linux64) (build 25.332-b09, mixed mode)

mvn:

Maven home: /home/cameron.arshadi/opt/apache-maven-3.8.5
Java version: 1.8.0_332, vendor: Azul Systems, Inc., runtime: /allen/scratch/aindtemp/cameron.arshadi/tools/jvm/zulu8.62.0.19-ca-jdk8.0.332-linux_x64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-1160.15.2.el7.x86_64", arch: "amd64", family: "unix"

submit command:

spark-submit --master ${MASTER_URL} \
             --total-executor-cores $((SLURM_NTASKS * SLURM_CPUS_PER_TASK)) \
             --class net.preibisch.bigstitcher.spark.AffineFusion \
             --deploy-mode client \
             --verbose \
             --conf spark.executor.instances=${SLURM_NTASKS_PER_NODE} \
             --conf spark.executor.cores=${SLURM_CPUS_PER_TASK} \
             --conf spark.executor.memory=${SPARK_MEM} \
             --conf spark.default.parallelism=${PARALLELISM} \
             /allen/scratch/aindtemp/cameron.arshadi/tools/jars/BigStitcher-Spark-0.0.2-SNAPSHOT.jar \
             -x "/allen/scratch/aindtemp/data/anatomy/exm-hemi-brain/aligned_data.xml" \
             -o "/allen/scratch/aindtemp/data/anatomy/exm-hemi-brain-fused.n5" \
             -d "/ch0/s0" \
             --blockSize "256,256,256" \
             --preserveAnisotropy \
             --UINT16 \
             --minIntensity 0.0 \
             --maxIntensity 65535.0 \
             --channelId 0

Using spark-3.2.1

This didn't happen when running locally with --master local[32]

StephanPreibisch commented 2 years ago

@trautmane, do you have some time to look at that?

StephanPreibisch commented 2 years ago

@mkitti, that sounds familiar; did we discuss this? Is the problem that HDF5 creates a local tmp directory?

mkitti commented 2 years ago

I will check the pom this afternoon.

StephanPreibisch commented 2 years ago

thanks @mkitti!

mkitti commented 2 years ago

ch.systemsx.cisd.hdf5.CharacterEncoding definitely does exist: https://sissource.ethz.ch/sispub/jhdf5/-/blob/master/source/java/ch/systemsx/cisd/hdf5/CharacterEncoding.java

mkitti commented 2 years ago

The reported line number is slightly off. CharacterEncoding should be on line 141 https://sissource.ethz.ch/sispub/jhdf5/-/blob/master/source/java/ch/systemsx/cisd/hdf5/HDF5BaseReader.java#L141

mkitti commented 2 years ago

We may need to take a close look at your classpaths. Also, are either of you running on Debian or Ubuntu? Is it possible that you have an old version of the libsis-jhdf5-java Debian package installed and present on your default classpath?

mkitti commented 2 years ago

The current pom actually imports jhdf5 14.12.6. The above source links are for 19.04.

mkitti commented 2 years ago

Line 143 lines up with the older jhdf5 source:

https://svnsis.ethz.ch/repos/cisd/jhdf5/trunk/source/java/ch/systemsx/cisd/hdf5/HDF5BaseReader.java
https://svnsis.ethz.ch/repos/cisd/jhdf5/tags/release/14.12.x/14.12.6/jhdf5/source/java/ch/systemsx/cisd/hdf5/HDF5BaseReader.java

        this.encodingForNewDataSets =
                useUTF8CharEncoding ? CharacterEncoding.UTF8 : CharacterEncoding.ASCII;
carshadi commented 2 years ago

Hi @mkitti ,

echo $CLASSPATH returns an empty string for me.

cat /etc/os-release

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

ldconfig -p | grep libsis-jhdf5-java returns nothing on the cluster login node

boazmohar commented 2 years ago

This is @mkitti with @boazmohar: the problem, as @trautmane found before, is that libjhdf5.so gets extracted to a common temporary directory when parallel jobs are run. Multiple workers may try to extract the native shared library to the same directory at the same time, which causes the failure.

Per https://unlimited.ethz.ch/display/JHDF/JHDF5+FAQ#JHDF5FAQ-Whataretheoptionstoprovidethenativelibraries? we can provide a JVM option to point Java to a pre-extracted location of the file.

In @boazmohar's case, we prepended SUBMIT_ARGS="--conf spark.executor.extraJavaOptions=-Dnative.libpath.jhdf5=/groups/spruston/home/moharb/libjhdf5.so" to the launch command, which fixed the issue.

We extracted libjhdf5.so from native/jhdf5/amd64-Linux inside the jhdf5 JAR, which you can open as a zip archive.
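
For reference, a minimal sketch of that extraction step. The JAR location (here assumed to be the jhdf5 14.12.6 artifact in a local Maven repository) and the destination directory are placeholders; adjust them to your environment:

# Placeholder paths: point JHDF5_JAR at wherever your build keeps the jhdf5 JAR,
# and DEST at a shared filesystem location visible to all Spark executors.
JHDF5_JAR="$HOME/.m2/repository/cisd/jhdf5/14.12.6/jhdf5-14.12.6.jar"
DEST="/path/on/shared/filesystem"

# JARs are zip archives; -j drops the internal directory structure and keeps only the file.
unzip -j "$JHDF5_JAR" "native/jhdf5/amd64-Linux/libjhdf5.so" -d "$DEST"

# Then point the executors at the extracted library, as in the fix above:
# --conf spark.executor.extraJavaOptions=-Dnative.libpath.jhdf5=$DEST/libjhdf5.so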

carshadi commented 2 years ago

Confirming the above also works on my end:

--conf "spark.executor.extraJavaOptions=-Dnative.libpath.jhdf5=/allen/scratch/aindtemp/cameron.arshadi/tools/lib/libjhdf5.so" 
mkitti commented 2 years ago

It may be useful to consider using native.caching.libpath here. If the jhdf5 library does not exist at that path, this will extract it there. If it does exist, it will check the version and refresh it if needed; if the currently extracted version is correct, it will just use that.
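
As a sketch (not verified on a cluster), the caching variant would replace the fixed-path option used above; the directory below is a placeholder and just needs to be a shared, writable path visible to all executors:

# jhdf5 extracts libjhdf5.so into this directory on first use and refreshes it when the version changes.
--conf "spark.executor.extraJavaOptions=-Dnative.caching.libpath=/path/to/shared/jhdf5-cache"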