dask / dask-yarn

Deploy dask on YARN clusters
http://yarn.dask.org
BSD 3-Clause "New" or "Revised" License

Application fails to start with `java.lang.ClassNotFoundException` #101

Closed: yamrzou closed this issue 4 years ago

yamrzou commented 5 years ago

Starting a Dask `YarnCluster` fails with a `java.lang.ClassNotFoundException`.

Running the following code:

from dask_yarn import YarnCluster 
from dask.distributed import Client 

cluster = YarnCluster(environment='environment.tar.gz', 
                      worker_vcores=2, 
                      worker_memory="8GiB", 
                      queue='root')

Yields:

19/10/08 08:00:51 INFO client.RMProxy: Connecting to ResourceManager at <MASTER_HOSTNAME>:8032
19/10/08 08:00:51 INFO skein.Driver: Driver started, listening on 33435
19/10/08 08:00:52 INFO conf.Configuration: resource-types.xml not found
19/10/08 08:00:52 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
19/10/08 08:00:52 INFO skein.Driver: Uploading application resources to hdfs://<MASTER_HOSTNAME>:9000/user/spark/.skein/<APPLICATION ID>
19/10/08 08:00:57 INFO skein.Driver: Submitting application...
19/10/08 08:00:57 INFO impl.YarnClientImpl: Submitted application <APPLICATION ID>
19/10/08 08:00:58 INFO impl.YarnClientImpl: Killed application <APPLICATION ID>
...
DaskYarnError: Failed to start dask-yarn <APPLICATION ID>
See the application logs for more information

yarn logs -applicationId <APPLICATION ID>

The application log output:

Container: <CONTAINER ID> on <SLAVE_NODE>_45571
LogAggregationType: AGGREGATED
=============================================================================
LogType:application.master.log
LogLastModifiedTime:Tue Oct 08 08:00:59 +0000 2019
LogLength:938
LogContents:
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
    at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
    at java.lang.Class.getMethod0(Class.java:3018)
    at java.lang.Class.getMethod(Class.java:1784)
    at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 7 more

End of LogType:application.master.log
***************************************************************************************

Container: <CONTAINER ID> on <SLAVE_NODE>_45571
LogAggregationType: AGGREGATED
=============================================================================
LogType:directory.info
LogLastModifiedTime:Tue Oct 08 08:00:59 +0000 2019
LogLength:1735
LogContents:
ls -l:
total 24
-rw-r--r-- 1 yarn hadoop   75 Oct  8 08:00 container_tokens
-rwx------ 1 yarn hadoop  784 Oct  8 08:00 default_container_executor.sh
-rwx------ 1 yarn hadoop  729 Oct  8 08:00 default_container_executor_session.sh
-rwx------ 1 yarn hadoop 5776 Oct  8 08:00 launch_container.sh
drwx--x--- 2 yarn hadoop 4096 Oct  8 08:00 tmp
find -L . -maxdepth 5 -ls:
 79298787      4 drwx--x---   3 yarn     hadoop       4096 Oct  8 08:00 .
206700763   7660 -r-x------   1 yarn     hadoop    7841651 Oct  8 08:00 ./.skein.jar
 79298799      4 -rw-r--r--   1 yarn     hadoop         16 Oct  8 08:00 ./.default_container_executor.sh.crc
190578918      4 -r-x------   1 yarn     hadoop       1860 Oct  8 08:00 ./.skein.proto
 79298795      4 -rw-r--r--   1 yarn     hadoop         56 Oct  8 08:00 ./.launch_container.sh.crc
190578915      4 -r-x------   1 yarn     hadoop       1708 Oct  8 08:00 ./.skein.pem
 79298798      4 -rwx------   1 yarn     hadoop        784 Oct  8 08:00 ./default_container_executor.sh
 79298788      4 drwx--x---   2 yarn     hadoop       4096 Oct  8 08:00 ./tmp
212468002      4 -r-x------   1 yarn     hadoop       1013 Oct  8 08:00 ./.skein.crt
 79298793      8 -rwx------   1 yarn     hadoop       5776 Oct  8 08:00 ./launch_container.sh
 79298796      4 -rwx------   1 yarn     hadoop        729 Oct  8 08:00 ./default_container_executor_session.sh
 79298792      4 -rw-r--r--   1 yarn     hadoop         12 Oct  8 08:00 ./.container_tokens.crc
 79298789      4 -rw-r--r--   1 yarn     hadoop         75 Oct  8 08:00 ./container_tokens
 79298797      4 -rw-r--r--   1 yarn     hadoop         16 Oct  8 08:00 ./.default_container_executor_session.sh.crc
broken symlinks(find -L . -maxdepth 5 -type l -ls):

End of LogType:directory.info
*******************************************************************************

Container: <CONTAINER ID> on <SLAVE_NODE>_45571
LogAggregationType: AGGREGATED
=============================================================================
LogType:launch_container.sh
LogLastModifiedTime:Tue Oct 08 08:00:59 +0000 2019
LogLength:5776
LogContents:
#!/bin/bash

set -o pipefail -e
export PRELAUNCH_OUT="/usr/local/src/hadoop/logs/userlogs/<APPLICATION ID>/<CONTAINER ID>/prelaunch.out"
exec >"${PRELAUNCH_OUT}"
export PRELAUNCH_ERR="/usr/local/src/hadoop/logs/userlogs/<APPLICATION ID>/<CONTAINER ID>/prelaunch.err"
exec 2>"${PRELAUNCH_ERR}"
echo "Setting up env variables"
export JAVA_HOME=${JAVA_HOME:-"/docker-java-home"}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/usr/hdp/3.0.0.0-1634/hadoop/conf"}
export HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-"/usr/hdp/3.0.0.0-1634/hadoop-yarn"}
export HADOOP_MAPRED_HOME=${HADOOP_MAPRED_HOME:-"/usr/hdp/3.0.0.0-1634/hadoop-mapreduce"}
export HADOOP_TOKEN_FILE_LOCATION="/home/hadoop/nodemanager/local-dirs-7/usercache/spark/appcache/<APPLICATION ID>/<CONTAINER ID>/container_tokens"
export CONTAINER_ID="<CONTAINER ID>"
export NM_PORT="45571"
export NM_HOST="<SLAVE_NODE>"
export NM_HTTP_PORT="8042"
export LOCAL_DIRS="/home/hadoop/nodemanager/local-dirs-1/usercache/spark/appcache/<APPLICATION ID>,/home/hadoop/nodemanager/local-dirs-2/usercache/spark/appcache/<APPLICATION ID>,/home/hadoop/nodemanager/local-dirs-3/usercache/spark/appcache/<APPLICATION ID>,/home/hadoop/nodemanager/local-dirs-4/usercache/spark/appcache/<APPLICATION ID>,/home/hadoop/nodemanager/local-dirs-5/usercache/spark/appcache/<APPLICATION ID>,/home/hadoop/nodemanager/local-dirs-6/usercache/spark/appcache/<APPLICATION ID>,/home/hadoop/nodemanager/local-dirs-7/usercache/spark/appcache/<APPLICATION ID>,/home/hadoop/nodemanager/local-dirs-8/usercache/spark/appcache/<APPLICATION ID>"
export LOCAL_USER_DIRS="/home/hadoop/nodemanager/local-dirs-1/usercache/spark/,/home/hadoop/nodemanager/local-dirs-2/usercache/spark/,/home/hadoop/nodemanager/local-dirs-3/usercache/spark/,/home/hadoop/nodemanager/local-dirs-4/usercache/spark/,/home/hadoop/nodemanager/local-dirs-5/usercache/spark/,/home/hadoop/nodemanager/local-dirs-6/usercache/spark/,/home/hadoop/nodemanager/local-dirs-7/usercache/spark/,/home/hadoop/nodemanager/local-dirs-8/usercache/spark/"
export LOG_DIRS="/usr/local/src/hadoop/logs/userlogs/<APPLICATION ID>/<CONTAINER ID>"
export USER="spark"
export LOGNAME="spark"
export HOME="/home/"
export PWD="/home/hadoop/nodemanager/local-dirs-7/usercache/spark/appcache/<APPLICATION ID>/<CONTAINER ID>"
export JVM_PID="$$"
export MALLOC_ARENA_MAX="4"
export NM_AUX_SERVICE_spark_shuffle=""
export NM_AUX_SERVICE_mapreduce_shuffle="AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="
export APPLICATION_WEB_PROXY_BASE="/proxy/<APPLICATION ID>"
export SKEIN_APPLICATION_ID="<APPLICATION ID>"
export CLASSPATH="$CLASSPATH:./*:/etc/hadoop/conf:/usr/hdp/3.0.0.0-1634/hadoop-client/*:/usr/hdp/3.0.0.0-1634/hadoop-client/lib/*:/usr/hdp/3.0.0.0-1634/hadoop-hdfs-client/*:/usr/hdp/3.0.0.0-1634/hadoop-hdfs-client/lib/*:/usr/hdp/3.0.0.0-1634/hadoop-yarn-client/*:/usr/hdp/3.0.0.0-1634/hadoop-yarn-client/lib/*"
export LANG="C.UTF-8"
export APP_SUBMIT_TIME_ENV="1570521657344"
export HADOOP_USER_NAME="spark"
echo "Setting up job resources"
ln -sf "/home/hadoop/nodemanager/local-dirs-4/usercache/spark/appcache/<APPLICATION ID>/filecache/11/.skein.pem" ".skein.pem"
ln -sf "/home/hadoop/nodemanager/local-dirs-2/usercache/spark/appcache/<APPLICATION ID>/filecache/12/.skein.crt" ".skein.crt"
ln -sf "/home/hadoop/nodemanager/local-dirs-8/usercache/spark/appcache/<APPLICATION ID>/filecache/10/skein.jar" ".skein.jar"
ln -sf "/home/hadoop/nodemanager/local-dirs-4/usercache/spark/appcache/<APPLICATION ID>/filecache/13/.skein.proto" ".skein.proto"
echo "Copying debugging information"
# Creating copy of launch script
cp "launch_container.sh" "/usr/local/src/hadoop/logs/userlogs/<APPLICATION ID>/<CONTAINER ID>/launch_container.sh"
chmod 640 "/usr/local/src/hadoop/logs/userlogs/<APPLICATION ID>/<CONTAINER ID>/launch_container.sh"
# Determining directory contents
echo "ls -l:" 1>"/usr/local/src/hadoop/logs/userlogs/<APPLICATION ID>/<CONTAINER ID>/directory.info"
ls -l 1>>"/usr/local/src/hadoop/logs/userlogs/<APPLICATION ID>/<CONTAINER ID>/directory.info"
echo "find -L . -maxdepth 5 -ls:" 1>>"/usr/local/src/hadoop/logs/userlogs/<APPLICATION ID>/<CONTAINER ID>/directory.info"
find -L . -maxdepth 5 -ls 1>>"/usr/local/src/hadoop/logs/userlogs/<APPLICATION ID>/<CONTAINER ID>/directory.info"
echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"/usr/local/src/hadoop/logs/userlogs/<APPLICATION ID>/<CONTAINER ID>/directory.info"
find -L . -maxdepth 5 -type l -ls 1>>"/usr/local/src/hadoop/logs/userlogs/<APPLICATION ID>/<CONTAINER ID>/directory.info"
echo "Launching container"
exec /bin/bash -c "$JAVA_HOME/bin/java -Xmx128M -Dskein.log.level=INFO -Dskein.log.directory=/usr/local/src/hadoop/logs/userlogs/<APPLICATION ID>/<CONTAINER ID> com.anaconda.skein.ApplicationMaster hdfs://<MASTER_HOSTNAME>:9000/user/spark/.skein/<APPLICATION ID> >/usr/local/src/hadoop/logs/userlogs/<APPLICATION ID>/<CONTAINER ID>/application.master.log 2>&1"

End of LogType:launch_container.sh
************************************************************************************

End of LogType:prelaunch.err
******************************************************************************

Container: <CONTAINER ID> on <SLAVE_NODE>_45571
LogAggregationType: AGGREGATED
=============================================================================
LogType:prelaunch.out
LogLastModifiedTime:Tue Oct 08 08:00:59 +0000 2019
LogLength:100
LogContents:
Setting up env variables
Setting up job resources
Copying debugging information
Launching container

End of LogType:prelaunch.out
******************************************************************************

When passing a skein client explicitly, I get the same exception with a different missing class:

import skein
from dask_yarn import YarnCluster

client = skein.Client(log_level='debug')
cluster = YarnCluster(environment='environment.tar.gz', 
                      worker_vcores=2, 
                      worker_memory="8GiB", 
                      queue='root',
                      skein_client=client)

The application log output:

Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/io/DataOutputBuffer
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
    at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
    at java.lang.Class.getMethod0(Class.java:3018)
    at java.lang.Class.getMethod(Class.java:1784)
    at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.io.DataOutputBuffer
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 7 more

Any idea? Thank you.


jcrist commented 5 years ago

Hmmm, interesting. All of the classes not found are definitely part of your Hadoop distribution (`org/apache/hadoop/conf/Configuration`, for example, is the standard configuration class). The difference in which class fails to load is just non-deterministic behavior in Java's class loader.

It's odd that we're seeing JNI errors. Skein doesn't invoke the JNI explicitly, but Hadoop does try to load a native library (libhadoop) if available before falling back on a Java implementation; perhaps this has something to do with it?
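
One quick way to rule libhadoop in or out: `hadoop checknative` is a standard Hadoop CLI subcommand that reports whether the native library loads on a given node. A minimal sketch (assuming the `hadoop` CLI is on PATH) that you could run on both the edge node and a worker node:

import subprocess

# Sketch: `hadoop checknative -a` reports whether libhadoop and the
# other native components are found on this node; a non-zero exit code
# means at least one native component is missing.
result = subprocess.run(["hadoop", "checknative", "-a"])
print("native check exit code:", result.returncode)

If the two nodes disagree, that would point at an environment difference rather than a skein bug.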

Since the driver starts fine and you're seeing errors only in the application, I suspect there are differences between your edge node environment and your worker node environment. Are the Hadoop libraries in a different location on your worker nodes than on the edge node? As per the Hadoop documentation (https://hadoop.apache.org/docs/r3.1.2/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html), we set the application classpath based on the edge node environment; if that classpath isn't valid on a worker node, you may get class loading errors like those above.
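
If you want to check that concretely, the sketch below (assuming the `hadoop` CLI is on PATH on each node) prints the expanded classpath one entry per line; run it on the edge node and on a worker node and diff the output. Entries that resolve on the edge node but not on a worker would explain the `ClassNotFoundException` above.

import subprocess

# Sketch: `hadoop classpath --glob` expands classpath wildcards to the
# actual jar paths. Printing one sorted entry per line makes the output
# easy to diff across nodes.
classpath = subprocess.check_output(
    ["hadoop", "classpath", "--glob"], text=True
).strip()
for entry in sorted(classpath.split(":")):
    print(entry)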


Since this isn't a dask-yarn-specific issue, here's a smaller example script you can try to make debugging simpler:

import skein

spec = skein.ApplicationSpec.from_yaml("""
name: debug-skein
queue: root

master:
  script: echo "Things worked!"
""")

client = skein.Client()
client.submit(spec)

Running this will submit a small application, which should complete successfully (but in your case should fail with the same issues as above).
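
If the application is killed before it writes anything useful, you can also pull its final report from the client side instead of going through `yarn logs`. A sketch continuing the script above, using skein's application report (attribute names as in the skein docs; the application may take a moment to reach a terminal state, so you may need to poll):

# Sketch: capture the app id from the submit call above and inspect the
# application's state and YARN diagnostics from the client.
app_id = client.submit(spec)

report = client.application_report(app_id)
print(report.state)         # e.g. RUNNING / FINISHED / FAILED / KILLED
print(report.final_status)  # e.g. SUCCEEDED / FAILED
print(report.diagnostics)   # YARN's diagnostic message, if any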

yamrzou commented 5 years ago

Thanks a lot for your input. I checked `hadoop classpath` and `hadoop envvars`; both give the same output on the edge node and the worker nodes. I suspect it might be related to libhadoop, as you said, but it may take me some time before I can test that. I will report back once done.

yamrzou commented 4 years ago

Hi,

I re-tested this on a newly created Hadoop cluster and it worked without problems. The issue was very likely due to a configuration mismatch between the edge node and the worker nodes, which the new cluster does not have.

Closing the issue.