quartox opened this issue 7 years ago
@quartox apologies for the delayed response.
Do you mean that the YARN UI is showing that response? Can you query the status/logs from the CLI?
yarn application -status application_1495984298947_16153
yarn logs -applicationId application_1495984298947_16153
Can you also post the entire output from dask-yarn?
That was the full command line output (it never died on the command line). I just re-ran the job and got the following output:
command line
ClientConfig(127.0.0.1,52457)
17/06/06 08:40:58 INFO knit.Client$: Staring Application Master
Setting Replication Factor to: 3
Attemping upload of /home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/java_libs/knit-1.0-SNAPSHOT.jar
Uploading resource file:/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/java_libs/knit-1.0-SNAPSHOT.jar -> hdfs://nameservice1/user/jlord/.knitDeps/knit-1.0-SNAPSHOT.jar
Attemping upload of /home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/tmp_conda/miniconda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip
Uploading resource file:/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/tmp_conda/miniconda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip -> hdfs://nameservice1/user/jlord/.knitDeps/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip
17/06/06 08:41:07 INFO knit.Client$: Submitting application application_1496613553830_6169
17/06/06 08:41:07 INFO impl.YarnClientImpl: Submitted application application_1496613553830_6169
yarn application -status
Application Report :
Application-Id : application_1496613553830_6169
Application-Name : knit
Application-Type : YARN
User : jlord
Queue : root.default
Start-Time : 1496756467259
Finish-Time : 1496756467998
Progress : 0%
State : FAILED
Final-State : FAILED
Tracking-URL : http://hostname.allstate.com:8088/cluster/app/application_1496613553830_6169
RPC Port : -1
AM Host : N/A
Aggregate Resource Allocation : 954 MB-seconds, 0 vcore-seconds
Log Aggregation Status : NOT_START
Diagnostics : Application application_1496613553830_6169 failed 3 times due to AM Container for appattempt_1496613553830_6169_000003 exited with exitCode: -1000
For more detailed output, check application tracking page: http://hostname.allstate.com:8088/proxy/application_1496613553830_6169/ Then, click on links to logs of each attempt.
Diagnostics: org.apache.hadoop.security.AccessControlException: Permission denied: user=yarn, access=EXECUTE, inode="/user/jlord":jlord:hdpqra:drwxr-x---
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:279)
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:260)
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkTraverse(DefaultAuthorizationProvider.java:201)
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:154)
at org.apache.sentry.hdfs.SentryAuthorizationProvider.checkPermission(SentryAuthorizationProvider.java:178)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:152)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:3529)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6634)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4277)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:901)
at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getFileInfo(AuthorizationProviderProxyClientProtocol.java:528)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:839)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)
Failing this attempt. Failing the application.
yarn logs -applicationId
/tmp/logs/jlord/logs/application_1496613553830_6169 does not exist.
Log aggregation has not completed or is not enabled
Thanks for providing the logs!
It's a little hard to see but the job failed because of a permission error:
Diagnostics: org.apache.hadoop.security.AccessControlException: Permission denied: user=yarn, access=EXECUTE, inode="/user/jlord":jlord:hdpqra:drwxr-x---
I'm not sure why the user yarn is trying to write to (what I assume is your user's) jlord directory. What should happen is that jlord should be writing to jlord's directory. Specifically, to:
/user/jlord/.knitDeps/
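The denial in the diagnostics follows directly from standard POSIX-style permission evaluation, which HDFS applies for the traverse (EXECUTE) check on each path component. A minimal Python sketch of that evaluation (the `can_traverse` function and its signature are illustrative, not HDFS internals):

```python
# Hedged sketch: emulate the POSIX-style EXECUTE (traverse) check that HDFS
# performs on /user/jlord. The function is illustrative, not HDFS code.

def can_traverse(user, user_groups, owner, group, mode, perm_bit=1):
    """Return True if `user` may traverse a directory.

    mode is the octal permission, e.g. 0o750 for drwxr-x---.
    perm_bit: 4 = read, 2 = write, 1 = execute.
    """
    if user == owner:
        shift = 6          # owner bits
    elif group in user_groups:
        shift = 3          # group bits
    else:
        shift = 0          # "other" bits
    return bool((mode >> shift) & perm_bit)

# /user/jlord is jlord:hdpqra drwxr-x--- (0o750); user "yarn" is neither
# the owner nor (apparently) in group hdpqra, so EXECUTE falls through to
# the all-zero "other" bits -> Permission denied.
print(can_traverse("yarn", ["hadoop"], "jlord", "hdpqra", 0o750))   # False
print(can_traverse("jlord", ["hdpqra"], "jlord", "hdpqra", 0o750))  # True
print(can_traverse("yarn", ["hadoop"], "jlord", "hdpqra", 0o755))   # True
```

With drwxr-x---, anyone who is neither jlord nor in group hdpqra is evaluated against the zero "other" bits, which is exactly what the yarn user hit; opening the directory with o+x (as done later in this thread) makes the traverse succeed.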
We use Kerberos and Sentry for security. Any Hive job looks like it is being run by the hive user. I will try creating that directory and opening up write permission.
It successfully wrote /user/jlord/.knitDeps/ but it failed to execute /user/jlord/, so I opened up my home directory to be executable by all and it worked. I am guessing this is a Sentry issue, but I barely understand how Sentry operates.
The job still does not finish but the new output is:
yarn application -status
Application Report :
Application-Id : application_1496613553830_6188
Application-Name : knit
Application-Type : YARN
User : jlord
Queue : root.default
Start-Time : 1496757595631
Finish-Time : 0
Progress : 10%
State : RUNNING
Final-State : UNDEFINED
Tracking-URL : N/A
RPC Port : 45533
AM Host : 10.195.102.85
Aggregate Resource Allocation : 742292 MB-seconds, 359 vcore-seconds
Log Aggregation Status : NOT_START
Diagnostics :
But looking into the container logs:
17/06/06 09:00:05 ERROR knit.ApplicationMaster$: Received error
java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "hostname1.allstate.com/10.195.102.85"; destination host is: "hostname2.allstate.com":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1409)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2123)
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1253)
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1249)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1249)
at io.continuum.knit.Utils$.setUpLocalResource(Utils.scala:23)
at io.continuum.knit.ApplicationMaster$$anonfun$onContainersAllocated$1.apply(ApplicationMaster.scala:176)
at io.continuum.knit.ApplicationMaster$$anonfun$onContainersAllocated$1.apply(ApplicationMaster.scala:140)
at scala.collection.Iterator$class.foreach(Iterator.scala:743)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1174)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at io.continuum.knit.ApplicationMaster$.onContainersAllocated(ApplicationMaster.scala:140)
at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:292)
Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:688)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:651)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:739)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:376)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
at org.apache.hadoop.ipc.Client.call(Client.java:1448)
... 25 more
Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:561)
at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:376)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:731)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:727)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:726)
... 28 more
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193)
... 37 more
Exception in thread "AMRM Callback Handler Thread" org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "hostname1.allstate.com/10.195.102.85"; destination host is: "hostname2.allstate.com":8020;
at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:299)
Caused by: java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "hostname1.allstate.com/10.195.102.85"; destination host is: "hostname2.allstate.com":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1409)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2123)
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1253)
at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1249)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1249)
at io.continuum.knit.Utils$.setUpLocalResource(Utils.scala:23)
at io.continuum.knit.ApplicationMaster$$anonfun$onContainersAllocated$1.apply(ApplicationMaster.scala:176)
at io.continuum.knit.ApplicationMaster$$anonfun$onContainersAllocated$1.apply(ApplicationMaster.scala:140)
at scala.collection.Iterator$class.foreach(Iterator.scala:743)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1174)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at io.continuum.knit.ApplicationMaster$.onContainersAllocated(ApplicationMaster.scala:140)
at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:292)
Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:688)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:651)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:739)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:376)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
at org.apache.hadoop.ipc.Client.call(Client.java:1448)
... 25 more
Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:561)
at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:376)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:731)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:727)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:726)
... 28 more
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193)
... 37 more
I need to review how we're handling Kerberos. For now, can you confirm that you have a Kerberos ticket on the machine you are submitting from?
klist
Thanks again for helping us work through these issues.
Yes, I reproduced it with a Kerberos ticket. I automatically create one in my .bashrc, so I am pretty confident that it was there before as well.
Reading around, the most obvious thing to check first is the Java and Kerberos versions in your running environment versus on the YARN machines. Unfortunately, the trace above does not say what was happening to cause the failure. If this happened while reading the knit files, can you check their permissions? (If user yarn didn't initially have execute access on the directory, there's a high chance it doesn't have read access to the files.) There seem to be many ways to end up with this same error message.
I will add the top of the stderr log while I check on the other details you asked about.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop10/yarn/nm/filecache/17/knit-1.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/06/28 14:28:36 INFO knit.ApplicationMaster$: Construct application master
17/06/28 14:28:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/06/28 14:28:38 INFO impl.NMClientAsyncImpl: Upper bound of the thread pool size is 500
17/06/28 14:28:38 INFO knit.ApplicationMaster$: Init application master
17/06/28 14:28:38 INFO knit.ApplicationMaster$: 0 containers still running, requesting 2
17/06/28 14:28:38 INFO knit.ApplicationMaster$: Add 2 containers
17/06/28 14:28:38 INFO knit.ApplicationMaster$: Request: 500 MB of RAM and 1 cores
17/06/28 14:28:38 INFO knit.ApplicationMaster$: Requested container resources: <memory:500, vCores:1>
17/06/28 14:28:38 INFO knit.ApplicationMaster$: Requested container ask: Capability[<memory:500, vCores:1>]Priority[0]
17/06/28 14:28:38 INFO knit.ApplicationMaster$: Requested container resources: <memory:500, vCores:1>
17/06/28 14:28:38 INFO knit.ApplicationMaster$: Requested container ask: Capability[<memory:500, vCores:1>]Priority[0]
17/06/28 14:28:40 INFO knit.ApplicationMaster$: Running command in container: $PYTHON_BIN $CONDA_PREFIX/bin/dask-worker --nprocs=1 --nthreads=1 --memory-limit=500000000 tcp://10.195.208.189:46305 > /tmp/worker-log.out 2> /tmp/worker-log.err
17/06/28 14:28:40 WARN security.UserGroupInformation: PriviledgedActionException as:jlord (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
17/06/28 14:28:40 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
17/06/28 14:28:40 WARN security.UserGroupInformation: PriviledgedActionException as:jlord (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
17/06/28 14:28:40 WARN security.UserGroupInformation: PriviledgedActionException as:jlord (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
17/06/28 14:28:40 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
17/06/28 14:28:40 WARN security.UserGroupInformation: PriviledgedActionException as:jlord (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
17/06/28 14:28:40 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over hostname1.allstate.com/10.195.208.187:8020 after 1 fail over attempts. Trying to fail over immediately.
java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "hostname2.allstate.com/10.195.103.37"; destination host is: "hostname1.allstate.com":8020;
I had previously made the .knitDeps directory permissions drwxrwxrwx. The jar file and dask zip have -rw-r--r--.
Also, it looks like those uploaded successfully, since it makes it to running the dask-worker code.
In terms of the Java version, I built knit with the conda version of java-jdk, so the jar may have been built against a different version than the cluster's. I can try rebuilding knit with the default Java.
I would be surprised if the Kerberos versions differed between edge nodes and worker nodes, but I should check it out because I can't be certain.
I tried re-building knit with a local version of java and got the exact same error.
I found the specifics from the process that is running the jar:
/nas/isg_prodops_work/Analytics_COE/SW/Installs/jdk1.8.0_101/bin/java -Xmx1000m -Dhadoop.log.dir=/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/java_libs/knit-1.0-SNAPSHOT.jar io.continuum.knit.Client --callbackHost 127.0.0.1 --callbackPort 51423
I am assuming most of this was generated automatically based on our cluster settings, but hadoop-policy.xml kind of stands out as having no path. Not sure if you can find anything helpful in there.
@danielfrg , do you see anything obvious in the symptoms and error logs above, by any chance?
No more than noticing that it is a Kerberos(-related) issue :) I don't have a lot of experience with a kerberized cluster, so I have not seen these issues much.
I have tried changing the HADOOP_CONF_DIR env var under the assumption that the configuration does not know where to find my Kerberos ticket cache, and I am getting a parsing error:
Traceback (most recent call last):
File "test.py", line 2, in <module>
cluster = YARNCluster()
File "/nas/isg_prodops_work/jlord/learning/dask/dask-yarn/dask_yarn/core.py", line 49, in __init__
self.knit = Knit(autodetect=True, validate=validate)
File "/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/core.py", line 79, in __init__
self.nn, self.nn_port = self._hdfs_conf(autodetect)
File "/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/core.py", line 169, in _hdfs_conf
conf = parse_xml(core_site, 'fs.defaultFS')
File "/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/utils.py", line 35, in parse_xml
conf['port'] = int(conf['port'])
ValueError: invalid literal for int() with base 10: '//nameservice1'
I have tried a few different directories and each one gives the same results. Any ideas why it isn't parsing correctly and/or how to try to fix it?
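The traceback shows int() being called on '//nameservice1': parse_xml evidently splits fs.defaultFS on ':' and assumes a host:port pair, but an HA logical nameservice URI carries no port at all. A hedged sketch of a more tolerant parse (the parse_default_fs helper and the 8020 fallback are illustrative assumptions, not knit's actual code):

```python
# Hedged sketch: parse an fs.defaultFS value robustly. An HA nameservice
# like hdfs://nameservice1 has no port, so int() on the remainder fails.
from urllib.parse import urlsplit

def parse_default_fs(value, default_port=8020):
    """Return (host, port) from an fs.defaultFS value.

    For an HA logical nameservice the URI has no port; fall back to the
    conventional NameNode RPC port (an assumption -- the real ports live
    under dfs.namenode.rpc-address.* in hdfs-site.xml).
    """
    parts = urlsplit(value)
    port = parts.port if parts.port is not None else default_port
    return parts.hostname, port

print(parse_default_fs("hdfs://nameservice1"))           # ('nameservice1', 8020)
print(parse_default_fs("hdfs://namenode.example:9000"))  # ('namenode.example', 9000)
```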
Is this with the new branch from https://github.com/dask/knit/pull/66 (where dask-yarn is included as part of knit)? I had put some work there into various ways to percolate parameters through (and you should always be able to override basic settings from a conf file with keyword parameters).
No, this is old. I will start working with that version.
Using the merge_dask_yarn branch I get the same parsing error. Code is:
from knit.dask_yarn import DaskYARNCluster
cluster = DaskYARNCluster()
Error:
Traceback (most recent call last):
File "test.py", line 2, in <module>
cluster = DaskYARNCluster()
File "/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/dask_yarn.py", line 62, in __init__
user=user, validate=validate, autodetect=autodetect)
File "/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/core.py", line 97, in __init__
self._hdfs_conf(autodetect, validate)
File "/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/core.py", line 175, in _hdfs_conf
conf = parse_xml(f, 'fs.defaultFS')
File "/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/utils.py", line 35, in parse_xml
conf['port'] = int(conf['port'])
ValueError: invalid literal for int() with base 10: '//nameservice1'
Exception ignored in: <bound method Knit.__del__ of Knit<NN=None:None;RM=None:None>>
Traceback (most recent call last):
File "/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/core.py", line 465, in __del__
if self.app_id is not None:
AttributeError: 'Knit' object has no attribute 'app_id'
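The secondary AttributeError at the end is a separate, minor bug: __init__ raised before self.app_id was assigned, and __del__ then touched the missing attribute during garbage collection. A small illustrative sketch of the failure mode and the usual getattr guard (the Fragile class is hypothetical, not knit's real Knit class):

```python
# Hedged sketch: when __init__ raises before app_id is assigned, __del__
# runs on the half-constructed object. getattr with a default keeps the
# teardown quiet. Class and names are illustrative only.

class Fragile:
    def __init__(self):
        raise ValueError("config parse failed")  # mimics the parse_xml error
        self.app_id = None                       # never reached

    def __del__(self):
        # getattr avoids AttributeError when __init__ died early
        if getattr(self, "app_id", None) is not None:
            print("would kill YARN application", self.app_id)

try:
    Fragile()
except ValueError:
    pass  # only the original error surfaces; __del__ stays silent
```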
OK, this I can fix - I'll get back to you on Tuesday. Does your config file specify a port at all?
The fs.defaultFS is hdfs://nameservice1. Does that then mean that the default HDFS namenode port (8020) is being used?
I can't find any port in core-site.xml. In hdfs-site.xml there are a few options that mention the nameservice port. I changed the hostnames, but this is the relevant section:
<property>
<name>dfs.nameservices</name>
<value>nameservice1</value>
</property>
<property>
<name>dfs.ha.namenodes.nameservice1</name>
<value>namenode53,namenode941</value>
</property>
<property>
<name>dfs.namenode.rpc-address.nameservice1.namenode53</name>
<value>hostname.allstate.com:8020</value>
</property>
<property>
<name>dfs.namenode.servicerpc-address.nameservice1.namenode53</name>
<value>hostname.allstate.com:8022</value>
</property>
<property>
<name>dfs.namenode.http-address.nameservice1.namenode53</name>
<value>hostname.allstate.com:50070</value>
</property>
<property>
<name>dfs.namenode.https-address.nameservice1.namenode53</name>
<value>hostname.allstate.com:50470</value>
</property>
<property>
<name>dfs.namenode.rpc-address.nameservice1.namenode941</name>
<value>hostname2.allstate.com:8020</value>
</property>
<property>
<name>dfs.namenode.servicerpc-address.nameservice1.namenode941</name>
<value>hostname2.allstate.com:8022</value>
</property>
<property>
<name>dfs.namenode.http-address.nameservice1.namenode941</name>
<value>hostname2.allstate.com:50070</value>
</property>
<property>
<name>dfs.namenode.https-address.nameservice1.namenode941</name>
<value>hostname2.allstate.com:50470</value>
</property>
Interesting, this must be a multiple-server (high-availability) configuration. Your default FS is "nameservice1", which is referenced through to namenode53 and namenode941, so the actual locations are hostname.allstate.com:8020 and hostname2.allstate.com:8020.
This is similar to the High-Availability discussion in http://hdfs3.readthedocs.io/en/latest/hdfs.html#high-availability-mode .
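The indirection can be followed mechanically: dfs.nameservices names the logical service, dfs.ha.namenodes.&lt;ns&gt; lists its NameNodes, and dfs.namenode.rpc-address.&lt;ns&gt;.&lt;nn&gt; gives each concrete host:port. A minimal Python sketch of that resolution, using a trimmed copy of the hdfs-site.xml excerpt above (the resolve_nameservice helper is illustrative, not knit's code):

```python
# Hedged sketch: resolve an HA logical nameservice (hdfs://nameservice1)
# to concrete NameNode host:port pairs, using the dfs.* keys shown in
# the hdfs-site.xml excerpt. Helper is illustrative, not knit's code.
import xml.etree.ElementTree as ET

HDFS_SITE = """
<configuration>
  <property><name>dfs.nameservices</name><value>nameservice1</value></property>
  <property><name>dfs.ha.namenodes.nameservice1</name>
            <value>namenode53,namenode941</value></property>
  <property><name>dfs.namenode.rpc-address.nameservice1.namenode53</name>
            <value>hostname.allstate.com:8020</value></property>
  <property><name>dfs.namenode.rpc-address.nameservice1.namenode941</name>
            <value>hostname2.allstate.com:8020</value></property>
</configuration>
"""

def resolve_nameservice(xml_text):
    # Flatten <property><name>/<value> pairs into a dict, then follow
    # the nameservice -> namenodes -> rpc-address chain.
    props = {p.findtext("name"): p.findtext("value")
             for p in ET.fromstring(xml_text).iter("property")}
    ns = props["dfs.nameservices"]
    nodes = props[f"dfs.ha.namenodes.{ns}"].split(",")
    return [props[f"dfs.namenode.rpc-address.{ns}.{nn}"] for nn in nodes]

print(resolve_nameservice(HDFS_SITE))
# ['hostname.allstate.com:8020', 'hostname2.allstate.com:8020']
```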
OK, so that's all very interesting, but now that I'm looking through the code, I can't actually see where the value of the port is being used (aside from my own hdfs3 additions, which are optional). Maybe the YARN and HDFS clients can be assumed to read the parameters correctly on the Scala side and none of this matters? What happens if you suppress all the file-finding and conversion exceptions, such that the port is still None when calling into the Java gateway? What happens if you explicitly pass nn_port=8020?
Passing nn_port=8020 clears the config-reading error but gets back to the Kerberos error. I was hoping the config would fix the Kerberos error, but obviously not.
I still can't figure out how it is able to upload files but then fails to run Python on the worker. The upload must be using the Kerberos ticket while the application is not.
I note that it is the dask worker that is giving the error: the YARN ApplicationMaster must have launched successfully to have created a container, but the container doesn't have credentials. Code snippets like this suggest that the ApplicationMaster should pass in the HDFS (delegation) token, but we are not (cf. https://github.com/dask/knit/blob/master/knit_jvm/src/main/scala/io/continuum/knit/ApplicationMaster.scala#L185 ). Since I don't have a kerberized setup to play with at the moment, I am shooting in the dark, and my colleague assures me that the Kerberos auth was working previously (or still is for other setups).
@martindurant, when I last tested with Kerberos, things generally worked; however, this was naively done. I would suggest looking at the Spark codebase -- specifically, around UGI and keytabs.
Steve Loughran's gitbook on Kerberos is also a really good resource.
Probably the best thing to do is to set up a kerberized cluster again.
@quartox, we may be able to glean a bit more information if you run with a few more debug options:
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true -Dsun.security.krb5.debug=true ${HADOOP_OPTS}"
@quasiben , thanks for chiming in, I was going to ask you separately. Let's table setting up a cluster until after the busy two weeks ahead, and perhaps @quartox hits on a solution in the meantime.
@quasiben @martindurant would you mind sending me contact info so I can share the debug output offline? My email is on my github home page.
One detail from Steve Loughran's gitbook: our kerberos cluster uses Microsoft Active Directory, and as he puts it: "For Developers, it is worth knowing that AD is subtly different from the MIT/Unix Kerberos controller, enough so that you should really test with a cluster using AD as the Kerbers infrastructure, alongside the MIT KDC."
Recap of where we are:
- hdfs://~/.knitDeps
- UserGroupInformation.getLoginUser() method)

That's a puzzling set of behaviours. I'll get back to you next week.
I confused myself again because my HADOOP_CONF_DIR changes the logging level, so it suppresses the output of knit.Client. The same behavior happens whether this environment variable is set or unset. I have amended the previous comment to reflect this.
I printed lots of additional details from the Python end. I did this to understand why HADOOP_CONF_DIR was breaking the job, but since it doesn't break the job, this new detail isn't that informative. I still list the details in case we can learn anything from them.
The job successfully:
- starts knit.Client with the following output (note that I added a print statement for the gateway_port from Python that shows after Scala sends the gateway port):
17/07/13 13:04:33 INFO knit.Client$: First line of main
17/07/13 13:04:33 INFO knit.Client$: Started PythonGatewayServer on port 35585
17/07/13 13:04:34 INFO knit.Client$: ClientConfig(127.0.0.1,40839)
17/07/13 13:04:34 INFO knit.Client$: Communicating GatewayServer port to Python driver at 127.0.0.1:40839
gateway_port 35585
17/07/13 13:04:34 INFO knit.Client$: Staring Application Master
17/07/13 13:04:40 INFO knit.Utils$: Setting Replication Factor to: 3
17/07/13 13:04:40 INFO knit.Utils$: Attemping upload of /home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/java_libs/knit-1.0-SNAPSHOT.jar
17/07/13 13:04:42 INFO knit.Client$: hdfs://nameservice1/user/jlord/.knitDeps/knit-1.0-SNAPSHOT.jar
17/07/13 13:04:43 INFO knit.Utils$: Attemping upload of /home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/tmp_conda/miniconda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip
17/07/13 13:04:51 INFO knit.Client$: Submitting application application_1498411385005_54814
17/07/13 13:04:52 INFO impl.YarnClientImpl: Submitted application application_1498411385005_54814
- creates a JavaGateway for the client
- gets the app_id from the gateway entry point start method
- gets master_rpcport and master_rpchost
- creates a JavaGateway for the master

I'm having the same issue (with an AD cluster). I agree that the logs imply that the ApplicationMaster is not providing delegation tokens to the container(s). I don't think I know enough about Java/Scala to be much help in solving the problem, but I'm more than willing to test or try to provide more context and logs.
I have finally merged https://github.com/dask/knit/pull/66 and will be attempting experiments in the coming days. Please join me over there, and I hope we can converge on a solution.
I can submit a job (with new knit and replication factor 3) but my job browser immediately shows the following error:
The command line output is the following: