dask / old-dask-yarn

Deprecated, please use https://github.com/jcrist/skein or https://github.com/dask/dask-yarn instead

hadoop yarn webapp NotFoundException #17

Open quartox opened 7 years ago

quartox commented 7 years ago

I can submit a job (with new knit and replication factor 3) but my job browser immediately shows the following error:

Could not find job application_1495984298947_16153.

{"RemoteException":{"exception":"NotFoundException","message":"java.lang.Exception: job, job_1495984298947_16153, is not found","javaClassName":"org.apache.hadoop.yarn.webapp.NotFoundException"}} (error 404)

The command line output is the following:

ClientConfig(127.0.0.1,53775)
17/06/02 17:39:32 INFO knit.Client$: Staring Application Master
Setting Replication Factor to: 3
Attemping upload of /home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/java_libs/knit-1.0-SNAPSHOT.jar
Uploading resource file:/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/java_libs/knit-1.0-SNAPSHOT.jar -> hdfs://nameservice1/user/jlord/.knitDeps/knit-1.0-SNAPSHOT.jarhdfs://nameservice1/user/jlord/.knitDeps/knit-1.0-SNAPSHOT.jar
Attemping upload of /home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/tmp_conda/miniconda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip
Uploading resource file:/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/tmp_conda/miniconda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip -> hdfs://nameservice1/user/jlord/.knitDeps/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip17/06/02 17:39:42 INFO knit.Client$: Submitting application application_1495984298947_16153
17/06/02 17:39:42 INFO impl.YarnClientImpl: Submitted application application_1495984298947_16153
quasiben commented 7 years ago

@quartox apologies for the delayed response.

Do you mean that the YARN UI is showing that response? Can you query the status and logs from the CLI?

yarn application -status application_1495984298947_16153
yarn logs -applicationId application_1495984298947_16153

Can you also post the entire output from dask-yarn?

quartox commented 7 years ago

That was the full command line output (it never died on the command line). I just re-ran the job and got the following output:

command line

ClientConfig(127.0.0.1,52457)
17/06/06 08:40:58 INFO knit.Client$: Staring Application Master
Setting Replication Factor to: 3
Attemping upload of /home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/java_libs/knit-1.0-SNAPSHOT.jar
Uploading resource file:/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/java_libs/knit-1.0-SNAPSHOT.jar -> hdfs://nameservice1/user/jlord/.knitDeps/knit-1.0-SNAPSHOT.jarhdfs://nameservice1/user/jlord/.knitDeps/knit-1.0-SNAPSHOT.jar
Attemping upload of /home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/tmp_conda/miniconda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip
Uploading resource file:/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/tmp_conda/miniconda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip -> hdfs://nameservice1/user/jlord/.knitDeps/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip17/06/06 08:41:07 INFO knit.Client$: Submitting application application_1496613553830_6169
17/06/06 08:41:07 INFO impl.YarnClientImpl: Submitted application application_1496613553830_6169

yarn application -status

Application Report : 
    Application-Id : application_1496613553830_6169
    Application-Name : knit
    Application-Type : YARN
    User : jlord
    Queue : root.default
    Start-Time : 1496756467259
    Finish-Time : 1496756467998
    Progress : 0%
    State : FAILED
    Final-State : FAILED
    Tracking-URL : http://hostname.allstate.com:8088/cluster/app/application_1496613553830_6169
    RPC Port : -1
    AM Host : N/A
    Aggregate Resource Allocation : 954 MB-seconds, 0 vcore-seconds
    Log Aggregation Status : NOT_START
    Diagnostics : Application application_1496613553830_6169 failed 3 times due to AM Container for appattempt_1496613553830_6169_000003 exited with  exitCode: -1000
For more detailed output, check application tracking page:http://hostname.allstate.com:8088/proxy/application_1496613553830_6169/Then, click on links to logs of each attempt.
Diagnostics: org.apache.hadoop.security.AccessControlException: Permission denied: user=yarn, access=EXECUTE, inode="/user/jlord":jlord:hdpqra:drwxr-x---
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:279)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:260)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkTraverse(DefaultAuthorizationProvider.java:201)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:154)
    at org.apache.sentry.hdfs.SentryAuthorizationProvider.checkPermission(SentryAuthorizationProvider.java:178)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:152)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:3529)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6634)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:4277)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:901)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getFileInfo(AuthorizationProviderProxyClientProtocol.java:528)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:839)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)

Failing this attempt. Failing the application.

yarn logs -applicationId

/tmp/logs/jlord/logs/application_1496613553830_6169does not exist.
Log aggregation has not completed or is not enabled
quasiben commented 7 years ago

Thanks for providing the logs!

It's a little hard to see but the job failed because of a permission error:

Diagnostics: org.apache.hadoop.security.AccessControlException: Permission denied: user=yarn, access=EXECUTE, inode="/user/jlord":jlord:hdpqra:drwxr-x---

I'm not sure why the user yarn is trying to write to (what I assume is your user's) jlord directory. What should happen is that jlord should be writing to jlord's directory. Specifically, to:

/user/jlord/.knitDeps/
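A local sketch of the failing check (not HDFS itself): HDFS uses the same POSIX-style mode bits, so the `drwxr-x---` in the diagnostic means any user outside the owner and group (the `yarn` user here) lacks the EXECUTE (traverse) bit on /user/jlord. The mode values and the `hdfs dfs -chmod` analog mentioned in the comments are illustrative assumptions, not commands from the thread.

```python
# Demonstrate the mode strings from the diagnostic using local directories.
# The HDFS-side fix would be something like: hdfs dfs -chmod o+x /user/jlord
import os
import stat
import tempfile

root = tempfile.mkdtemp()
home = os.path.join(root, "user", "jlord")
os.makedirs(os.path.join(home, ".knitDeps"))

os.chmod(home, 0o750)                        # drwxr-x---: 'other' cannot traverse
print(stat.filemode(os.stat(home).st_mode))  # drwxr-x---

os.chmod(home, 0o751)                        # grant only the traverse bit to 'other'
print(stat.filemode(os.stat(home).st_mode))  # drwxr-x--x
```

On a directory, the execute bit gates traversal into it, independently of read; that is why `access=EXECUTE` is what the NameNode denies.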

quartox commented 7 years ago

We use Kerberos and Sentry for security. Any Hive job looks like it is being written by the hive user. I will try creating that directory and opening up write permission.

quartox commented 7 years ago

It successfully wrote /user/jlord/.knitDeps/ but failed to execute /user/jlord/, so I opened up my home directory to be executable by all and it worked. I am guessing this is a Sentry issue, but I barely understand how Sentry operates.

quartox commented 7 years ago

The job still does not finish but the new output is:

yarn application -status

Application Report : 
    Application-Id : application_1496613553830_6188
    Application-Name : knit
    Application-Type : YARN
    User : jlord
    Queue : root.default
    Start-Time : 1496757595631
    Finish-Time : 0
    Progress : 10%
    State : RUNNING
    Final-State : UNDEFINED
    Tracking-URL : N/A
    RPC Port : 45533
    AM Host : 10.195.102.85
    Aggregate Resource Allocation : 742292 MB-seconds, 359 vcore-seconds
    Log Aggregation Status : NOT_START
    Diagnostics : 

But looking into the container logs:

17/06/06 09:00:05 ERROR knit.ApplicationMaster$: Received error
java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "hostname1.allstate.com/10.195.102.85"; destination host is: "hostname2.allstate.com":8020; 
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
    at org.apache.hadoop.ipc.Client.call(Client.java:1476)
    at org.apache.hadoop.ipc.Client.call(Client.java:1409)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
    at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
    at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2123)
    at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1253)
    at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1249)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1249)
    at io.continuum.knit.Utils$.setUpLocalResource(Utils.scala:23)
    at io.continuum.knit.ApplicationMaster$$anonfun$onContainersAllocated$1.apply(ApplicationMaster.scala:176)
    at io.continuum.knit.ApplicationMaster$$anonfun$onContainersAllocated$1.apply(ApplicationMaster.scala:140)
    at scala.collection.Iterator$class.foreach(Iterator.scala:743)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1174)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at io.continuum.knit.ApplicationMaster$.onContainersAllocated(ApplicationMaster.scala:140)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:292)
Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
    at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:688)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:651)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:739)
    at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:376)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
    at org.apache.hadoop.ipc.Client.call(Client.java:1448)
    ... 25 more
Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
    at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
    at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:561)
    at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:376)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:731)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:727)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:726)
    ... 28 more
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
    at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
    at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
    at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
    at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193)
    ... 37 more
Exception in thread "AMRM Callback Handler Thread" org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "hostname1.allstate.com/10.195.102.85"; destination host is: "hostname2.allstate.com":8020; 
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:299)
Caused by: java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "hostname1.allstate.com/10.195.102.85"; destination host is: "hostname2.allstate.com":8020; 
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
    at org.apache.hadoop.ipc.Client.call(Client.java:1476)
    at org.apache.hadoop.ipc.Client.call(Client.java:1409)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
    at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
    at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2123)
    at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1253)
    at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1249)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1249)
    at io.continuum.knit.Utils$.setUpLocalResource(Utils.scala:23)
    at io.continuum.knit.ApplicationMaster$$anonfun$onContainersAllocated$1.apply(ApplicationMaster.scala:176)
    at io.continuum.knit.ApplicationMaster$$anonfun$onContainersAllocated$1.apply(ApplicationMaster.scala:140)
    at scala.collection.Iterator$class.foreach(Iterator.scala:743)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1174)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at io.continuum.knit.ApplicationMaster$.onContainersAllocated(ApplicationMaster.scala:140)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:292)
Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
    at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:688)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:651)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:739)
    at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:376)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
    at org.apache.hadoop.ipc.Client.call(Client.java:1448)
    ... 25 more
Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
    at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
    at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:561)
    at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:376)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:731)
    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:727)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:726)
    ... 28 more
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
    at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
    at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
    at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
    at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193)
    ... 37 more
quasiben commented 7 years ago

I need to do a review of how we're handling Kerberos. For now, can you confirm that you have a Kerberos ticket on the machine you are submitting from?

klist

quasiben commented 7 years ago

Thanks again for helping us work through these issues.

quartox commented 7 years ago

Yes, I reproduced it with a Kerberos ticket. I automatically create one in my .bashrc, so I am pretty confident that it was there before as well.

martindurant commented 7 years ago

Reading around, the most obvious thing to check first is whether the Java and Kerberos versions in your running environment match those on the YARN machines. Unfortunately, the trace above does not say what was happening at the time of the failure. If it occurred while reading the knit files, can you check their permissions? If the user 'yarn' didn't initially have execute access on the directory, there is a high chance it doesn't have read access to the files either. (There seem to be many ways to end up with that same error message.)

quartox commented 7 years ago

I will add the top of the stderr log while I check on the other details you asked about.

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/hadoop10/yarn/nm/filecache/17/knit-1.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/06/28 14:28:36 INFO knit.ApplicationMaster$: Construct application master
17/06/28 14:28:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/06/28 14:28:38 INFO impl.NMClientAsyncImpl: Upper bound of the thread pool size is 500
17/06/28 14:28:38 INFO knit.ApplicationMaster$: Init application master
17/06/28 14:28:38 INFO knit.ApplicationMaster$: 0 containers still running, requesting 2
17/06/28 14:28:38 INFO knit.ApplicationMaster$: Add 2 containers
17/06/28 14:28:38 INFO knit.ApplicationMaster$: Request: 500 MB of RAM and 1 cores
17/06/28 14:28:38 INFO knit.ApplicationMaster$: Requested container resources: <memory:500, vCores:1>
17/06/28 14:28:38 INFO knit.ApplicationMaster$: Requested container ask: Capability[<memory:500, vCores:1>]Priority[0]
17/06/28 14:28:38 INFO knit.ApplicationMaster$: Requested container resources: <memory:500, vCores:1>
17/06/28 14:28:38 INFO knit.ApplicationMaster$: Requested container ask: Capability[<memory:500, vCores:1>]Priority[0]
17/06/28 14:28:40 INFO knit.ApplicationMaster$: Running command in container: $PYTHON_BIN $CONDA_PREFIX/bin/dask-worker --nprocs=1 --nthreads=1 --memory-limit=500000000 tcp://10.195.208.189:46305 > /tmp/worker-log.out 2> /tmp/worker-log.err
17/06/28 14:28:40 WARN security.UserGroupInformation: PriviledgedActionException as:jlord (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
17/06/28 14:28:40 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
17/06/28 14:28:40 WARN security.UserGroupInformation: PriviledgedActionException as:jlord (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
17/06/28 14:28:40 WARN security.UserGroupInformation: PriviledgedActionException as:jlord (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
17/06/28 14:28:40 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
17/06/28 14:28:40 WARN security.UserGroupInformation: PriviledgedActionException as:jlord (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
17/06/28 14:28:40 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over hostname1.allstate.com/10.195.208.187:8020 after 1 fail over attempts. Trying to fail over immediately.
java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "hostname2.allstate.com/10.195.103.37"; destination host is: "hostname1.allstate.com":8020; 
quartox commented 7 years ago

I had previously made the .knitDeps directory permissions drwxrwxrwx. The jar file and dask zip have -rw-r--r--.

Also, it looks like those files uploaded successfully, since it makes it as far as running the dask-worker code.

quartox commented 7 years ago

In terms of the Java version, I built knit with the conda version of java-jdk, so the jar may have been built with a different version than the cluster's. I can try rebuilding knit with the default Java.

I would be surprised if the Kerberos versions differed between the edge nodes and the worker nodes, but I should check because I can't be certain.

quartox commented 7 years ago

I tried re-building knit with a local version of java and got the exact same error.

I found the specifics from the process that is running the jar:

/nas/isg_prodops_work/Analytics_COE/SW/Installs/jdk1.8.0_101/bin/java -Xmx1000m -Dhadoop.log.dir=/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/java_libs/knit-1.0-SNAPSHOT.jar io.continuum.knit.Client --callbackHost 127.0.0.1 --callbackPort 51423

I am assuming most of this was generated automatically from our cluster settings, but the hadoop-policy.xml kind of stands out as having no path. Not sure if you can find anything helpful in there.

martindurant commented 7 years ago

@danielfrg , do you see anything obvious in the symptoms and error logs above, by any chance?

danielfrg commented 7 years ago

No more than noticing that it is a Kerberos-related issue :) I don't have a lot of experience with kerberized clusters, so I have not seen these issues much.

quartox commented 7 years ago

I have tried changing the HADOOP_CONF_DIR env var, under the assumption that the configuration does not know where to find my Kerberos ticket cache, and I am getting a parsing error:

Traceback (most recent call last):
  File "test.py", line 2, in <module>
    cluster = YARNCluster()
  File "/nas/isg_prodops_work/jlord/learning/dask/dask-yarn/dask_yarn/core.py", line 49, in __init__
    self.knit = Knit(autodetect=True, validate=validate)
  File "/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/core.py", line 79, in __init__
    self.nn,  self.nn_port = self._hdfs_conf(autodetect)
  File "/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/core.py", line 169, in _hdfs_conf
    conf = parse_xml(core_site, 'fs.defaultFS')
  File "/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/utils.py", line 35, in parse_xml
    conf['port'] = int(conf['port'])
ValueError: invalid literal for int() with base 10: '//nameservice1'

I have tried a few different directories and each one gives the same results. Any ideas why it isn't parsing correctly and/or how to try to fix it?
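The ValueError is consistent with a naive host:port split of fs.defaultFS: an HA nameservice URI like hdfs://nameservice1 carries no port at all. A small stdlib sketch (not knit's actual implementation) of the suspected failure, and a more defensive parse:

```python
# Reproduce the suspected parse failure and show a defensive alternative.
from urllib.parse import urlsplit

default_fs = "hdfs://nameservice1"  # fs.defaultFS from this cluster's core-site.xml

# A naive scheme:rest split leaves "//nameservice1" where a port is expected:
_, rest = default_fs.split(":", 1)
try:
    int(rest)
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: '//nameservice1'

# urlsplit reports a missing port as None, so the caller can fall back to a
# conventional namenode RPC port (8020 here is an assumption, not from knit):
parts = urlsplit(default_fs)
port = parts.port if parts.port is not None else 8020
print(parts.hostname, port)  # nameservice1 8020
```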

martindurant commented 7 years ago

Is this in the new branch of https://github.com/dask/knit/pull/66 (where dask-yarn is included as part of knit)? I had put some work there into various ways to percolate parameters through (and you should be able to always override basic settings in a conf file with keyword parameters).

quartox commented 7 years ago

No, this is old. I will start working with that version.

quartox commented 7 years ago

Using the merge_dask_yarn branch I get the same parsing error. Code is

from knit.dask_yarn import DaskYARNCluster
cluster = DaskYARNCluster()

Error:

Traceback (most recent call last):
  File "test.py", line 2, in <module>
    cluster = DaskYARNCluster()
  File "/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/dask_yarn.py", line 62, in __init__
    user=user, validate=validate, autodetect=autodetect)
  File "/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/core.py", line 97, in __init__
    self._hdfs_conf(autodetect, validate)
  File "/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/core.py", line 175, in _hdfs_conf
    conf = parse_xml(f, 'fs.defaultFS')
  File "/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/utils.py", line 35, in parse_xml
    conf['port'] = int(conf['port'])
ValueError: invalid literal for int() with base 10: '//nameservice1'
Exception ignored in: <bound method Knit.__del__ of Knit<NN=None:None;RM=None:None>>
Traceback (most recent call last):
  File "/home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/core.py", line 465, in __del__
    if self.app_id is not None:
AttributeError: 'Knit' object has no attribute 'app_id'
martindurant commented 7 years ago

OK, this I can fix - I'll get back to you on Tuesday. Does your config file specify a port at all?

quartox commented 7 years ago

The fs.defaultFS is hdfs://nameservice1

martindurant commented 7 years ago

Does that then mean that the default HDFS namenode port (8020) is being used?

quartox commented 7 years ago

I can't find any port in core-site.xml. In hdfs-site.xml there are a few options that mention the nameservice port. I changed the hostnames, but this is the relevant section:

  <property>
    <name>dfs.nameservices</name>
    <value>nameservice1</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.nameservice1</name>
    <value>namenode53,namenode941</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.nameservice1.namenode53</name>
    <value>hostname.allstate.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.servicerpc-address.nameservice1.namenode53</name>
    <value>hostname.allstate.com:8022</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.nameservice1.namenode53</name>
    <value>hostname.allstate.com:50070</value>
  </property>
  <property>
    <name>dfs.namenode.https-address.nameservice1.namenode53</name>
    <value>hostname.allstate.com:50470</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.nameservice1.namenode941</name>
    <value>hostname2.allstate.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.servicerpc-address.nameservice1.namenode941</name>
    <value>hostname2.allstate.com:8022</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.nameservice1.namenode941</name>
    <value>hostname2.allstate.com:50070</value>
  </property>
  <property>
    <name>dfs.namenode.https-address.nameservice1.namenode941</name>
    <value>hostname2.allstate.com:50470</value>
  </property>
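To make the indirection concrete, here is a short sketch of the lookup an HDFS client performs against hdfs-site.xml to turn the logical nameservice into real endpoints. The XML is an inline copy of the rpc-address properties quoted above; the helper function is illustrative, not part of knit.

```python
# Resolve a logical HA nameservice to concrete namenode RPC endpoints.
import xml.etree.ElementTree as ET

HDFS_SITE = """
<configuration>
  <property><name>dfs.nameservices</name><value>nameservice1</value></property>
  <property><name>dfs.ha.namenodes.nameservice1</name><value>namenode53,namenode941</value></property>
  <property><name>dfs.namenode.rpc-address.nameservice1.namenode53</name><value>hostname.allstate.com:8020</value></property>
  <property><name>dfs.namenode.rpc-address.nameservice1.namenode941</name><value>hostname2.allstate.com:8020</value></property>
</configuration>
"""

def resolve_nameservice(xml_text, service):
    # Build a flat {name: value} map from the <property> elements.
    conf = {p.findtext("name"): p.findtext("value")
            for p in ET.fromstring(xml_text).iter("property")}
    nodes = conf[f"dfs.ha.namenodes.{service}"].split(",")
    return [conf[f"dfs.namenode.rpc-address.{service}.{n}"] for n in nodes]

print(resolve_nameservice(HDFS_SITE, "nameservice1"))
# ['hostname.allstate.com:8020', 'hostname2.allstate.com:8020']
```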
martindurant commented 7 years ago

Interesting, this must be a multiple-namenode (high-availability) configuration. Your default FS is "nameservice1", which is referenced through to namenode53 and namenode941, so the actual locations are hostname.allstate.com:8020 and hostname2.allstate.com:8020.

This is similar to the High-Availability discussion in http://hdfs3.readthedocs.io/en/latest/hdfs.html#high-availability-mode .

OK, so that's all very interesting, but now that I'm looking through the code, I can't actually see where the port value is used (aside from my own hdfs3 additions, which are optional). Maybe the YARN and HDFS clients can be assumed to read the parameters correctly on the Scala side, and none of this matters? What happens if you suppress all the file-finding and conversion exceptions, so that the port is still None when calling into the Java gateway? What happens if you explicitly pass nn_port=8020?

quartox commented 7 years ago

Passing nn_port=8020 clears the config-reading error but brings back the kerberos error. I was hoping the config change would fix the kerberos error, but apparently not.

I still can't figure out how it is able to upload files but then fails to run python on the worker. The upload must be using the kerberos ticket while the application is not.
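One quick way to confirm that the client side at least has a valid ticket (so the upload path is explained) is to query the credential cache directly. This is a hypothetical helper, not part of knit; it relies on the standard MIT Kerberos `klist -s` silent-check flag:

```python
import shutil
import subprocess

def has_kerberos_ticket():
    """Return True if the credential cache holds a valid ticket,
    False if not, or None when klist is unavailable.

    Sketch only: `klist -s` exits 0 when a valid (non-expired)
    ticket exists, which is MIT Kerberos behaviour; Heimdal's
    klist behaves similarly but is not tested here.
    """
    if shutil.which("klist") is None:
        return None
    result = subprocess.run(["klist", "-s"], capture_output=True)
    return result.returncode == 0
```

If this returns True while the worker containers still fail, that points the finger at token delegation inside YARN rather than at the client's ticket.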

martindurant commented 7 years ago

I note that it is the dask worker that is giving the error: the yarn application master must have launched successfully to have created a container, but the container doesn't have credentials. Code snippets like this suggest that the ApplicationMaster must pass in the HDFS (delegation) token, but we are not doing so (cf. https://github.com/dask/knit/blob/master/knit_jvm/src/main/scala/io/continuum/knit/ApplicationMaster.scala#L185 ). Since I don't have a kerberized setup to play with at the moment, I am shooting in the dark; my colleague assures me that the kerberos auth was working previously (or still is for other setups).

quasiben commented 7 years ago

@martindurant, when I last tested with kerberos, things generally worked; however, that testing was naively done. I would suggest looking at the spark codebase -- specifically, around UGI and keytabs.

Steve Loughran's gitbook on kerberos is also a really good resource.

Probably the best thing to do is to set up a kerberized cluster again.

quasiben commented 7 years ago

@quartox, we may be able to glean a bit more information if you run with a few more debug options:

export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true -Dsun.security.krb5.debug=true ${HADOOP_OPTS}"

martindurant commented 7 years ago

@quasiben , thanks for chiming in, I was going to ask you separately. Let's table setting up a cluster until after the busy two weeks ahead, and perhaps @quartox hits on a solution in the meantime.

quartox commented 7 years ago

@quasiben @martindurant would you mind sending me contact info so I can share the debug output offline? My email is on my github home page.

quartox commented 7 years ago

One detail from Steve Loughran's gitbook: our kerberos cluster uses Microsoft Active Directory, and as he puts it: "For Developers, it is worth knowing that AD is subtly different from the MIT/Unix Kerberos controller, enough so that you should really test with a cluster using AD as the Kerbers infrastructure, alongside the MIT KDC."

quartox commented 7 years ago

Recap of where we are:

martindurant commented 7 years ago

That's a puzzling set of behaviours. I'll get back to you next week.

quartox commented 7 years ago

I confused myself again: setting HADOOP_CONF_DIR changes the logging level, which suppresses the output of knit.Client. The same behavior happens whether the environment variable is set or unset. I have amended the previous comment to reflect this.

I printed lots of additional details from the python end to understand why HADOOP_CONF_DIR was breaking the job; since it turns out not to break the job, this new detail isn't very informative, but I list it anyway in case we can learn anything from it.

The job successfully:

  1. Establishes the callback socket
  2. Starts the knit.Client with the following output (note that I added a print statement on the python side for the gateway_port, which appears after scala sends the gateway port):
    17/07/13 13:04:33 INFO knit.Client$: First line of main
    17/07/13 13:04:33 INFO knit.Client$: Started PythonGatewayServer on port 35585
    17/07/13 13:04:34 INFO knit.Client$: ClientConfig(127.0.0.1,40839)
    17/07/13 13:04:34 INFO knit.Client$: Communicating GatewayServer port to Python driver at 127.0.0.1:40839
    gateway_port 35585
    17/07/13 13:04:34 INFO knit.Client$: Staring Application Master
    17/07/13 13:04:40 INFO knit.Utils$: Setting Replication Factor to: 3
    17/07/13 13:04:40 INFO knit.Utils$: Attemping upload of /home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/java_libs/knit-1.0-SNAPSHOT.jar
    17/07/13 13:04:42 INFO knit.Client$: hdfs://nameservice1/user/jlord/.knitDeps/knit-1.0-SNAPSHOT.jar
    17/07/13 13:04:43 INFO knit.Utils$: Attemping upload of /home/jlord/.conda/envs/dask/lib/python3.5/site-packages/knit/tmp_conda/miniconda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip
    17/07/13 13:04:51 INFO knit.Client$: Submitting application application_1498411385005_54814
    17/07/13 13:04:52 INFO impl.YarnClientImpl: Submitted application application_1498411385005_54814
  3. Start the JavaGateway for the client.
  4. Get the app_id from the gateway entry point start method.
  5. Get the master_rpcport and master_rpchost.
  6. Start the JavaGateway for the master.
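Steps 1 and 2 above can be sketched roughly as follows. This is a toy illustration with a mock JVM side standing in for knit.Client; the real py4j wire protocol and knit's actual callback format will differ:

```python
import socket
import threading

def mock_jvm_side(host, port, gateway_port):
    # Stands in for knit.Client: connect back to the Python callback
    # socket and report the PythonGatewayServer port as text.
    with socket.create_connection((host, port)) as s:
        s.sendall(str(gateway_port).encode())

def wait_for_gateway_port(gateway_port=35585):
    # Python opens a callback socket on an ephemeral port (like
    # 40839 in the log above); the JVM connects back and tells us
    # which port its gateway server is listening on.
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    host, port = server.getsockname()

    t = threading.Thread(target=mock_jvm_side, args=(host, port, gateway_port))
    t.start()

    conn, _ = server.accept()
    data = conn.recv(1024).decode()
    conn.close()
    server.close()
    t.join()
    return int(data)

print("gateway_port", wait_for_gateway_port())
# gateway_port 35585
```

In the real flow, the returned port is then used to open the py4j JavaGateway (step 3) and call into the entry point for the app_id and master RPC details (steps 4-6).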
kpweiler commented 7 years ago

I'm having the same issue (with an AD cluster). I agree that the logs imply that the ApplicationMaster is not providing delegation tokens to the container(s). I don't think I know enough about Java/Scala to be much help in solving the problem, but I'm more than willing to test or try to provide more context and logs.

martindurant commented 7 years ago

I have finally merged https://github.com/dask/knit/pull/66 and will be attempting experiments in the coming days. Please join me over there, and I hope we can converge on a solution.