big-data-europe / docker-hive


Spark app can't connect to HDFS: RPC response exceeds maximum data length #32

Closed MattCoachCarter closed 4 years ago

MattCoachCarter commented 4 years ago

I'm trying to run a Spark app that connects to HDFS using the docker-compose in this repo (which I have modified). The Spark container I am using is, I believe, able to connect to the HDFS container, but it receives an RPC error soon after:

I've tried a handful of things with no success, and I was wondering if anyone has an idea of how I can troubleshoot this:

java.io.IOException: Failed on local exception: org.apache.hadoop.ipc.RpcException: RPC response exceeds maximum data length; Host Details : local host is: "sparkmaster/172.18.13.9"; destination host is: "datanode":50075;
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:808)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1495)
    at org.apache.hadoop.ipc.Client.call(Client.java:1437)
    at org.apache.hadoop.ipc.Client.call(Client.java:1347)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy18.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:874)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy19.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1697)
    at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1491)
    at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1488)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1503)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1668)
    at com.mastercard.ess.schema.SchemaRegistry$.addSchema(SchemaRegistry.scala:45)
    at com.mastercard.ess.inputs.builder.KafkaInputBuilder$.build(KafkaInputBuilder.scala:20)
    at com.mastercard.ess.inputs.InputRegistrar$.registerInput(InputRegistrar.scala:22)
    at com.mastercard.ess.jobs.JobRegistrationManager$$anonfun$registerFromJson$1.apply(JobRegistrationManager.scala:197)
    at com.mastercard.ess.jobs.JobRegistrationManager$$anonfun$registerFromJson$1.apply(JobRegistrationManager.scala:179)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at com.mastercard.ess.jobs.JobRegistrationManager$.registerFromJson(JobRegistrationManager.scala:179)
    at com.mastercard.ess.jobs.JobRegistrationManager$.registerFromHDFSJson(JobRegistrationManager.scala:153)
    at com.mastercard.ess.jobs.JobRegistrationManager$.registerJobsAndMonitorChanges(JobRegistrationManager.scala:61)
    at com.mastercard.ess.StreamKafka$.main(StreamKafka.scala:45)
    at com.mastercard.ess.StreamKafka.main(StreamKafka.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hadoop.ipc.RpcException: RPC response exceeds maximum data length
    at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1810)
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1165)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1061)

MattCoachCarter commented 4 years ago

I was using the wrong port, sorry!
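
For anyone landing here with the same trace: the destination in the error is "datanode":50075, which is the datanode's HTTP port, and "RPC response exceeds maximum data length" usually means an HDFS RPC client is talking to an HTTP endpoint instead of the namenode's RPC endpoint. Below is a minimal, hedged sketch of pointing Spark at the namenode RPC address instead, assuming fs.defaultFS is hdfs://namenode:8020 as in this repo's hadoop-hive.env and that the Spark container is on the same Docker network; adjust the host, port, and paths to your setup.

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Sketch only: point the HDFS client at the namenode's RPC address,
# not a datanode HTTP port. Host/port assume this repo's defaults
# (fs.defaultFS = hdfs://namenode:8020); adjust to your compose file.
conf = SparkConf().set("spark.hadoop.fs.defaultFS", "hdfs://namenode:8020")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Paths without an explicit scheme now resolve against the namenode.
# "/tmp/rpc_check" is a hypothetical path used purely for illustration.
df = spark.range(5)
df.write.mode("overwrite").parquet("/tmp/rpc_check")
print(spark.read.parquet("/tmp/rpc_check").count())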

staftermath commented 4 years ago

Hi, can you help with setting up the Spark config to connect to Hive using these containers? I added the following to the Spark config:

from pyspark import SparkConf
from pyspark.sql import SparkSession

spark_config = {
    "spark.hive.metastore.uris": "thrift://localhost:9083",
    "spark.hadoop.dfs.namenode.http-address": "webhdfs://localhost:50070"
}
spark_conf = SparkConf()
for attribute, value in spark_config.items():
    spark_conf.set(attribute, value)
spark = SparkSession.builder.config(conf=spark_conf).enableHiveSupport().getOrCreate()

I am able to connect to the metastore properly, since I can run

spark.sql("describe test_db.test_table").printSchema()

where test_db.test_table is a table I created directly through Hive. However, when I attempt to select the contents of the table (spark.sql("select * from test_db.test_table").show()), it gives me the following error:

pyspark.sql.utils.IllegalArgumentException: java.net.UnknownHostException: namenode

Not sure what I am missing here. Thanks in advance for the help.
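
Not the fix from the thread, just a hedged sketch of one common workaround for the UnknownHostException: the metastore hands back table locations such as hdfs://namenode:8020/..., so the hostname namenode has to resolve from wherever the Spark driver runs. With /etc/hosts entries mapping namenode (and the datanode hostname) to 127.0.0.1 and the corresponding ports published by docker-compose, a config along these lines can work; the property names below are standard Hadoop/Spark settings, but treat the whole snippet as an assumption rather than a verified answer for these containers.

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Hedged sketch, not a confirmed fix for this thread: reuse the metastore URI
# from the question and point fs.defaultFS at the namenode's RPC address
# (assumed here to be hdfs://namenode:8020, per this repo's hadoop-hive.env).
conf = (SparkConf()
        .set("spark.hive.metastore.uris", "thrift://localhost:9083")
        .set("spark.hadoop.fs.defaultFS", "hdfs://namenode:8020")
        # Datanodes register with container-internal IPs; asking the client to
        # dial them by hostname lets /etc/hosts entries on the host take over.
        .set("spark.hadoop.dfs.client.use.datanode.hostname", "true"))
spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
spark.sql("select * from test_db.test_table").show()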