CODAIT / spark-netezza

Netezza Connector for Apache Spark
Apache License 2.0

java exception when showing join #5

Open webe3 opened 8 years ago

webe3 commented 8 years ago

I am using pyspark with Netezza. I get a Java exception when trying to show the first row of a join. I can show the first row of each of the two dataframes separately, but not the result of a join. I get the same error for any action I take (first, collect, show). Joins work fine on non-Netezza dataframes. I get the same or a similar error when using withColumn.

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

dispute_df = sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://**:5480/db', user='_', password='_', dbtable='table1', driver='com.ibm.spark.netezza').load()
dispute_df.printSchema()

comments_df = sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://**:5480/db', user='_', password='_', dbtable='table2', driver='com.ibm.spark.netezza').load()
comments_df.printSchema()

dispute_df.join(comments_df, dispute_df.COMMENTID == comments_df.COMMENTID).first()

root
 |-- COMMENTID: string (nullable = true)
 |-- EXPORTDATETIME: timestamp (nullable = true)
 |-- ARTAGS: string (nullable = true)
 |-- POTAGS: string (nullable = true)
 |-- INVTAG: string (nullable = true)
 |-- ACTIONTAG: string (nullable = true)
 |-- DISPUTEFLAG: string (nullable = true)
 |-- ACTIONFLAG: string (nullable = true)
 |-- CUSTOMFLAG1: string (nullable = true)
 |-- CUSTOMFLAG2: string (nullable = true)

root
 |-- COUNTRY: string (nullable = true)
 |-- CUSTOMER: string (nullable = true)
 |-- INVNUMBER: string (nullable = true)
 |-- INVSEQNUMBER: string (nullable = true)
 |-- LEDGERCODE: string (nullable = true)
 |-- COMMENTTEXT: string (nullable = true)
 |-- COMMENTTIMESTAMP: timestamp (nullable = true)
 |-- COMMENTLENGTH: long (nullable = true)
 |-- FREEINDEX: long (nullable = true)
 |-- COMPLETEDFLAG: long (nullable = true)
 |-- ACTIONFLAG: long (nullable = true)
 |-- FREETEXT: string (nullable = true)
 |-- USERNAME: string (nullable = true)
 |-- ACTION: string (nullable = true)
 |-- COMMENTID: string (nullable = true)


Py4JJavaError Traceback (most recent call last)

in ()
      5 comments_df = sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://**_:5480/db', user='**_', password='***', dbtable='table2', driver='com.ibm.spark.netezza').load()
      6 comments_df.printSchema()
----> 7 dispute_df.join(comments_df, dispute_df.COMMENTID == comments_df.COMMENTID).first()

/usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in first(self)
    802         Row(age=2, name=u'Alice')
    803         """
--> 804         return self.head()
    805
    806     @ignore_unicode_prefix

/usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in head(self, n)
    790         """
    791         if n is None:
--> 792             rs = self.head(1)
    793             return rs[0] if rs else None
    794         return self.take(n)

/usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in head(self, n)
    792             rs = self.head(1)
    793             return rs[0] if rs else None
--> 794         return self.take(n)
    795
    796     @ignore_unicode_prefix

/usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in take(self, num)
    304         with SCCallSiteSync(self._sc) as css:
    305             port = self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(
--> 306                 self._jdf, num)
    307         return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
    308

/usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814
    815         for temp_arg in temp_args:

/usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 59.0 failed 1 times, most recent failure: Lost task 2.0 in stage 59.0 (TID 1406, localhost): java.io.IOException: EOF whilst processing escape sequence
    at org.apache.commons.csv.Lexer.readEscape(Lexer.java:346)
    at org.apache.commons.csv.Lexer.parseSimpleToken(Lexer.java:200)
    at org.apache.commons.csv.Lexer.nextToken(Lexer.java:161)
    at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
    at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
    at com.ibm.spark.netezza.NetezzaRecordParser.parse(NetezzaRecordParser.scala:43)
    at com.ibm.spark.netezza.NetezzaDataReader.next(NetezzaDataReader.scala:136)
    at com.ibm.spark.netezza.NetezzaRDD$$anon$1.getNext(NetezzaRDD.scala:77)
    at com.ibm.spark.netezza.NetezzaRDD$$anon$1.hasNext(NetezzaRDD.scala:106)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1143)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:618)
    at java.lang.Thread.run(Thread.java:785)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at java.lang.Thread.getStackTrace(Thread.java:1117)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
    at org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply$mcI$sp(python.scala:126)
    at org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply(python.scala:124)
    at org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply(python.scala:124)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
    at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
    at org.apache.spark.sql.execution.EvaluatePython$.takeAndServe(python.scala:124)
    at org.apache.spark.sql.execution.EvaluatePython.takeAndServe(python.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:95)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
    at java.lang.reflect.Method.invoke(Method.java:507)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:785)
Caused by: java.io.IOException: EOF whilst processing escape sequence
    at org.apache.commons.csv.Lexer.readEscape(Lexer.java:346)
    at org.apache.commons.csv.Lexer.parseSimpleToken(Lexer.java:200)
    at org.apache.commons.csv.Lexer.nextToken(Lexer.java:161)
    at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
    at org.apache.commons.csv.CSVParser.getRecords(CSVParser.java:365)
    at com.ibm.spark.netezza.NetezzaRecordParser.parse(NetezzaRecordParser.scala:43)
    at com.ibm.spark.netezza.NetezzaDataReader.next(NetezzaDataReader.scala:136)
    at com.ibm.spark.netezza.NetezzaRDD$$anon$1.getNext(NetezzaRDD.scala:77)
    at com.ibm.spark.netezza.NetezzaRDD$$anon$1.hasNext(NetezzaRDD.scala:106)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1143)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:618)
    ... 1 more
sureshthalamati commented 8 years ago

Thanks for reporting this issue. By any chance, are there any newlines in the data of these tables? I am not able to tell much from the error. Do you mind trying the join with columns that are not likely to contain any special control characters?

webe3 commented 8 years ago

I have narrowed down which field causes the problem. If I select only the fields other than the one that causes the problem and then do the join, the join works. The problematic field may contain special characters. How can we figure out which control characters are causing the problem and work around them? The stack trace looks like it goes through CSV code. Does spark-netezza create a CSV file during processing? If so, is there a way to capture the CSV file and look at it?
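Roughly, the narrowing-down looked like the sketch below (column names come from the schemas above; 'COMMENTTEXT' is only an example of the column to drop, the real culprit may be a different field):

    # Probe each column of the comments table separately; the read that fails
    # points at the field whose data breaks the CSV parsing.
    for col_name in comments_df.columns:
        try:
            comments_df.select(col_name).first()
            print(col_name + ' OK')
        except Exception as e:
            print(col_name + ' FAILED: ' + str(e))

    # Workaround: drop the offending column before the join
    # ('COMMENTTEXT' is only an example here).
    safe_comments_df = comments_df.drop('COMMENTTEXT')
    dispute_df.join(safe_comments_df,
                    dispute_df.COMMENTID == safe_comments_df.COMMENTID).first()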

webe3 commented 8 years ago

Almost all of my Netezza tables have at least one field with newlines in the data. The fact that spark-netezza can't handle those fields makes it almost unusable for me. I am willing to work with someone to try to find a solution for this limitation.
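A possible stopgap (not verified here) is Spark's generic JDBC data source, which bypasses the external-table/CSV transfer entirely at the cost of much slower, single-stream reads. A minimal sketch, assuming the standard org.netezza.Driver class and the same placeholder connection values as above:

    # Generic JDBC read -- no Netezza external tables or named pipes involved.
    comments_jdbc_df = sqlContext.read.format('jdbc').options(
        url='jdbc:netezza://**:5480/db',
        user='_',
        password='_',
        dbtable='table2',
        driver='org.netezza.Driver'  # assumption: standard Netezza JDBC driver class
    ).load()
    comments_jdbc_df.first()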

sureshthalamati commented 8 years ago

Thanks for the input. Unfortunately, the newline limitation comes from the Netezza external table mechanism. There are some external table options for control characters (CRinString, CtrlChars). I plan to expose these options in the next version.
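To illustrate what exposing them might look like (purely hypothetical option names; the current release does not accept these):

    # Hypothetical options -- NOT available in the released connector yet.
    df = sqlContext.read.format('com.ibm.spark.netezza').options(
        url='jdbc:netezza://**:5480/db',
        user='_',
        password='_',
        dbtable='table1',
        ctrlchars='true',    # hypothetical: pass CtrlChars to the external table
        crinstring='true'    # hypothetical: pass CRinString to the external table
    ).load()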

If you have any ideas to solve this problem, I will be happy to work with you.

Which version of Netezza are you using?

sureshthalamati commented 8 years ago

> Does spark-netezza create a csv file during the processing? If so, is there a way to capture the csv file and look at it?

The format is CSV, and the data is transferred through named pipes. I cannot think of an easy way to capture the data that has the special characters.
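If the generic Spark JDBC source can read that column (it does not go through the pipe), something like the sketch below could list the control characters present in the data. All table, column, and connection values are placeholders:

    # Read the suspect column via the generic JDBC source and report any
    # control characters found in its values. Names below are placeholders.
    suspect_df = sqlContext.read.format('jdbc').options(
        url='jdbc:netezza://**:5480/db',
        user='_',
        password='_',
        dbtable='table2',
        driver='org.netezza.Driver').load()

    def control_chars(value):
        if value is None:
            return []
        return sorted(set(ch for ch in value if ord(ch) < 32))

    flagged = (suspect_df.select('COMMENTID', 'COMMENTTEXT')
               .rdd
               .map(lambda row: (row[0], control_chars(row[1])))
               .filter(lambda pair: len(pair[1]) > 0))
    for comment_id, chars in flagged.take(20):
        print(comment_id + ' -> ' + repr(chars))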

Nomii5007 commented 8 years ago

@sureshthalamati Sir, maybe this is irrelevant to this post, but I am getting the same error. I am following this example tutorial.

I am using Spark 1.5.1 and the Python 3.5 Anaconda distribution. My code was running fine until I reached the 7th cell: pd.DataFrame(CV_data.take(5), columns=CV_data.columns)

The error is:

Py4JJavaError Traceback (most recent call last)
<ipython-input-10-d3dfeab0b119> in <module>()
----> 1 pd.DataFrame(CV_data.take(5), columns=CV_data.columns)
C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\pyspark\sql\dataframe.py in take(self, num)
303 with SCCallSiteSync(self._sc) as css:
304 port = self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(
--> 305 self._jdf, num)
306 return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
307
C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539 
540 for temp_arg in temp_args:
C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\pyspark\sql\utils.py in deco(*a, **kw)
34 def deco(*a, **kw):
35 try:
---> 36 return f(*a, **kw)
37 except py4j.protocol.Py4JJavaError as e:
38 s = e.java_exception.toString()
C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 18, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main
File "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process
File "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\pyspark\sql\functions.py", line 1417, in <lambda>
func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), it)
File "<ipython-input-7-6db2287430d4>", line 5, in <lambda>
KeyError: False
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:397)
at org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:362)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:215)
at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207)
at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1314)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1377)
at org.apache.spark.sql.execution.EvaluatePython$.takeAndServe(python.scala:127)
at org.apache.spark.sql.execution.EvaluatePython.takeAndServe(python.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main
File "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process
File "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "C:\Users\InAm-Ur-Rehman\spark-1.5.1-bin-hadoop2.6\python\pyspark\sql\functions.py", line 1417, in <lambda>
func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), it)
File "<ipython-input-7-6db2287430d4>", line 5, in <lambda>
KeyError: False
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:397)
at org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:362)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
sureshthalamati commented 8 years ago

@Nomii5007 From the web page, it looks like you are not using this package, and I also did not find anything related to it in the stack trace. I am not very familiar with Python. Can you please post the problem to the Spark user group?

Nomii5007 commented 8 years ago

@sureshthalamati Sir, which package are you talking about? And where can I find the Spark user group? I am new to this technology, so I don't know much about it yet.