asfimport closed this issue 3 years ago.
Arvind Ravish: I get the same thing when running the repro code in Databricks. Can we get some description about what the error means?
The plan looks like this:

== Physical Plan ==
*(7) HashAggregate(keys=[], functions=[finalmerge_count(merge count#506L) AS count(1)#502L])
+- Exchange SinglePartition
   +- *(6) HashAggregate(keys=[], functions=[partial_count(1) AS count#506L])
      +- *(6) Project
         +- FlatMapGroupsInPandas [df1_c1#72L], myudf(df1_c1#72L, df1_c2#73, df1_c3#74, df1_c4#75, df2_c1#108L, df2_c2#109, df2_c3#110, df2_c4#111, df2_c5#112, df2_c6#113), [df1_c1#264L, df1_c2#265, df1_c3#266, df1_c4#267, df2_c1#268L, df2_c2#269, df2_c3#270, df2_c4#271, df2_c5#272, df2_c6#273]
            +- *(5) Project [df1_c1#72L, df1_c1#72L, df1_c2#73, df1_c3#74, df1_c4#75, df2_c1#108L, df2_c2#109, df2_c3#110, df2_c4#111, df2_c5#112, df2_c6#113]
               +- *(5) SortMergeJoin [df1_c1#72L], [df2_c1#108L], Inner
                  :- *(2) Sort [df1_c1#72L ASC NULLS FIRST], false, 0
                  :  +- Exchange hashpartitioning(df1_c1#72L, 200)
                  :     +- *(1) Project [df1_c1#72L, df1_c2#73, df1_c3#74, df1_c4#75]
                  :        +- *(1) Filter isnotnull(df1_c1#72L)
                  :           +- *(1) Scan ExistingRDD[index#71L,df1_c1#72L,df1_c2#73,df1_c3#74,df1_c4#75]
                  +- *(4) Sort [df2_c1#108L ASC NULLS FIRST], false, 0
                     +- Exchange hashpartitioning(df2_c1#108L, 200)
                        +- *(3) Project [df2_c1#108L, df2_c2#109, df2_c3#110, df2_c4#111, df2_c5#112, df2_c6#113]
                           +- *(3) Filter isnotnull(df2_c1#108L)
                              +- *(3) Scan ExistingRDD[index#107L,df2_c1#108L,df2_c2#109,df2_c3#110,df2_c4#111,df2_c5#112,df2_c6#113]
An error occurred while calling o1471.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 93 in stage 39.0 failed 4 times, most recent failure: Lost task 93.3 in stage 39.0 (TID 1261, 10.139.64.6, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 403, in main
    process()
  File "/databricks/spark/python/pyspark/worker.py", line 398, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/databricks/spark/python/pyspark/serializers.py", line 296, in dump_stream
    for series in iterator:
  File "/databricks/spark/python/pyspark/serializers.py", line 319, in load_stream
    for batch in generator():
  File "/databricks/spark/python/pyspark/serializers.py", line 314, in generator
    for batch in reader:
  File "pyarrow/ipc.pxi", line 268, in __iter__ (/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:70278)
  File "pyarrow/ipc.pxi", line 284, in pyarrow.lib._RecordBatchReader.read_next_batch (/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:70534)
  File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:8345)
pyarrow.lib.ArrowIOError: read length must be positive or -1

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:490)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:444)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:634)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
	at org.apache.spark.scheduler.Task.run(Task.scala:112)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Micah Kornfield / @emkornfield: This seems to indicate that the stream of data being passed to Arrow isn't in the correct format (or Arrow is misinterpreting it).
SURESH CHAGANTI: [~emkornfield@gmail.com] is there any size limit on how much we can send to a pandas_udf? I am also seeing the same error as above; my groups are pretty large, around 200M records and about 2 to 4 GB in size.
Micah Kornfield / @emkornfield: Yes. I believe it is 2GB per shard currently.
SURESH CHAGANTI: got it, thank you [~emkornfield@gmail.com]. Is there any way we can increase that size?
I am assuming the data that gets sent to the pandas_udf method is in uncompressed format.
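Assuming the uncompressed representation is what matters, a back-of-the-envelope check of whether a single group fits under a 2GB buffer is just row count times bytes per row. A minimal sketch; the limit constant and the per-column byte widths below are illustrative assumptions, not values read out of Arrow:

```python
# Rough check of whether one group stays under an assumed 2GB Arrow buffer cap.
ARROW_BUF_LIMIT = (1 << 31) - 1  # signed 32-bit max: 2,147,483,647 bytes

def estimated_group_bytes(n_rows, column_byte_widths):
    """Uncompressed size estimate: sum of fixed column widths per row."""
    return n_rows * sum(column_byte_widths)

def fits_under_limit(n_rows, column_byte_widths, limit=ARROW_BUF_LIMIT):
    return estimated_group_bytes(n_rows, column_byte_widths) <= limit

# A 200M-row group with one int64 and one float64 column is ~3.2 GB:
print(estimated_group_bytes(200_000_000, [8, 8]))   # 3200000000
print(fits_under_limit(200_000_000, [8, 8]))        # False
```

This lines up with the sizes reported above: 200M records at 2 to 4 GB per group is comfortably past a 2GB cap.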
SURESH CHAGANTI: [~AbdealiJK] did you find any fix or workaround for this error?
Abdeali Kothari: No, I had to stop using pandas UDFs due to this and find another approach for the transformations I need to do.
SURESH CHAGANTI: Thank you [~AbdealiJK]. May I ask what other approach you are using?
Abdeali Kothari: We changed our pipeline to do it with joins and explodes, using Spark's built-in functions.
It was a while back; I don't remember the exact specifics of that pipeline.
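The shape of that workaround can be sketched in plain Python (hypothetical data, not the actual pipeline): a grouped apply must buffer every row of a group before the function runs, which is the pattern that pushes a whole group through one Arrow transfer, while a built-in/streaming formulation touches one row at a time, so no single group ever has to be materialized:

```python
from itertools import groupby
from operator import itemgetter

rows = [("a", 1), ("b", 3), ("a", 2), ("b", 4)]

def grouped_sum(rows):
    """Group-then-apply: buffers each whole group before computing."""
    out = {}
    for key, grp in groupby(sorted(rows), key=itemgetter(0)):
        group = list(grp)  # the entire group is held in memory here
        out[key] = sum(v for _, v in group)
    return out

def streaming_sum(rows):
    """Row-at-a-time equivalent: constant state per key, no group buffer."""
    out = {}
    for key, value in rows:
        out[key] = out.get(key, 0) + value
    return out

assert grouped_sum(rows) == streaming_sum(rows) == {"a": 3, "b": 7}
```

When a transformation can be rephrased this way with Spark's column functions, the per-group Arrow size limit simply never comes into play.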
Micah Kornfield / @emkornfield: "I am assuming the data that gets sent to pandas_udf method is in the uncompressed format", yes I believe this to be true but this isn't really the main limitation.
Currently, ArrowBufs (the components that hold memory) are limited to less than 2GB each. I need to clean up https://github.com/apache/arrow/pull/5020 to address this (and other limitations). I might actually have mis-stated the per-shard limits for toPandas functions.
[~cutlerb] do you know what the actual limits are? I can't seem to find any documentation on them.
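Taking the signed 32-bit cap described above as given, the number of fixed-width values that one buffer can hold is simple arithmetic (a sketch for intuition, not Arrow's actual allocation logic):

```python
BUF_LIMIT = (1 << 31) - 1  # assumed per-ArrowBuf cap: 2,147,483,647 bytes

def max_values_per_buf(value_width_bytes, limit=BUF_LIMIT):
    """How many fixed-width values fit in a single buffer under the cap."""
    return limit // value_width_bytes

print(max_values_per_buf(8))  # int64/float64 column: 268435455 values
print(max_values_per_buf(1))  # data buffer of a string column: 2147483647 bytes
```

So even an all-numeric 200M-row group squeaks under per-column, but the concatenated bytes of a variable-width string column cross the cap much sooner.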
SURESH CHAGANTI: thank you [~emkornfield@gmail.com], I appreciate your time. I have run multiple tests with different data sizes, and any shard larger than 2GB failed. Glad to see the fix is in progress; it would be great if this change rolls out soon. I would be happy to contribute to this issue.
Bryan Cutler / @BryanCutler: Sorry, I'm not sure of any documentation with the limits. It would be great to get that down somewhere and there should be a better error message for this, but maybe it should be done on the Spark side.
Micah Kornfield / @emkornfield: I agree, it should probably be on the Spark side (assuming the root cause is hitting caps in Arrow).
SURESH CHAGANTI: sure, I will create an issue against Spark, thank you! For now, to work around the issue, I will build the code with https://github.com/apache/arrow/pull/5020 and see what happens. Thank you [~emkornfield@gmail.com] & @BryanCutler
SURESH CHAGANTI: I guess this branch addresses the 2 GB issue: https://github.com/emkornfield/arrow/tree/int64_address. Could you please confirm? Thank you.
Micah Kornfield / @emkornfield: Yes that is the one. It hasn't been tested too much yet, but if you have time to check to see if it works for you that would be great.
SURESH CHAGANTI: thank you. I will keep you guys posted on my testing.
Micah Kornfield / @emkornfield: There have been a few PRs checked in, but the full end-to-end IPC path has not been tested yet. CC [~fan_li_ya]
Liya Fan / @liyafan82: Sure. [~emkornfield@gmail.com] is right. After [~emkornfield@gmail.com] has finished the implementation of the 64-bit buffer, we have a few follow-up work items to do before we can claim that the 2GB restriction is removed.
Ruslan Dautkhanov: [~fan_li_ya] it looks like all the follow-up JIRAs you mentioned are now resolved in Arrow 1.0. Does this automatically resolve this JIRA, or are there follow-ups left?
I think Netty still has a 2GB limit? Would we still be running into that limit too? Thanks.
java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))
at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)
at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)
at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)
at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)
at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1028)
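Both symptoms in this thread ("read length must be positive or -1" and "negative bodyLength") are consistent with a 64-bit length being squeezed through signed 32-bit arithmetic somewhere on the path: once a length exceeds 2^31 - 1 bytes, reinterpreting it as a signed 32-bit value makes it negative. A plain-Python illustration of that truncation (the exact overflow site inside Arrow/Netty is not shown here):

```python
def as_signed_int32(x):
    """Reinterpret the low 32 bits of x as a signed 32-bit integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

body_length = 3_000_000_000           # ~2.8 GB message body, fine as int64
print(as_signed_int32(body_length))   # -1294967296: a "negative bodyLength"
print(as_signed_int32(2**31 - 1))     # 2147483647: the largest safe length
```

Any consumer that validates the length then rejects the negative value, which is exactly what the Python-side errors report.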
Micah Kornfield / @emkornfield: [~Tagar] this doesn't automatically resolve the JIRA, but I think most of the work in Arrow is taken care of as of 1.0. I would guess there is likely still work to do in Spark, and there might be a few more bugs to work out.
We've introduced an alternative allocator that isn't based on Netty.
Ruslan Dautkhanov: Thanks for the details [~emkornfield@gmail.com]! Great to know the issue with the Netty allocator and some other related issues are now resolved. Understood that there might be other things to complete on the Arrow side to lift the 2GB limitation. Created https://issues.apache.org/jira/browse/SPARK-32294 for the Spark side.
Dmitry Kravchuk: I've tried pyarrow 2.0.0 on Spark 2.4.4 and still get the familiar error: "OSError: Invalid IPC message: negative bodyLength".
Is there any news on resolving this issue?
Liya Fan / @liyafan82: [~dishka_krauch] It is difficult to find the cause without more details about the problem. The possible reasons that come to mind include:
Dmitry Kravchuk: [~fan_li_ya] I just used the code from the description of this issue on my Hadoop cluster.
Which details do you need to look into the problem? I can give you everything, just let me know.
Liya Fan / @liyafan82: [~dishka_krauch] The problem occurred during IPC from Python to Java? I think the stack traces and some other logs would be helpful.
Dmitry Kravchuk: [~fan_li_ya] okay, here we go.
Spark version: 2.4.4
Python env:
wheel (0.36.2)
I've tested many pyarrow versions with two functions. The first returns a 172 MB dataset:
import pyspark
from pyspark.sql import functions as F, types as T
import pandas as pd

def analyze(spark, job_args, configs):
    pdf1 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
        columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
    )
    df1 = spark.createDataFrame(
        pd.concat([pdf1 for i in range(429)]).reset_index()).drop('index')
    pdf2 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", "abcdefghijklmno"]],
        columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
    )
    df2 = spark.createDataFrame(
        pd.concat([pdf2 for i in range(4899)]).reset_index()).drop('index')
    df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')

    def myudf(df):
        return df

    df4 = df3
    udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)
    df5 = df4.groupBy('df1_c1').apply(udf)
    print('df5.count()', df5.count())
and the second returns a 1.72 GB dataset using the same pandas_udf pattern:
import pyspark
from pyspark.sql import functions as F, types as T
import pandas as pd

def analyze(spark, job_args, configs):
    pdf1 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
        columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
    )
    df1 = spark.createDataFrame(
        pd.concat([pdf1 for i in range(429)]).reset_index()).drop('index')
    pdf2 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", "abcdefghijklmno"]],
        columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
    )
    df2 = spark.createDataFrame(
        pd.concat([pdf2 for i in range(48993)]).reset_index()).drop('index')
    df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')

    def myudf(df):
        return df

    df4 = df3
    udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)
    df5 = df4.groupBy('df1_c1').apply(udf)
    print('df5.count()', df5.count())
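Note that every row in this repro shares the same join key (1234567), so the groupBy produces exactly one group, and the inner join multiplies the row counts. With a rough per-row width (an assumption for illustration, not a measured Arrow size), that single group lands right around the quoted 1.72 GB that has to travel to Python in one shot:

```python
# Why the second script fails: all join keys match, so the join yields
# n_df1 * n_df2 rows, and they all fall into ONE pandas_udf group.
n_df1, n_df2 = 429, 48993
joined_rows = n_df1 * n_df2
print(joined_rows)  # 21017997 rows in a single group

# Assumed per-row width: two int64, two float64, plus several short strings.
approx_bytes_per_row = 88
approx_group_bytes = joined_rows * approx_bytes_per_row
print(approx_group_bytes / 2**30)  # roughly 1.7 GiB in one group
```

The 172 MB variant (4899 instead of 48993 copies) produces the same single-group structure at a tenth of the size, which is why only the larger run trips the length errors.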
You can find the detail logs after this table.
Dataset size | pyarrow version | result | stderr | detail log
---|---|---|---|---
172 MB | 0.11.1 | success | df5.count() 2101671 | 1
172 MB | 0.12.0 | success | df5.count() 2101671 | 1
172 MB | 0.12.1 | success | df5.count() 2101671 | 1
172 MB | 0.13.0 | success | df5.count() 2101671 | 1
172 MB | 0.14.0 | success | df5.count() 2101671 | 1
172 MB | 0.14.1 | success | df5.count() 2101671 | 1
172 MB | 0.15.0 | error | java.lang.IllegalArgumentException | 2
172 MB | 0.15.1 | error | java.lang.IllegalArgumentException | 2
172 MB | 0.16.0 | error | java.lang.IllegalArgumentException | 2
172 MB | 0.17.0 | error | java.lang.IllegalArgumentException | 2
172 MB | 0.17.1 | error | java.lang.IllegalArgumentException | 2
172 MB | 1.0.0 | error | java.lang.IllegalArgumentException | 2
172 MB | 1.0.1 | error | java.lang.IllegalArgumentException | 2
172 MB | 2.0.0 | error | java.lang.IllegalArgumentException | 2
1.72 GB | 0.11.1 | error | pyarrow.lib.ArrowIOError: read length must be positive or -1 | 3
1.72 GB | 0.12.0 | error | pyarrow.lib.ArrowIOError: read length must be positive or -1 | 3
1.72 GB | 0.12.1 | error | pyarrow.lib.ArrowIOError: read length must be positive or -1 | 3
1.72 GB | 0.13.0 | error | pyarrow.lib.ArrowIOError: read length must be positive or -1 | 3
1.72 GB | 0.14.0 | error | pyarrow.lib.ArrowIOError: read length must be positive or -1 | 3
1.72 GB | 0.14.1 | error | pyarrow.lib.ArrowIOError: read length must be positive or -1 | 3
1.72 GB | 0.15.0 | error | pyarrow.lib.ArrowIOError: read length must be positive or -1 | 3
1.72 GB | 0.15.1 | error | pyarrow.lib.ArrowIOError: read length must be positive or -1 | 3
1.72 GB | 0.16.0 | error | pyarrow.lib.ArrowIOError: read length must be positive or -1 | 3
1.72 GB | 0.17.0 | error | OSError: Invalid IPC message: negative bodyLength | 4
1.72 GB | 0.17.1 | error | OSError: Invalid IPC message: negative bodyLength | 4
1.72 GB | 1.0.0 | error | OSError: Invalid IPC message: negative bodyLength | 4
1.72 GB | 1.0.1 | error | OSError: Invalid IPC message: negative bodyLength | 4
1.72 GB | 2.0.0 | error | OSError: Invalid IPC message: negative bodyLength | 4
Detail logs:
1:
20/12/17 12:41:42 INFO SparkContext: Running Spark version 2.4.4
20/12/17 12:41:42 INFO SparkContext: Submitted application: temp
20/12/17 12:41:42 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 12:41:42 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 12:41:42 INFO SecurityManager: Changing view acls groups to:
20/12/17 12:41:42 INFO SecurityManager: Changing modify acls groups to:
20/12/17 12:41:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(zeppelin); groups with view permissions: Set(); users with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 12:41:43 INFO Utils: Successfully started service 'sparkDriver' on port 36190.
20/12/17 12:41:43 INFO SparkEnv: Registering MapOutputTracker
20/12/17 12:41:43 INFO SparkEnv: Registering BlockManagerMaster
20/12/17 12:41:43 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/12/17 12:41:43 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/12/17 12:41:43 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-b79cca55-c0c3-4afd-b0b7-3f88faab235c
20/12/17 12:41:43 INFO MemoryStore: MemoryStore started with capacity 8.4 GB
20/12/17 12:41:43 INFO SparkEnv: Registering OutputCommitCoordinator
20/12/17 12:41:43 INFO log: Logging initialized @2400ms
20/12/17 12:41:43 INFO Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/12/17 12:41:43 INFO Server: Started @2475ms
20/12/17 12:41:43 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/12/17 12:41:43 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
20/12/17 12:41:43 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
20/12/17 12:41:43 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
20/12/17 12:41:43 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
20/12/17 12:41:43 WARN Utils: Service 'SparkUI' could not bind on port 4045. Attempting port 4046.
20/12/17 12:41:43 INFO AbstractConnector: Started ServerConnector@1bc70483{HTTP/1.1,[http/1.1]}{0.0.0.0:4046}
20/12/17 12:41:43 INFO Utils: Successfully started service 'SparkUI' on port 4046.
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@758cb21d{/jobs,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@101c8ddc{/jobs/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@71843b82{/jobs/job,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4f0c10c0{/jobs/job/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3b8a1690{/stages,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6b826248{/stages/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1a2100a{/stages/stage,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3d3a8de8{/stages/stage/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2c89e805{/stages/pool,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3ecb4790{/stages/pool/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6a74cc86{/storage,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3c859ed{/storage/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@60e24051{/storage/rdd,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4f17c1a{/storage/rdd/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@482d80fe{/environment,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@9ebbf27{/environment/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7b798683{/executors,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7467245a{/executors/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@50696058{/executors/threadDump,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4afeff5{/executors/threadDump/json,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@33d53c7d{/static,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@104969b0{/,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@385ffa54{/api,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@25e21a77{/jobs/job/kill,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@267d103d{/stages/stage/kill,null,AVAILABLE,@Spark}
20/12/17 12:41:43 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://master2.host:4046
20/12/17 12:41:44 INFO RMProxy: Connecting to ResourceManager at master2.host/10.9.14.25:8050
20/12/17 12:41:44 INFO Client: Requesting a new application from cluster with 7 NodeManagers
20/12/17 12:41:44 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (618496 MB per container)
20/12/17 12:41:44 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
20/12/17 12:41:44 INFO Client: Setting up container launch context for our AM
20/12/17 12:41:44 INFO Client: Setting up the launch environment for our AM container
20/12/17 12:41:44 INFO Client: Preparing resources for our AM container
20/12/17 12:41:44 INFO Client: Source and destination file systems are the same. Not copying hdfs:/apps/spark2/jars/spark2-ADH-yarn-archive.tar.gz
20/12/17 12:41:44 INFO Client: Uploading resource file:/opt/deltalake/delta-core_2.11-0.5.0.jar -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1413/delta-core_2.11-0.5.0.jar
20/12/17 12:41:44 INFO Client: Uploading resource file:/home/zeppelin/env3.tar.gz#env3 -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1413/env3.tar.gz
20/12/17 12:41:44 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1413/pyspark.zip
20/12/17 12:41:44 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1413/py4j-0.10.7-src.zip
20/12/17 12:41:44 INFO Client: Uploading resource file:/code/dist/jobs.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1413/jobs.zip
20/12/17 12:41:44 WARN Client: Same path resource file:///opt/deltalake/delta-core_2.11-0.5.0.jar added multiple times to distributed cache.
20/12/17 12:41:44 INFO Client: Uploading resource file:/tmp/spark-69bef22d-3a04-4957-b3a1-e9c32d458350/__spark_conf__7457082691748150995.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1413/__spark_conf__.zip
20/12/17 12:41:45 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 12:41:45 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 12:41:45 INFO SecurityManager: Changing view acls groups to:
20/12/17 12:41:45 INFO SecurityManager: Changing modify acls groups to:
20/12/17 12:41:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(zeppelin); groups with view permissions: Set(); users with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 12:41:45 INFO Client: Submitting application application_1605081684999_1413 to ResourceManager
20/12/17 12:41:46 INFO YarnClientImpl: Submitted application application_1605081684999_1413
20/12/17 12:41:46 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1605081684999_1413 and attemptId None
20/12/17 12:41:47 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:47 INFO Client:
     client token: N/A
     diagnostics: AM container is launched, waiting for AM container to Register with RM
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1608198105987
     final status: UNDEFINED
     tracking URL: http://master2.host:8088/proxy/application_1605081684999_1413/
     user: zeppelin
20/12/17 12:41:48 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:49 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:50 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:51 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:52 INFO Client: Application report for application_1605081684999_1413 (state: ACCEPTED)
20/12/17 12:41:52 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> master2.host, PROXY_URI_BASES -> http://master2.host:8088/proxy/application_1605081684999_1413), /proxy/application_1605081684999_1413
20/12/17 12:41:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
20/12/17 12:41:52 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
20/12/17 12:41:53 INFO Client: Application report for application_1605081684999_1413 (state: RUNNING)
20/12/17 12:41:53 INFO Client:
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 10.9.14.31
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1608198105987
     final status: UNDEFINED
     tracking URL: http://master2.host:8088/proxy/application_1605081684999_1413/
     user: zeppelin
20/12/17 12:41:53 INFO YarnClientSchedulerBackend: Application application_1605081684999_1413 has started running.
20/12/17 12:41:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 32804.
20/12/17 12:41:53 INFO NettyBlockTransferService: Server created on master2.host:32804
20/12/17 12:41:53 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/12/17 12:41:53 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, master2.host, 32804, None)
20/12/17 12:41:53 INFO BlockManagerMasterEndpoint: Registering block manager master2.host:32804 with 8.4 GB RAM, BlockManagerId(driver, master2.host, 32804, None)
20/12/17 12:41:53 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, master2.host, 32804, None)
20/12/17 12:41:53 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, master2.host, 32804, None)
20/12/17 12:41:53 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json.
20/12/17 12:41:53 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@c75e0b3{/metrics/json,null,AVAILABLE,@Spark}
20/12/17 12:41:53 INFO EventLoggingListener: Logging events to hdfs:/spark2-history/application_1605081684999_1413
20/12/17 12:41:56 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.31:45116) with ID 5
20/12/17 12:41:56 INFO BlockManagerMasterEndpoint: Registering block manager node6.host:36177 with 8.4 GB RAM, BlockManagerId(5, node6.host, 36177, None)
20/12/17 12:41:58 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:55334) with ID 1
20/12/17 12:41:58 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:55332) with ID 8
20/12/17 12:41:58 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:44835 with 8.4 GB RAM, BlockManagerId(1, node2.host, 44835, None)
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:44752 with 8.4 GB RAM, BlockManagerId(8, node2.host, 44752, None)
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:33458) with ID 9
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:33460) with ID 2
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:37568 with 8.4 GB RAM, BlockManagerId(9, node3.host, 37568, None)
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:39825 with 8.4 GB RAM, BlockManagerId(2, node3.host, 39825, None)
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:54662) with ID 10
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:54660) with ID 3
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:33706 with 8.4 GB RAM, BlockManagerId(10, node7.host, 33706, None)
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:44495 with 8.4 GB RAM, BlockManagerId(3, node7.host, 44495, None)
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.30:51678) with ID 7
20/12/17 12:41:59 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
20/12/17 12:41:59 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:57828) with ID 6
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node5.host:41353 with 8.4 GB RAM, BlockManagerId(7, node5.host, 41353, None)
20/12/17 12:41:59 INFO SharedState: loading hive config file: file:/etc/spark/conf.dist/hive-site.xml
20/12/17 12:41:59 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/code/dist/spark-warehouse').
20/12/17 12:41:59 INFO SharedState: Warehouse path is 'file:/code/dist/spark-warehouse'.
20/12/17 12:41:59 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:40037 with 8.4 GB RAM, BlockManagerId(6, node1.host, 40037, None)
20/12/17 12:41:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL.
20/12/17 12:41:59 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@13a6a145{/SQL,null,AVAILABLE,@Spark}
20/12/17 12:41:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/json.
20/12/17 12:41:59 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@38b6cedb{/SQL/json,null,AVAILABLE,@Spark}
20/12/17 12:41:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution.
20/12/17 12:41:59 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@124c532b{/SQL/execution,null,AVAILABLE,@Spark}
20/12/17 12:41:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution/json.
20/12/17 12:41:59 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6b73efc3{/SQL/execution/json,null,AVAILABLE,@Spark}
20/12/17 12:41:59 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /static/sql.
20/12/17 12:41:59 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@388c6f85{/static/sql,null,AVAILABLE,@Spark}
20/12/17 12:42:00 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:58708) with ID 4
20/12/17 12:42:00 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:44212 with 8.4 GB RAM, BlockManagerId(4, node4.host, 44212, None)
20/12/17 12:42:00 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
df5.count() 2101671
2:
java<br> <br>20/12/17 12:59:06 INFO SparkContext: Running Spark version 2.4.4 <br>20/12/17 12:59:06 INFO SparkContext: Submitted application: temp <br>20/12/17 12:59:06 INFO SecurityManager: Changing view acls to: zeppelin <br>20/12/17 12:59:06 INFO SecurityManager: Changing modify acls to: zeppelin <br>20/12/17 12:59:06 INFO SecurityManager: Changing view acls groups to: <br>20/12/17 12:59:06 INFO SecurityManager: Changing modify acls groups to: <br>20/12/17 12:59:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(zeppelin); groups with view permissions: Set(); users with modify permissions: Set(zeppelin); groups with modify permissions: Set() <br>20/12/17 12:59:07 INFO Utils: Successfully started service 'sparkDriver' on port 45389. <br>20/12/17 12:59:07 INFO SparkEnv: Registering MapOutputTracker <br>20/12/17 12:59:07 INFO SparkEnv: Registering BlockManagerMaster <br>20/12/17 12:59:07 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information <br>20/12/17 12:59:07 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up <br>20/12/17 12:59:07 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-17406a27-5051-4c88-95b0-c405a6410260 <br>20/12/17 12:59:07 INFO MemoryStore: MemoryStore started with capacity 8.4 GB <br>20/12/17 12:59:07 INFO SparkEnv: Registering OutputCommitCoordinator <br>20/12/17 12:59:07 INFO log: Logging initialized @2420ms <br>20/12/17 12:59:07 INFO Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown <br>20/12/17 12:59:07 INFO Server: Started @2495ms <br>20/12/17 12:59:07 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. <br>20/12/17 12:59:07 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042. 
<br>20/12/17 12:59:07 INFO AbstractConnector: Started ServerConnector@24000415{HTTP/1.1,[http/1.1]}{0.0.0.0:4042} <br>20/12/17 12:59:07 INFO Utils: Successfully started service 'SparkUI' on port 4042. <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@57b776f5{/jobs,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7b87d546{/jobs/json,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@186596f1{/jobs/job,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@725f2e43{/jobs/job/json,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3737f8ab{/stages,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3d122a2{/stages/json,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@32581ebb{/stages/stage,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3b4ab601{/stages/stage/json,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4995ca15{/stages/pool,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3e0a99{/stages/pool/json,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@139c7561{/storage,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7d8833ad{/storage/json,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1a2a8b6b{/storage/rdd,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6a303075{/storage/rdd/json,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO 
ContextHandler: Started o.s.j.s.ServletContextHandler@1af7672f{/environment,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@265e2a87{/environment/json,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@68e19cf4{/executors,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@339ba05{/executors/json,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@22566b52{/executors/threadDump,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@750b678d{/executors/threadDump/json,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@116943e4{/static,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@24e6b76{/,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@13c067eb{/api,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@769076ae{/jobs/job/kill,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@59b615b3{/stages/stage/kill,null,AVAILABLE,@Spark} <br>20/12/17 12:59:07 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://master2.host:4042 <br>20/12/17 12:59:07 INFO RMProxy: Connecting to ResourceManager at master2.host/10.9.14.25:8050 <br>20/12/17 12:59:07 INFO Client: Requesting a new application from cluster with 7 NodeManagers <br>20/12/17 12:59:07 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (618496 MB per container) <br>20/12/17 12:59:07 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead <br>20/12/17 12:59:07 INFO Client: Setting up container launch 
context for our AM <br>20/12/17 12:59:07 INFO Client: Setting up the launch environment for our AM container <br>20/12/17 12:59:07 INFO Client: Preparing resources for our AM container <br>20/12/17 12:59:08 INFO Client: Source and destination file systems are the same. Not copying hdfs:/apps/spark2/jars/spark2-ADH-yarn-archive.tar.gz <br>20/12/17 12:59:08 INFO Client: Uploading resource file:/opt/deltalake/delta-core_2.11-0.5.0.jar -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/delta-core_2.11-0.5.0.jar <br>20/12/17 12:59:08 INFO Client: Uploading resource file:/home/zeppelin/env3.tar.gz#env3 -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/env3.tar.gz <br>20/12/17 12:59:09 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/pyspark.zip <br>20/12/17 12:59:09 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/py4j-0.10.7-src.zip <br>20/12/17 12:59:09 INFO Client: Uploading resource file:/code/dist/jobs.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/jobs.zip <br>20/12/17 12:59:09 WARN Client: Same path resource file:///opt/deltalake/delta-core_2.11-0.5.0.jar added multiple times to distributed cache. 
<br>20/12/17 12:59:09 INFO Client: Uploading resource file:/tmp/spark-46deff7e-62da-4303-88f7-8832b7e02e38/__spark_conf__3971450022345669465.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1417/__spark_conf__.zip <br>20/12/17 12:59:09 INFO SecurityManager: Changing view acls to: zeppelin <br>20/12/17 12:59:09 INFO SecurityManager: Changing modify acls to: zeppelin <br>20/12/17 12:59:09 INFO SecurityManager: Changing view acls groups to: <br>20/12/17 12:59:09 INFO SecurityManager: Changing modify acls groups to: <br>20/12/17 12:59:09 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(zeppelin); groups with view permissions: Set(); users with modify permissions: Set(zeppelin); groups with modify permissions: Set() <br>20/12/17 12:59:10 INFO Client: Submitting application application_1605081684999_1417 to ResourceManager <br>20/12/17 12:59:10 INFO YarnClientImpl: Submitted application application_1605081684999_1417 <br>20/12/17 12:59:10 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1605081684999_1417 and attemptId None <br>20/12/17 12:59:11 INFO Client: Application report for application_1605081684999_1417 (state: ACCEPTED) <br>20/12/17 12:59:11 INFO Client: <br> client token: N/A <br> diagnostics: AM container is launched, waiting for AM container to Register with RM <br> ApplicationMaster host: N/A <br> ApplicationMaster RPC port: -1 <br> queue: default <br> start time: 1608199150192 <br> final status: UNDEFINED <br> tracking URL: http://master2.host:8088/proxy/application_1605081684999_1417/ <br> user: zeppelin <br>20/12/17 12:59:12 INFO Client: Application report for application_1605081684999_1417 (state: ACCEPTED) <br>20/12/17 12:59:13 INFO Client: Application report for application_1605081684999_1417 (state: ACCEPTED) <br>20/12/17 12:59:14 INFO Client: Application report for application_1605081684999_1417 (state: 
ACCEPTED) <br>20/12/17 12:59:15 INFO Client: Application report for application_1605081684999_1417 (state: ACCEPTED) <br>20/12/17 12:59:16 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> master2.host, PROXY_URI_BASES -> http://master2.host:8088/proxy/application_1605081684999_1417), /proxy/application_1605081684999_1417 <br>20/12/17 12:59:16 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill. <br>20/12/17 12:59:16 INFO Client: Application report for application_1605081684999_1417 (state: RUNNING) <br>20/12/17 12:59:16 INFO Client: <br> client token: N/A <br> diagnostics: N/A <br> ApplicationMaster host: 10.9.14.26 <br> ApplicationMaster RPC port: -1 <br> queue: default <br> start time: 1608199150192 <br> final status: UNDEFINED <br> tracking URL: http://master2.host:8088/proxy/application_1605081684999_1417/ <br> user: zeppelin <br>20/12/17 12:59:16 INFO YarnClientSchedulerBackend: Application application_1605081684999_1417 has started running. <br>20/12/17 12:59:16 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41162. 
<br>20/12/17 12:59:16 INFO NettyBlockTransferService: Server created on master2.host:41162 <br>20/12/17 12:59:16 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy <br>20/12/17 12:59:16 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, master2.host, 41162, None) <br>20/12/17 12:59:16 INFO BlockManagerMasterEndpoint: Registering block manager master2.host:41162 with 8.4 GB RAM, BlockManagerId(driver, master2.host, 41162, None) <br>20/12/17 12:59:16 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, master2.host, 41162, None) <br>20/12/17 12:59:16 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, master2.host, 41162, None) <br>20/12/17 12:59:16 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM) <br>20/12/17 12:59:16 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json. 
<br>20/12/17 12:59:16 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5fccf348{/metrics/json,null,AVAILABLE,@Spark} <br>20/12/17 12:59:16 INFO EventLoggingListener: Logging events to hdfs:/spark2-history/application_1605081684999_1417 <br>20/12/17 12:59:19 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:40374) with ID 5 <br>20/12/17 12:59:20 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:40686 with 8.4 GB RAM, BlockManagerId(5, node1.host, 40686, None) <br>20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:47210) with ID 2 <br>20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:47212) with ID 9 <br>20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:51686) with ID 10 <br>20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.30:57124) with ID 6 <br>20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:34035 with 8.4 GB RAM, BlockManagerId(2, node3.host, 34035, None) <br>20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.31:55960) with ID 7 <br>20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:51688) with ID 3 <br>20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:44172) with ID 4 <br>20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:36488 with 8.4 GB RAM, BlockManagerId(9, 
node3.host, 36488, None) <br>20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:36671 with 8.4 GB RAM, BlockManagerId(10, node7.host, 36671, None) <br>20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node5.host:40670 with 8.4 GB RAM, BlockManagerId(6, node5.host, 40670, None) <br>20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:46686) with ID 1 <br>20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node6.host:33506 with 8.4 GB RAM, BlockManagerId(7, node6.host, 33506, None) <br>20/12/17 12:59:23 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:46688) with ID 8 <br>20/12/17 12:59:23 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8 <br>20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:41013 with 8.4 GB RAM, BlockManagerId(3, node7.host, 41013, None) <br>20/12/17 12:59:23 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:34615 with 8.4 GB RAM, BlockManagerId(4, node4.host, 34615, None) <br>20/12/17 12:59:24 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:39108 with 8.4 GB RAM, BlockManagerId(1, node2.host, 39108, None) <br>20/12/17 12:59:24 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:37151 with 8.4 GB RAM, BlockManagerId(8, node2.host, 37151, None) <br>20/12/17 12:59:24 INFO SharedState: loading hive config file: file:/etc/spark/conf.dist/hive-site.xml <br>20/12/17 12:59:24 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/code/dist/spark-warehouse'). <br>20/12/17 12:59:24 INFO SharedState: Warehouse path is 'file:/code/dist/spark-warehouse'. 
<br>20/12/17 12:59:24 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL. <br>20/12/17 12:59:24 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@78d74a59{/SQL,null,AVAILABLE,@Spark} <br>20/12/17 12:59:24 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/json. <br>20/12/17 12:59:24 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3941bf4b{/SQL/json,null,AVAILABLE,@Spark} <br>20/12/17 12:59:24 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution. <br>20/12/17 12:59:24 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@13201296{/SQL/execution,null,AVAILABLE,@Spark} <br>20/12/17 12:59:24 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution/json. <br>20/12/17 12:59:24 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@441b0461{/SQL/execution/json,null,AVAILABLE,@Spark} <br>20/12/17 12:59:24 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /static/sql. 
<br>20/12/17 12:59:24 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3212cd96{/static/sql,null,AVAILABLE,@Spark} <br>20/12/17 12:59:24 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint <br>20/12/17 13:00:14 ERROR TaskSetManager: Task 93 in stage 40.0 failed 4 times; aborting job <br>Traceback (most recent call last): <br> File "/code/dist/main.py", line 155, in <module> <br> job_module.analyze(spark, args.job_args, configs) <br> File "jobs.zip/jobs/temp/__init__.py", line 29, in analyze <br> File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 523, in count <br> File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ <br> File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco <br> File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value <br>py4j.protocol.Py4JJavaError: An error occurred while calling o238.count. <br>: org.apache.spark.SparkException: Job aborted due to stage failure: Task 93 in stage 40.0 failed 4 times, most recent failure: Lost task 93.3 in stage 40.0 (TID 1138, node3.host, executor 9): java.lang.IllegalArgumentException <br> at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) <br> at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543) <br> at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58) <br> at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132) <br> at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181) <br> at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172) <br> at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65) <br> at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162) <br> at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122) <br> at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) <br> at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) <br> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) <br> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) <br> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source) <br> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source) <br> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) <br> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) <br> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) <br> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) <br> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) <br> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) <br> at org.apache.spark.scheduler.Task.run(Task.scala:123) <br> at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) <br> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) <br> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) <br> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) <br> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) <br> at java.lang.Thread.run(Thread.java:748) <br> <br>Driver stacktrace: <br> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889) <br> at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877) <br> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876) <br> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) <br> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) <br> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876) <br> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926) <br> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926) <br> at scala.Option.foreach(Option.scala:257) <br> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926) <br> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110) <br> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059) <br> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048) <br> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) <br> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737) <br> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061) <br> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082) <br> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101) <br> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126) <br> at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945) <br> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) <br> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) <br> at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) <br> at org.apache.spark.rdd.RDD.collect(RDD.scala:944) <br> at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299) <br> at 
org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836) <br> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2835) <br> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370) <br> at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) <br> at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) <br> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) <br> at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369) <br> at org.apache.spark.sql.Dataset.count(Dataset.scala:2835) <br> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) <br> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) <br> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) <br> at java.lang.reflect.Method.invoke(Method.java:498) <br> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) <br> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) <br> at py4j.Gateway.invoke(Gateway.java:282) <br> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) <br> at py4j.commands.CallCommand.execute(CallCommand.java:79) <br> at py4j.GatewayConnection.run(GatewayConnection.java:238) <br> at java.lang.Thread.run(Thread.java:748) <br>Caused by: java.lang.IllegalArgumentException <br> at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) <br> at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543) <br> at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58) <br> at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132) <br> at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181) <br> at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172) <br> at 
org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65) <br> at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162) <br> at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122) <br> at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410) <br> at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) <br> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) <br> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) <br> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source) <br> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source) <br> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) <br> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) <br> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) <br> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) <br> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) <br> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) <br> at org.apache.spark.scheduler.Task.run(Task.scala:123) <br> at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) <br> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) <br> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) <br> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) <br> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) <br> ... 1 more <br>
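The stack trace above fails in `java.nio.ByteBuffer.allocate(ByteBuffer.java:334)`, which throws `IllegalArgumentException` whenever it is asked for a negative capacity. It is reached from `MessageSerializer.readMessage`, the point where Arrow's IPC stream reader decodes the signed 32-bit length prefix of the next message. A plausible reading (an assumption, not confirmed by the logs) is that a single pandas-UDF group serialized to more than `Integer.MAX_VALUE` bytes, so the length field wrapped negative and the reader tried to allocate a negative-sized buffer. The wrap-around itself can be sketched with the stdlib; the batch size below is a hypothetical value for illustration:

```python
import struct

# Arrow's IPC framing prefixes each message with a signed 32-bit length.
# If a serialized record batch exceeds Integer.MAX_VALUE (2**31 - 1) bytes,
# that length wraps negative when reinterpreted as a signed int -- and
# java.nio.ByteBuffer.allocate(<negative>) throws IllegalArgumentException.
batch_size = 2_200_000_000  # hypothetical: one group's batch, > 2**31 - 1

# Pack as an unsigned 32-bit int, then unpack as signed, mimicking how a
# Java reader would interpret the 4-byte length field.
wrapped = struct.unpack('<i', struct.pack('<I', batch_size & 0xFFFFFFFF))[0]
print(wrapped)  # -> -2094967296, the "capacity" the reader would request
```

If this reading is right, it would also explain why the job only fails on some tasks: only partitions whose largest group crosses the 2 GB boundary hit the overflow, which is consistent with a single task (93 in stage 40.0) failing repeatedly while the rest of the stage succeeds.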
3:
java<br> <br>20/12/17 12:43:32 INFO SparkContext: Running Spark version 2.4.4 <br>20/12/17 12:43:32 INFO SparkContext: Submitted application: temp <br>20/12/17 12:43:32 INFO SecurityManager: Changing view acls to: zeppelin <br>20/12/17 12:43:32 INFO SecurityManager: Changing modify acls to: zeppelin <br>20/12/17 12:43:32 INFO SecurityManager: Changing view acls groups to: <br>20/12/17 12:43:32 INFO SecurityManager: Changing modify acls groups to: <br>20/12/17 12:43:32 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(zeppelin); groups with view permissions: Set(); users with modify permissions: Set(zeppelin); groups with modify permissions: Set() <br>20/12/17 12:43:32 INFO Utils: Successfully started service 'sparkDriver' on port 33378. <br>20/12/17 12:43:32 INFO SparkEnv: Registering MapOutputTracker <br>20/12/17 12:43:32 INFO SparkEnv: Registering BlockManagerMaster <br>20/12/17 12:43:32 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information <br>20/12/17 12:43:32 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up <br>20/12/17 12:43:32 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-8bba8740-cc56-418e-b759-8d2d0181db2c <br>20/12/17 12:43:32 INFO MemoryStore: MemoryStore started with capacity 8.4 GB <br>20/12/17 12:43:32 INFO SparkEnv: Registering OutputCommitCoordinator <br>20/12/17 12:43:33 INFO log: Logging initialized @2474ms <br>20/12/17 12:43:33 INFO Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown <br>20/12/17 12:43:33 INFO Server: Started @2548ms <br>20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. <br>20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042. <br>20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043. 
20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
20/12/17 12:43:33 WARN Utils: Service 'SparkUI' could not bind on port 4045. Attempting port 4046.
20/12/17 12:43:33 INFO AbstractConnector: Started ServerConnector@4128bb9{HTTP/1.1,[http/1.1]}{0.0.0.0:4046}
20/12/17 12:43:33 INFO Utils: Successfully started service 'SparkUI' on port 4046.
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@297aaa65{/jobs,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4817a878{/jobs/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4831e6cf{/jobs/job,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2a6bc2a5{/jobs/job/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5fcaebf8{/stages,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@180a7950{/stages/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7689d370{/stages/stage,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4195df87{/stages/stage/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@793f8696{/stages/pool,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@61eb6c76{/stages/pool/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1cb3ca39{/storage,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@73b557cf{/storage/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@763b7419{/storage/rdd,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4c892e74{/storage/rdd/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4e69c7de{/environment,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@173beaf3{/environment/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3a0dcfb1{/executors,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3ff2ac0a{/executors/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3fa5db1d{/executors/threadDump,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7701d268{/executors/threadDump/json,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@23a5eb7e{/static,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@55529b1e{/,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6ad6fed6{/api,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@42315ca4{/jobs/job/kill,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@797cae32{/stages/stage/kill,null,AVAILABLE,@Spark}
20/12/17 12:43:33 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://master2.host:4046
20/12/17 12:43:33 INFO RMProxy: Connecting to ResourceManager at master2.host/10.9.14.25:8050
20/12/17 12:43:33 INFO Client: Requesting a new application from cluster with 7 NodeManagers
20/12/17 12:43:33 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (618496 MB per container)
20/12/17 12:43:33 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
20/12/17 12:43:33 INFO Client: Setting up container launch context for our AM
20/12/17 12:43:33 INFO Client: Setting up the launch environment for our AM container
20/12/17 12:43:33 INFO Client: Preparing resources for our AM container
20/12/17 12:43:33 INFO Client: Source and destination file systems are the same. Not copying hdfs:/apps/spark2/jars/spark2-ADH-yarn-archive.tar.gz
20/12/17 12:43:33 INFO Client: Uploading resource file:/opt/deltalake/delta-core_2.11-0.5.0.jar -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/delta-core_2.11-0.5.0.jar
20/12/17 12:43:34 INFO Client: Uploading resource file:/home/zeppelin/env3.tar.gz#env3 -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/env3.tar.gz
20/12/17 12:43:34 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/pyspark.zip
20/12/17 12:43:34 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/py4j-0.10.7-src.zip
20/12/17 12:43:34 INFO Client: Uploading resource file:/code/dist/jobs.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/jobs.zip
20/12/17 12:43:34 WARN Client: Same path resource file:///opt/deltalake/delta-core_2.11-0.5.0.jar added multiple times to distributed cache.
20/12/17 12:43:34 INFO Client: Uploading resource file:/tmp/spark-7fab6d5d-5cf9-441d-9f8b-4a99a13d5e85/__spark_conf__1923213682968598810.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1414/__spark_conf__.zip
20/12/17 12:43:34 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 12:43:34 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 12:43:34 INFO SecurityManager: Changing view acls groups to:
20/12/17 12:43:34 INFO SecurityManager: Changing modify acls groups to:
20/12/17 12:43:34 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(zeppelin); groups with view permissions: Set(); users with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 12:43:35 INFO Client: Submitting application application_1605081684999_1414 to ResourceManager
20/12/17 12:43:35 INFO YarnClientImpl: Submitted application application_1605081684999_1414
20/12/17 12:43:35 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1605081684999_1414 and attemptId None
20/12/17 12:43:36 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:36 INFO Client:
	 client token: N/A
	 diagnostics: AM container is launched, waiting for AM container to Register with RM
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608198215763
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1414/
	 user: zeppelin
20/12/17 12:43:37 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:38 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:39 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:40 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:41 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:42 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:43 INFO Client: Application report for application_1605081684999_1414 (state: ACCEPTED)
20/12/17 12:43:44 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> master2.host, PROXY_URI_BASES -> http://master2.host:8088/proxy/application_1605081684999_1414), /proxy/application_1605081684999_1414
20/12/17 12:43:44 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
20/12/17 12:43:44 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
20/12/17 12:43:44 INFO Client: Application report for application_1605081684999_1414 (state: RUNNING)
20/12/17 12:43:44 INFO Client:
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 10.9.14.27
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1608198215763
	 final status: UNDEFINED
	 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1414/
	 user: zeppelin
20/12/17 12:43:44 INFO YarnClientSchedulerBackend: Application application_1605081684999_1414 has started running.
20/12/17 12:43:44 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39612.
20/12/17 12:43:44 INFO NettyBlockTransferService: Server created on master2.host:39612
20/12/17 12:43:44 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/12/17 12:43:44 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, master2.host, 39612, None)
20/12/17 12:43:44 INFO BlockManagerMasterEndpoint: Registering block manager master2.host:39612 with 8.4 GB RAM, BlockManagerId(driver, master2.host, 39612, None)
20/12/17 12:43:44 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, master2.host, 39612, None)
20/12/17 12:43:44 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, master2.host, 39612, None)
20/12/17 12:43:45 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json.
20/12/17 12:43:45 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4d7fcb59{/metrics/json,null,AVAILABLE,@Spark}
20/12/17 12:43:45 INFO EventLoggingListener: Logging events to hdfs:/spark2-history/application_1605081684999_1414
20/12/17 12:43:47 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:37836) with ID 4
20/12/17 12:43:47 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:39627 with 8.4 GB RAM, BlockManagerId(4, node2.host, 39627, None)
20/12/17 12:43:51 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:51120) with ID 6
20/12/17 12:43:51 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:33184 with 8.4 GB RAM, BlockManagerId(6, node3.host, 33184, None)
20/12/17 12:43:51 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:43924) with ID 2
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:46720 with 8.4 GB RAM, BlockManagerId(2, node1.host, 46720, None)
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:59464) with ID 5
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.30:36348) with ID 8
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:43926) with ID 9
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.30:36346) with ID 1
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:40610 with 8.4 GB RAM, BlockManagerId(5, node7.host, 40610, None)
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:39682) with ID 10
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:46068 with 8.4 GB RAM, BlockManagerId(9, node1.host, 46068, None)
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node5.host:45165 with 8.4 GB RAM, BlockManagerId(1, node5.host, 45165, None)
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node5.host:37627 with 8.4 GB RAM, BlockManagerId(8, node5.host, 37627, None)
20/12/17 12:43:52 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:39684) with ID 3
20/12/17 12:43:52 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:40597 with 8.4 GB RAM, BlockManagerId(10, node4.host, 40597, None)
20/12/17 12:43:52 INFO SharedState: loading hive config file: file:/etc/spark/conf.dist/hive-site.xml
20/12/17 12:43:52 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:46284 with 8.4 GB RAM, BlockManagerId(3, node4.host, 46284, None)
20/12/17 12:43:52 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/code/dist/spark-warehouse').
20/12/17 12:43:52 INFO SharedState: Warehouse path is 'file:/code/dist/spark-warehouse'.
20/12/17 12:43:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL.
20/12/17 12:43:52 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4f67d4c0{/SQL,null,AVAILABLE,@Spark}
20/12/17 12:43:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/json.
20/12/17 12:43:52 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2e1bff0b{/SQL/json,null,AVAILABLE,@Spark}
20/12/17 12:43:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution.
20/12/17 12:43:52 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@9336837{/SQL/execution,null,AVAILABLE,@Spark}
20/12/17 12:43:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution/json.
20/12/17 12:43:52 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@829dc36{/SQL/execution/json,null,AVAILABLE,@Spark}
20/12/17 12:43:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /static/sql.
20/12/17 12:43:52 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4e8262e4{/static/sql,null,AVAILABLE,@Spark}
20/12/17 12:43:52 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/12/17 12:46:39 ERROR TaskSetManager: Task 93 in stage 31.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "/code/dist/main.py", line 155, in <module>
    job_module.analyze(spark, args.job_args, configs)
  File "jobs.zip/jobs/temp/__init__.py", line 29, in analyze
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 523, in count
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o238.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 93 in stage 31.0 failed 4 times, most recent failure: Lost task 93.3 in stage 31.0 (TID 1030, node2.host, executor 4): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/serializers.py", line 286, in dump_stream
    for series in iterator:
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/serializers.py", line 303, in load_stream
    for batch in reader:
  File "pyarrow/ipc.pxi", line 265, in __iter__
  File "pyarrow/ipc.pxi", line 281, in pyarrow.lib._RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: read length must be positive or -1

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836)
	at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2835)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
	at org.apache.spark.sql.Dataset.count(Dataset.scala:2835)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/serializers.py", line 286, in dump_stream
    for series in iterator:
  File "/hadoop/hdfs/data06/local/usercache/zeppelin/appcache/application_1605081684999_1414/container_e20_1605081684999_1414_01_000005/pyspark.zip/pyspark/serializers.py", line 303, in load_stream
    for batch in reader:
  File "pyarrow/ipc.pxi", line 265, in __iter__
  File "pyarrow/ipc.pxi", line 281, in pyarrow.lib._RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: read length must be positive or -1

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
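The executor-side failure above (`pyarrow.lib.ArrowIOError: read length must be positive or -1`, raised inside `load_stream` while reading Arrow record batches for the grouped-map pandas UDF) is commonly caused on Spark 2.4.x by running Python workers with pyarrow >= 0.15.0, whose Arrow IPC stream format changed incompatibly with the Arrow Java version bundled in Spark 2.4 (tracked upstream as SPARK-29367). If the `env3.tar.gz` environment shipped to executors contains such a pyarrow, the documented workarounds for Spark 2.4 are to pin `pyarrow < 0.15.0` or to enable the legacy IPC format on both the application master and the executors. A minimal configuration sketch for the YARN client-mode deployment seen in this log, assuming the pyarrow version is indeed the trigger:

```
# spark-defaults.conf (or pass each as --conf); relevant only for Spark 2.4.x with pyarrow >= 0.15.0
spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT  1
spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT        1
```

Note this compatibility flag was removed in Spark 3.x, which upgraded its Arrow integration; upgrading Spark (or pinning pyarrow) is the longer-term fix.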
java<br> <br>20/12/17 13:35:40 INFO SparkContext: Running Spark version 2.4.4 <br>20/12/17 13:35:40 INFO SparkContext: Submitted application: temp <br>20/12/17 13:35:40 INFO SecurityManager: Changing view acls to: zeppelin <br>20/12/17 13:35:40 INFO SecurityManager: Changing modify acls to: zeppelin <br>20/12/17 13:35:40 INFO SecurityManager: Changing view acls groups to: <br>20/12/17 13:35:40 INFO SecurityManager: Changing modify acls groups to: <br>20/12/17 13:35:40 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(zeppelin); groups with view permissions: Set(); users with modify permissions: Set(zeppelin); groups with modify permissions: Set() <br>20/12/17 13:35:40 INFO Utils: Successfully started service 'sparkDriver' on port 33335. <br>20/12/17 13:35:40 INFO SparkEnv: Registering MapOutputTracker <br>20/12/17 13:35:40 INFO SparkEnv: Registering BlockManagerMaster <br>20/12/17 13:35:40 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information <br>20/12/17 13:35:40 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up <br>20/12/17 13:35:40 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-aea190e5-8c84-40e4-9f57-cf8af75f73f7 <br>20/12/17 13:35:40 INFO MemoryStore: MemoryStore started with capacity 8.4 GB <br>20/12/17 13:35:40 INFO SparkEnv: Registering OutputCommitCoordinator <br>20/12/17 13:35:40 INFO log: Logging initialized @2440ms <br>20/12/17 13:35:40 INFO Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown <br>20/12/17 13:35:40 INFO Server: Started @2511ms <br>20/12/17 13:35:40 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041. <br>20/12/17 13:35:40 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042. 
<br>20/12/17 13:35:40 INFO AbstractConnector: Started ServerConnector@594d8c7d{HTTP/1.1,[http/1.1]}{0.0.0.0:4042} <br>20/12/17 13:35:40 INFO Utils: Successfully started service 'SparkUI' on port 4042. <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2271a0eb{/jobs,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@30aa0825{/jobs/json,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@42d2b91d{/jobs/job,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@73f56489{/jobs/job/json,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@137ea1f2{/stages,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@44b3e8d1{/stages/json,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@239dc14b{/stages/stage,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@46ade13d{/stages/stage/json,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@b77f73a{/stages/pool,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@517df609{/stages/pool/json,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@57c1c10d{/storage,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7ecc76c1{/storage/json,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4a369005{/storage/rdd,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@7445eaf4{/storage/rdd/json,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO 
ContextHandler: Started o.s.j.s.ServletContextHandler@2d5594d8{/environment,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@303295cd{/environment/json,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4e7a9d76{/executors,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@15b3dc07{/executors/json,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@14c1793d{/executors/threadDump,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@30f90a95{/executors/threadDump/json,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@2a72b0d1{/static,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@76865c63{/,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3bd4e264{/api,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4b660222{/jobs/job/kill,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@16f643b2{/stages/stage/kill,null,AVAILABLE,@Spark} <br>20/12/17 13:35:40 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://master2.host:4042 <br>20/12/17 13:35:41 INFO RMProxy: Connecting to ResourceManager at master2.host/10.9.14.25:8050 <br>20/12/17 13:35:41 INFO Client: Requesting a new application from cluster with 7 NodeManagers <br>20/12/17 13:35:41 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (618496 MB per container) <br>20/12/17 13:35:41 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead <br>20/12/17 13:35:41 INFO Client: Setting up container launch 
context for our AM <br>20/12/17 13:35:41 INFO Client: Setting up the launch environment for our AM container <br>20/12/17 13:35:41 INFO Client: Preparing resources for our AM container <br>20/12/17 13:35:41 INFO Client: Source and destination file systems are the same. Not copying hdfs:/apps/spark2/jars/spark2-ADH-yarn-archive.tar.gz <br>20/12/17 13:35:41 INFO Client: Uploading resource file:/opt/deltalake/delta-core_2.11-0.5.0.jar -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/delta-core_2.11-0.5.0.jar <br>20/12/17 13:35:41 INFO Client: Uploading resource file:/home/zeppelin/env3.tar.gz#env3 -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/env3.tar.gz <br>20/12/17 13:35:41 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/pyspark.zip <br>20/12/17 13:35:41 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/py4j-0.10.7-src.zip <br>20/12/17 13:35:41 INFO Client: Uploading resource file:/code/dist/jobs.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/jobs.zip <br>20/12/17 13:35:41 WARN Client: Same path resource file:///opt/deltalake/delta-core_2.11-0.5.0.jar added multiple times to distributed cache. 
20/12/17 13:35:42 INFO Client: Uploading resource file:/tmp/spark-9804f727-4c1a-44f7-ae39-26ec382332a7/__spark_conf__1703784814635255608.zip -> hdfs://master1.host:8020/user/zeppelin/.sparkStaging/application_1605081684999_1428/__spark_conf__.zip
20/12/17 13:35:42 INFO SecurityManager: Changing view acls to: zeppelin
20/12/17 13:35:42 INFO SecurityManager: Changing modify acls to: zeppelin
20/12/17 13:35:42 INFO SecurityManager: Changing view acls groups to:
20/12/17 13:35:42 INFO SecurityManager: Changing modify acls groups to:
20/12/17 13:35:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(zeppelin); groups with view permissions: Set(); users with modify permissions: Set(zeppelin); groups with modify permissions: Set()
20/12/17 13:35:43 INFO Client: Submitting application application_1605081684999_1428 to ResourceManager
20/12/17 13:35:43 INFO YarnClientImpl: Submitted application application_1605081684999_1428
20/12/17 13:35:43 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1605081684999_1428 and attemptId None
20/12/17 13:35:44 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:44 INFO Client:
 client token: N/A
 diagnostics: AM container is launched, waiting for AM container to Register with RM
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: default
 start time: 1608201343091
 final status: UNDEFINED
 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1428/
 user: zeppelin
20/12/17 13:35:45 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:46 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:47 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:48 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:49 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:50 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:51 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:52 INFO Client: Application report for application_1605081684999_1428 (state: ACCEPTED)
20/12/17 13:35:52 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> master2.host, PROXY_URI_BASES -> http://master2.host:8088/proxy/application_1605081684999_1428), /proxy/application_1605081684999_1428
20/12/17 13:35:52 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
20/12/17 13:35:53 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
20/12/17 13:35:53 INFO Client: Application report for application_1605081684999_1428 (state: RUNNING)
20/12/17 13:35:53 INFO Client:
 client token: N/A
 diagnostics: N/A
 ApplicationMaster host: 10.9.14.29
 ApplicationMaster RPC port: -1
 queue: default
 start time: 1608201343091
 final status: UNDEFINED
 tracking URL: http://master2.host:8088/proxy/application_1605081684999_1428/
 user: zeppelin
20/12/17 13:35:53 INFO YarnClientSchedulerBackend: Application application_1605081684999_1428 has started running.
20/12/17 13:35:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38210.
20/12/17 13:35:53 INFO NettyBlockTransferService: Server created on master2.host:38210
20/12/17 13:35:53 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/12/17 13:35:53 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, master2.host, 38210, None)
20/12/17 13:35:53 INFO BlockManagerMasterEndpoint: Registering block manager master2.host:38210 with 8.4 GB RAM, BlockManagerId(driver, master2.host, 38210, None)
20/12/17 13:35:53 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, master2.host, 38210, None)
20/12/17 13:35:53 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, master2.host, 38210, None)
20/12/17 13:35:53 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /metrics/json.
20/12/17 13:35:53 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4b00b26c{/metrics/json,null,AVAILABLE,@Spark}
20/12/17 13:35:53 INFO EventLoggingListener: Logging events to hdfs:/spark2-history/application_1605081684999_1428
20/12/17 13:35:57 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:42340) with ID 1
20/12/17 13:35:57 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:42288 with 8.4 GB RAM, BlockManagerId(1, node4.host, 42288, None)
20/12/17 13:35:58 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.29:42346) with ID 8
20/12/17 13:35:58 INFO BlockManagerMasterEndpoint: Registering block manager node4.host:46543 with 8.4 GB RAM, BlockManagerId(8, node4.host, 46543, None)
20/12/17 13:36:02 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:55632) with ID 9
20/12/17 13:36:02 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:34973 with 8.4 GB RAM, BlockManagerId(9, node1.host, 34973, None)
20/12/17 13:36:02 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.26:55634) with ID 2
20/12/17 13:36:02 INFO BlockManagerMasterEndpoint: Registering block manager node1.host:35223 with 8.4 GB RAM, BlockManagerId(2, node1.host, 35223, None)
20/12/17 13:36:02 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.31:49516) with ID 6
20/12/17 13:36:02 INFO BlockManagerMasterEndpoint: Registering block manager node6.host:35440 with 8.4 GB RAM, BlockManagerId(6, node6.host, 35440, None)
20/12/17 13:36:02 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.28:49014) with ID 4
20/12/17 13:36:02 INFO BlockManagerMasterEndpoint: Registering block manager node3.host:46367 with 8.4 GB RAM, BlockManagerId(4, node3.host, 46367, None)
20/12/17 13:36:03 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.32:38700) with ID 7
20/12/17 13:36:03 INFO BlockManagerMasterEndpoint: Registering block manager node7.host:44537 with 8.4 GB RAM, BlockManagerId(7, node7.host, 44537, None)
20/12/17 13:36:03 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:51310) with ID 10
20/12/17 13:36:03 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
20/12/17 13:36:03 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:35748 with 8.4 GB RAM, BlockManagerId(10, node2.host, 35748, None)
20/12/17 13:36:03 INFO SharedState: loading hive config file: file:/etc/spark/conf.dist/hive-site.xml
20/12/17 13:36:03 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/code/dist/spark-warehouse').
20/12/17 13:36:03 INFO SharedState: Warehouse path is 'file:/code/dist/spark-warehouse'.
20/12/17 13:36:03 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL.
20/12/17 13:36:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@674d79fd{/SQL,null,AVAILABLE,@Spark}
20/12/17 13:36:03 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/json.
20/12/17 13:36:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@9d4247b{/SQL/json,null,AVAILABLE,@Spark}
20/12/17 13:36:03 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution.
20/12/17 13:36:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@76da7433{/SQL/execution,null,AVAILABLE,@Spark}
20/12/17 13:36:03 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /SQL/execution/json.
20/12/17 13:36:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@3d37b97{/SQL/execution/json,null,AVAILABLE,@Spark}
20/12/17 13:36:03 INFO JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /static/sql.
20/12/17 13:36:03 INFO ContextHandler: Started o.s.j.s.ServletContextHandler@59de5714{/static/sql,null,AVAILABLE,@Spark}
20/12/17 13:36:04 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.27:51324) with ID 3
20/12/17 13:36:04 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.9.14.30:55732) with ID 5
20/12/17 13:36:04 INFO BlockManagerMasterEndpoint: Registering block manager node2.host:40602 with 8.4 GB RAM, BlockManagerId(3, node2.host, 40602, None)
20/12/17 13:36:04 INFO BlockManagerMasterEndpoint: Registering block manager node5.host:45001 with 8.4 GB RAM, BlockManagerId(5, node5.host, 45001, None)
20/12/17 13:36:04 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/12/17 13:39:34 ERROR TaskSetManager: Task 93 in stage 31.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "/code/dist/main.py", line 155, in <module>
    job_module.analyze(spark, args.job_args, configs)
  File "jobs.zip/jobs/temp/__init__.py", line 29, in analyze
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 523, in count
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o238.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 93 in stage 31.0 failed 4 times, most recent failure: Lost task 93.3 in stage 31.0 (TID 1030, node2.host, executor 10): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/serializers.py", line 286, in dump_stream
    for series in iterator:
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/serializers.py", line 303, in load_stream
    for batch in reader:
  File "pyarrow/ipc.pxi", line 412, in __iter__
  File "pyarrow/ipc.pxi", line 432, in pyarrow.lib._CRecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Invalid IPC message: negative bodyLength

  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
  at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
  at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
  at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
  at org.apache.spark.scheduler.Task.run(Task.scala:123)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
  at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
  at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836)
  at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2835)
  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
  at org.apache.spark.sql.Dataset.count(Dataset.scala:2835)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:282)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:238)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/serializers.py", line 286, in dump_stream
    for series in iterator:
  File "/hadoop/hdfs/data03/local/usercache/zeppelin/appcache/application_1605081684999_1428/container_e20_1605081684999_1428_01_000011/pyspark.zip/pyspark/serializers.py", line 303, in load_stream
    for batch in reader:
  File "pyarrow/ipc.pxi", line 412, in __iter__
  File "pyarrow/ipc.pxi", line 432, in pyarrow.lib._CRecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Invalid IPC message: negative bodyLength

  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
  at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
  at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
  at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
  at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
  at org.apache.spark.scheduler.Task.run(Task.scala:123)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  ... 1 more
So my question is: can I process more than 2 GB per group with the pandas_udf approach on the Python + Spark stack, with the current stable versions of Spark and PyArrow?
Liya Fan / @liyafan82: [~dishka_krauch] Thanks for the detailed information. It seems the stack trace of the root cause is as follows:
java.lang.IllegalArgumentException
at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
The negative capacity was likely caused by integer overflow. According to the Arrow specification, the message size is represented as a 32-bit little-endian integer (see https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc), so the message size cannot exceed 2 GB.
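The "negative" values in these errors fall out of the arithmetic of squeezing a 64-bit length into a signed 32-bit field. A minimal Python illustration of that overflow (this is only the arithmetic, not Arrow's actual reader code):

```python
import struct

body_len = 2**31 + 100  # a body length just over 2 GiB

# Truncate to 32 bits and store little-endian, the way a 32-bit
# length field in a message header would hold it...
packed = struct.pack('<I', body_len & 0xFFFFFFFF)

# ...then reinterpret those same bytes as a *signed* 32-bit integer,
# which is how the reader sees them.
as_signed = struct.unpack('<i', packed)[0]
print(as_signed)  # -2147483548: the sign bit is set, hence "negative bodyLength"
```

Any length at or above 2**31 bytes flips the sign bit and is rejected as a negative length on the reading side.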
To overcome this constraint, we would need to change the specification, which would be a large change involving multiple languages. If you really need this, maybe you can open a discussion on the mailing list.
Micah Kornfield / @emkornfield: [~fan_li_ya] I believe the 32-bit integer is for the message metadata, not the actual message size. (All buffers use 64-bit lengths.)
Dmitry, are you using stock Spark 2.4.4? If so, I believe the bundled Java Arrow version is quite old, and at the very least you would have to set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 to have anything work (https://arrow.apache.org/blog/2019/10/06/0.15.0-release/).
I think there is still some coding work needed to upgrade Spark to a newer version of Arrow that is potentially capable of handling data > 2 GB.
Liya Fan / @liyafan82: @emkornfield You are right. According to the above stack trace, the exception was thrown when reading schema.
Dmitry Kravchuk: [~fan_li_ya] @emkornfield thanks for your replies!
Yes, I'm using Spark 2.4.4.
And yes, the main goal is to pass more than 2 GB through pandas_udf functions.
I've used ARROW_PRE_0_15_IPC_FORMAT=1 in my spark submit:
%sh
cd /home/zeppelin/code && \
export PYSPARK_DRIVER_PYTHON=/home/zeppelin/envs/env3/bin/python && \
export PYSPARK_PYTHON=./env3/bin/python && \
export ARROW_PRE_0_15_IPC_FORMAT=1 && \
spark-submit \
--master yarn \
--deploy-mode client \
--num-executors 5 \
--executor-cores 5 \
--driver-memory 8G \
--executor-memory 8G \
--conf spark.executor.memoryOverhead=4G \
--conf spark.driver.memoryOverhead=4G \
--archives /home/zeppelin/env3.tar.gz#env3 \
--jars "/opt/deltalake/delta-core_2.11-0.5.0.jar" \
--py-files jobs.zip,"/opt/deltalake/delta-core_2.11-0.5.0.jar" main.py \
--job temp
With pyarrow version 0.15.0 and a 172 MB dataset in the pandas_udf, I still get the second error (see the detailed log in my previous long message).
Any suggestions?
Dmitry Kravchuk: I found out that this error relates to pyarrow versions after 0.14.1: https://stackoverflow.com/questions/58458415/pandas-scalar-udf-failing-illegalargumentexception
How can I fix it? Upgrade Spark?
The environment variable ARROW_PRE_0_15_IPC_FORMAT=1 didn't help either.
I've tried pyarrow version 2.0.0, but it still throws the java.lang.IllegalArgumentException exception.
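One thing worth checking here (an assumption about the setup, not a confirmed fix): in YARN client mode, an `export` in the submitting shell only reaches the driver. For the flag to reach the Python workers on the executors, it has to be placed in the executor (and, for completeness, AM) environment explicitly, e.g. via Spark's `spark.executorEnv.*` and `spark.yarn.appMasterEnv.*` settings. A trimmed, hypothetical variant of the earlier spark-submit:

```shell
# Propagate the Arrow legacy-IPC flag to the executors themselves,
# instead of relying on a shell export on the submitting host.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT=1 \
  --conf spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT=1 \
  main.py --job temp
```

This only addresses where the variable is visible; per the later discussion, the flag does not lift the 2 GB message limit itself.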
UPD: I worked around this with the following approach (note the udf function) using the 172 MB dataset:
import pyspark
from pyspark.sql import functions as F, types as T
import pandas as pd

def analyze(spark):
    pdf1 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
        columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
    )
    df1 = spark.createDataFrame(pd.concat([pdf1 for i in range(429)]).reset_index()).drop('index')

    pdf2 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", "abcdefghijklmno"]],
        columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
    )
    df2 = spark.createDataFrame(pd.concat([pdf2 for i in range(4899)]).reset_index()).drop('index')

    df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')

    def myudf(df):
        import os
        os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        return df

    df4 = df3 \
        .withColumn('df1_c1', F.col('df1_c1').cast(T.IntegerType())) \
        .withColumn('df1_c2', F.col('df1_c2').cast(T.DoubleType())) \
        .withColumn('df1_c3', F.col('df1_c3').cast(T.StringType())) \
        .withColumn('df1_c4', F.col('df1_c4').cast(T.StringType())) \
        .withColumn('df2_c1', F.col('df2_c1').cast(T.IntegerType())) \
        .withColumn('df2_c2', F.col('df2_c2').cast(T.DoubleType())) \
        .withColumn('df2_c3', F.col('df2_c3').cast(T.StringType())) \
        .withColumn('df2_c4', F.col('df2_c4').cast(T.StringType())) \
        .withColumn('df2_c5', F.col('df2_c5').cast(T.StringType())) \
        .withColumn('df2_c6', F.col('df2_c6').cast(T.StringType()))

    print(df4.printSchema())
    udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)
    df5 = df4.groupBy('df1_c1').apply(udf)
    print('df5.count()', df5.count())
Dmitry Kravchuk: Finally, I tried pandas_udf with a 1.72 GB dataset, pyarrow version 2.0.0, and the ARROW_PRE_0_15_IPC_FORMAT=1 trick; spark-submit returns the error "OSError: Invalid IPC message: negative bodyLength" (see detailed log number 4 in my previous messages).
Any thoughts?
import pyspark
from pyspark.sql import functions as F, types as T
import pandas as pd

def analyze(spark):
    pdf1 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
        columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
    )
    df1 = spark.createDataFrame(pd.concat([pdf1 for i in range(429)]).reset_index()).drop('index')

    pdf2 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", "abcdefghijklmno"]],
        columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
    )
    df2 = spark.createDataFrame(pd.concat([pdf2 for i in range(48993)]).reset_index()).drop('index')

    df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')

    def myudf(df):
        import os
        os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        return df

    df4 = df3 \
        .withColumn('df1_c1', F.col('df1_c1').cast(T.IntegerType())) \
        .withColumn('df1_c2', F.col('df1_c2').cast(T.DoubleType())) \
        .withColumn('df1_c3', F.col('df1_c3').cast(T.StringType())) \
        .withColumn('df1_c4', F.col('df1_c4').cast(T.StringType())) \
        .withColumn('df2_c1', F.col('df2_c1').cast(T.IntegerType())) \
        .withColumn('df2_c2', F.col('df2_c2').cast(T.DoubleType())) \
        .withColumn('df2_c3', F.col('df2_c3').cast(T.StringType())) \
        .withColumn('df2_c4', F.col('df2_c4').cast(T.StringType())) \
        .withColumn('df2_c5', F.col('df2_c5').cast(T.StringType())) \
        .withColumn('df2_c6', F.col('df2_c6').cast(T.StringType()))

    print(df4.printSchema())
    udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)
    df5 = df4.groupBy('df1_c1').apply(udf)
    print('df5.count()', df5.count())
Liya Fan / @liyafan82: The ARROW_PRE_0_15_IPC_FORMAT environment variable controls the write_legacy_ipc_format flag, which determines whether we write 0xFFFFFFFF in the message header. However, for the Java and C++ implementations the message length is represented as a 32-bit integer regardless of the value of write_legacy_ipc_format, so the ARROW_PRE_0_15_IPC_FORMAT flag does not help with this problem.
To solve the problem, we need to either reduce the message size or remove the 2 GB constraint on the message size (the latter may involve changing the specification, and the change to the implementations would also be large).
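On the "reduce the message size" side, one workaround sketch that is sometimes used for skewed groupBy workloads (hypothetical, not proposed in this thread, and only valid when the UDF logic does not need the whole group at once): add a salt column so each pandas_udf group, and therefore each Arrow batch, stays well under the limit. A pandas-only illustration of the splitting:

```python
import numpy as np
import pandas as pd

# Sketch of the "salt" idea: split one large group into N_SALT sub-groups
# so each piece handed to Arrow stays well under the 2 GB message limit.
N_SALT = 4

pdf = pd.DataFrame({'key': [1] * 1000, 'value': np.arange(1000)})
pdf['salt'] = np.arange(len(pdf)) % N_SALT  # deterministic salt for the sketch

# In Spark this would be df.groupBy('key', 'salt').apply(udf), with the
# UDF schema extended to include (or drop) the salt column; here we just
# show the single logical group becoming N_SALT smaller pieces.
pieces = [g for _, g in pdf.groupby(['key', 'salt'])]
print(len(pieces), max(len(p) for p in pieces))  # 4 pieces of 250 rows each
```

The trade-off is that the UDF sees each logical group in several fragments, so any per-group aggregation must be mergeable across fragments.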
Dmitry Kravchuk: [~fan_li_ya] do I need to create new issue for all this stuff?
Dmitry Kravchuk: [~fan_li_ya] okay.
Micah Kornfield / @emkornfield: This should be fixed as of Spark 3.1, which upgrades the Java Arrow dependency to 2.0.
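For readers landing on this migrated issue later, the resolution reduces to a version check. A trivial helper (hypothetical name; the 3.1 threshold is taken from the comment above):

```python
def spark_bundles_arrow2(spark_version: str) -> bool:
    """Per the comment above, Spark >= 3.1 bundles Arrow Java 2.0,
    which resolves the failure discussed in this issue."""
    major, minor = (int(x) for x in spark_version.split('.')[:2])
    return (major, minor) >= (3, 1)

print(spark_bundles_arrow2('2.4.4'))  # False
print(spark_bundles_arrow2('3.1.2'))  # True
```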
Creating this in the Arrow project, as the traceback seems to suggest this is an issue in Arrow. Continuation of the conversation at https://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3CCAK7Z5T_mChuqhFDAF2U68dO=P_1Nst5AjjCRg0MExO5Kby9i-g@mail.gmail.com%3E
When I run a GROUPED_MAP UDF in Spark using PySpark, I run into the error:
as the size of the dataset I want to group on increases. Here is a code snippet where I can reproduce this. Note: my actual dataset is much larger, has many more unique IDs, and is a valid use case where I cannot simplify this groupby in any way. I have stripped out all the logic to make this example as simple as I could.
I have tried running this on Amazon EMR with Spark 2.3.1 and 20GB RAM per executor too.
Environment: Cloudera cdh5.13.3, Cloudera Spark 2.3.0.cloudera3
Reporter: Abdeali Kothari
Note: This issue was originally created as ARROW-4890. Please see the migration documentation for further details.