apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1 #21400

Closed: asfimport closed this issue 3 years ago

asfimport commented 5 years ago

Creating this in the Arrow project as the traceback seems to suggest this is an issue in Arrow. This is a continuation of the conversation at https://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3CCAK7Z5T_mChuqhFDAF2U68dO=P_1Nst5AjjCRg0MExO5Kby9i-g@mail.gmail.com%3E

When I run a GROUPED_MAP UDF in Spark using PySpark, I run into the error:


  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 279, in load_stream
    for batch in reader:
  File "pyarrow/ipc.pxi", line 265, in __iter__
  File "pyarrow/ipc.pxi", line 281, in pyarrow.lib._RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: read length must be positive or -1

as the size of the dataset I want to group on increases. Here is a code snippet with which I can reproduce the error. Note: my actual dataset is much larger, has many more unique IDs, and is a valid use case where I cannot simplify this groupby in any way. I have stripped out all the logic to make this example as simple as I could.


import os

os.environ['PYSPARK_SUBMIT_ARGS'] = '--executor-memory 9G pyspark-shell'
import findspark
findspark.init()
import pyspark
from pyspark.sql import functions as F, types as T
import pandas as pd

spark = pyspark.sql.SparkSession.builder.getOrCreate()

pdf1 = pd.DataFrame(
    [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
    columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
)
df1 = spark.createDataFrame(pd.concat([pdf1 for i in range(429)]).reset_index()).drop('index')

pdf2 = pd.DataFrame(
    [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", "abcdefghijklmno"]],
    columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
)
df2 = spark.createDataFrame(pd.concat([pdf2 for i in range(48993)]).reset_index()).drop('index')
df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')

def myudf(df):
    return df

df4 = df3
udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)

df5 = df4.groupBy('df1_c1').apply(udf)
print('df5.count()', df5.count())

# df5.write.parquet('/tmp/temp.parquet', mode='overwrite')

I have tried running this on Amazon EMR with Spark 2.3.1 and 20GB RAM per executor too.

Environment: Cloudera cdh5.13.3, Cloudera Spark 2.3.0.cloudera3
Reporter: Abdeali Kothari

Note: This issue was originally created as ARROW-4890. Please see the migration documentation for further details.

asfimport commented 5 years ago

Arvind Ravish: I get the same thing when running the repro code in Databricks. Can we get some description about what the error means?

Plan looks like this

== Physical Plan ==
*(7) HashAggregate(keys=[], functions=[finalmerge_count(merge count#506L) AS count(1)#502L])
+- Exchange SinglePartition
   +- *(6) HashAggregate(keys=[], functions=[partial_count(1) AS count#506L])
      +- *(6) Project
         +- FlatMapGroupsInPandas [df1_c1#72L], myudf(df1_c1#72L, df1_c2#73, df1_c3#74, df1_c4#75, df2_c1#108L, df2_c2#109, df2_c3#110, df2_c4#111, df2_c5#112, df2_c6#113), [df1_c1#264L, df1_c2#265, df1_c3#266, df1_c4#267, df2_c1#268L, df2_c2#269, df2_c3#270, df2_c4#271, df2_c5#272, df2_c6#273]
            +- *(5) Project [df1_c1#72L, df1_c1#72L, df1_c2#73, df1_c3#74, df1_c4#75, df2_c1#108L, df2_c2#109, df2_c3#110, df2_c4#111, df2_c5#112, df2_c6#113]
               +- *(5) SortMergeJoin [df1_c1#72L], [df2_c1#108L], Inner
                  :- *(2) Sort [df1_c1#72L ASC NULLS FIRST], false, 0
                  :  +- Exchange hashpartitioning(df1_c1#72L, 200)
                  :     +- *(1) Project [df1_c1#72L, df1_c2#73, df1_c3#74, df1_c4#75]
                  :        +- *(1) Filter isnotnull(df1_c1#72L)
                  :           +- *(1) Scan ExistingRDD[index#71L,df1_c1#72L,df1_c2#73,df1_c3#74,df1_c4#75]
                  +- *(4) Sort [df2_c1#108L ASC NULLS FIRST], false, 0
                     +- Exchange hashpartitioning(df2_c1#108L, 200)
                        +- *(3) Project [df2_c1#108L, df2_c2#109, df2_c3#110, df2_c4#111, df2_c5#112, df2_c6#113]
                           +- *(3) Filter isnotnull(df2_c1#108L)
                              +- *(3) Scan ExistingRDD[index#107L,df2_c1#108L,df2_c2#109,df2_c3#110,df2_c4#111,df2_c5#112,df2_c6#113]

(attached screenshot: image-2019-07-04-12-03-57-002.png)

An error occurred while calling o1471.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 93 in stage 39.0 failed 4 times, most recent failure: Lost task 93.3 in stage 39.0 (TID 1261, 10.139.64.6, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 403, in main
    process()
  File "/databricks/spark/python/pyspark/worker.py", line 398, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/databricks/spark/python/pyspark/serializers.py", line 296, in dump_stream
    for series in iterator:
  File "/databricks/spark/python/pyspark/serializers.py", line 319, in load_stream
    for batch in generator():
  File "/databricks/spark/python/pyspark/serializers.py", line 314, in generator
    for batch in reader:
  File "pyarrow/ipc.pxi", line 268, in __iter__ (/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:70278)
  File "pyarrow/ipc.pxi", line 284, in pyarrow.lib._RecordBatchReader.read_next_batch (/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:70534)
  File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:8345)
pyarrow.lib.ArrowIOError: read length must be positive or -1

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:490)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:444)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithoutKey_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:634)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
    at org.apache.spark.scheduler.Task.run(Task.scala:112)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

asfimport commented 5 years ago

Micah Kornfield / @emkornfield: This seems to indicate that the stream of data being passed to Arrow isn't in the correct format (or Arrow is misinterpreting it).

asfimport commented 4 years ago

SURESH CHAGANTI: [~emkornfield@gmail.com] is there any limit on how much data we can send to pandas_udf? I am also seeing the same error as above, and my groups are pretty large: around 200M records, roughly 2 to 4 GB in size.

asfimport commented 4 years ago

Micah Kornfield / @emkornfield: Yes.  I believe it is 2GB per shard currently.
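
For context on that limit, here is a hedged sketch (reusing df3 and the df1_c1 key from the repro above) of how one might flag groups that could approach a ~2GB per-batch cap before running the UDF. The 200-bytes-per-row figure is an illustrative assumption; estimate it from a sample of your own data.

from pyspark.sql import functions as F

EST_BYTES_PER_ROW = 200  # assumption: measure on a sample of your data
group_sizes = (
    df3.groupBy("df1_c1")
       .count()
       .withColumn("approx_bytes", F.col("count") * EST_BYTES_PER_ROW)
)
# Groups whose estimated Arrow payload would exceed 2 GiB
group_sizes.filter(F.col("approx_bytes") > 2 * 1024**3).show()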

asfimport commented 4 years ago

SURESH CHAGANTI: Got it, thank you [~emkornfield@gmail.com]. Is there any way we can increase that size?

I am assuming the data that gets sent to the pandas_udf method is in uncompressed format.
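
One hedged way to check that assumption on real data is to convert a representative pandas sample of a single group to an Arrow table and read its in-memory (uncompressed) size; sample_pdf below is a hypothetical placeholder for such a sample.

import pyarrow as pa

# sample_pdf: assumed to be a pandas DataFrame holding one representative group
table = pa.Table.from_pandas(sample_pdf, preserve_index=False)
print(f"approx uncompressed Arrow size: {table.nbytes / 1024**3:.2f} GiB")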

 

asfimport commented 4 years ago

SURESH CHAGANTI: [~AbdealiJK] did you find any fix or workaround for this error?

asfimport commented 4 years ago

Abdeali Kothari: No, I had to stop using pandas UDFs due to this and find another approach for the transformations I need to do.

asfimport commented 4 years ago

SURESH CHAGANTI: Thank you [~AbdealiJK]. May I ask which other approach you are using?

asfimport commented 4 years ago

Abdeali Kothari: We changed our pipeline to do it with joins and explodes, using Spark functions.

It was a while back; I don't remember the exact specifics of that pipeline.

asfimport commented 4 years ago

Micah Kornfield / @emkornfield: "I am assuming the data that gets sent to pandas_udf method is in the uncompressed format", yes I believe this to be true but this isn't really the main limitation.  

 

Currently, ArrowBufs (the components that hold memory) are limited to less than 2GB each. I need to clean up https://github.com/apache/arrow/pull/5020 to address this (and other limitations). I actually might have mis-stated the actual limit per shard for toPandas functions.

[~cutlerb]  do you know what the actual limits are? I can't seem to find any documentation on it.

asfimport commented 4 years ago

SURESH CHAGANTI: Thank you [~emkornfield@gmail.com], I appreciate your time. I have run multiple tests with different data sizes, and any shard larger than 2GB failed. Glad to see the fix is in progress; it would be great if this change rolls out soon. I will be happy to contribute to this issue.

asfimport commented 4 years ago

Bryan Cutler / @BryanCutler: Sorry, I'm not sure of any documentation with the limits. It would be great to get that down somewhere and there should be a better error message for this, but maybe it should be done on the Spark side.

asfimport commented 4 years ago

Micah Kornfield / @emkornfield: I agree, it should probably be on the Spark side (assuming the root cause is hitting caps in Arrow).

asfimport commented 4 years ago

SURESH CHAGANTI: Sure, I will create an issue against Spark, thank you! For now, to get around the issue, I will build the code with https://github.com/apache/arrow/pull/5020 and see what happens. Thank you [~emkornfield@gmail.com] & @BryanCutler

asfimport commented 4 years ago

SURESH CHAGANTI: I guess this branch addresses the 2 GB issue (https://github.com/emkornfield/arrow/tree/int64_address). Could you please confirm? Thank you.

  

asfimport commented 4 years ago

Micah Kornfield / @emkornfield: Yes that is the one. It hasn't been tested too much yet, but if you have time to check to see if it works for you that would be great.

asfimport commented 4 years ago

SURESH CHAGANTI: thank you. I will keep you guys posted on my testing. 

asfimport commented 4 years ago

Raz Cohen: Hey guys, is there any news regarding the 2GB issue?

asfimport commented 4 years ago

Micah Kornfield / @emkornfield: There have been a few PRs checked in, but the full end-to-end IPC path has not been tested yet.  CC [~fan_li_ya]

asfimport commented 4 years ago

Liya Fan / @liyafan82: Sure. [~emkornfield@gmail.com] is right. After [~emkornfield@gmail.com] has finished the implementation of the 64-bit buffer, we have a few follow-up work items to do before we can claim that the 2GB restriction is removed:

  1. In ARROW-7610, we apply the 64-bit buffer to vector implementations and add integration tests. This work item is ongoing.
  2. In ARROW-6111, we provide new vectors to support 64-bit buffers, as the current ones have an offset width of 4 bytes (see the sketch after this list). This work item is ongoing. (We would appreciate it if anyone could provide feedback/review comments for the PRs for 1 and 2.)
  3. We need another work item to make everything work in IPC scenarios. This is tracked by ARROW-7746 and has not started yet. (We would appreciate it if anyone could propose a solution to this issue. Otherwise, I will try to provide one some days later.)
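
A minimal pyarrow illustration of the 4-byte offset limit mentioned in item 2 (shown on the Python side, since the JIRAs above track the Java work): the regular string type indexes its character data with 32-bit offsets, which caps a single array near 2GB of text, while large_string uses 64-bit offsets. This is a sketch of the type difference only, not of the Spark code path.

import pyarrow as pa

# string() arrays use int32 offsets, so one array can address at most ~2GB of text
print(pa.string())        # string
# large_string() uses int64 offsets and removes that per-array cap
print(pa.large_string())  # large_string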
asfimport commented 4 years ago

Ruslan Dautkhanov: [~fan_li_ya] it looks like all the follow-up JIRAs you mentioned are now resolved in Arrow 1.0. Does this automatically resolve this JIRA, or are there some follow-ups left?

I think Netty still has a 2GB limit? Would we still be running into that limit too? Thanks.


java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0)) 
at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716) 
at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954) 
at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508) 
at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239) 
at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1028)
asfimport commented 4 years ago

Micah Kornfield / @emkornfield: [~Tagar]  this doesn't automatically resolve the JIRA.  But I think most of the work in Arrow is taken care of as of 1.0.  I would guess there is likely still work to do in Spark and there might be a few more bugs to work out.   

 

We've introduced an alternative allocator that isn't based on Netty.

asfimport commented 4 years ago

Ruslan Dautkhanov: Thanks for the details [~emkornfield@gmail.com]! Great to know the issue with the Netty allocator and some other related issues are now resolved. Understood that there might be other things to complete on the Arrow side to lift the 2GB limitation. Created https://issues.apache.org/jira/browse/SPARK-32294 for the Spark side.

asfimport commented 3 years ago

Dmitry Kravchuk: I've tried pyarrow 2.0.0 on Spark 2.4.4 and still got the familiar error: "OSError: Invalid IPC message: negative bodyLength".

Is there any news on resolving this issue?

asfimport commented 3 years ago

Liya Fan / @liyafan82: [~dishka_krauch] It is difficult to find the cause without more details about the problem. The possible reasons that come to mind include:

  1. If you are using Flight for IPC, buffers larger than 2GB are not supported, as restricted by GRPC.
  2. Please make sure you are using UnsafeAllocationManager as the allocation manager (not NettyAllocationManager)
asfimport commented 3 years ago

Dmitry Kravchuk: [~fan_li_ya] I just used the code from the description of this issue on my Hadoop cluster.

Which details do you need to look at the whole problem? I can give you everything, just let me know.

asfimport commented 3 years ago

Liya Fan / @liyafan82: [~dishka_krauch] The problem occurred during IPC from Python to Java? I think the stack traces and some other logs would be helpful.

asfimport commented 3 years ago

Dmitry Kravchuk: [~fan_li_ya] okay, here we go.

Spark version - 2.4.4

Python env:

asfimport commented 3 years ago

Liya Fan / @liyafan82: [~dishka_krauch] Thanks for the detailed information. It seems the stack trace of the root cause is as follows:


java.lang.IllegalArgumentException
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
    at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
    at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
    at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
    at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
    at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
    at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)

The negative capacity was likely caused by integer overflow. According to the Arrow specification, the message size is represented as a 32-bit little-endian integer (see https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc), so the message size cannot exceed 2GB.

To overcome this constraint, we would need to change the specification, which would be a large change involving multiple languages. If you really need to do so, maybe you can open a discussion on the mailing list.
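
For reference, a small pyarrow sketch of the framing being discussed (assuming the current, post-0.15 IPC format): each encapsulated message starts with a 0xFFFFFFFF continuation marker followed by a 32-bit little-endian metadata length.

import struct
import pyarrow as pa

# Write one tiny record batch to an in-memory IPC stream
batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], names=["x"])
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
buf = sink.getvalue()

# The stream begins with the schema message: continuation marker, then
# the 32-bit little-endian length of the flatbuffer metadata that follows
continuation, metadata_len = struct.unpack_from("<Ii", buf, 0)
print(hex(continuation), metadata_len)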

asfimport commented 3 years ago

Micah Kornfield / @emkornfield: [~fan_li_ya] I believe the 32-bit integer is only for the message metadata, not the actual message body size. (All buffers use 64-bit lengths.)

 

Dmitry, are you using stock Spark 2.4.4? If so, I believe the Java Arrow version it depends on is quite old, and at the very least you would have to set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 to have anything work (https://arrow.apache.org/blog/2019/10/06/0.15.0-release/).

I think there is still some coding work to upgrade Spark to a new version of Arrow that is potentially capable of handling data > 2GB.

asfimport commented 3 years ago

Liya Fan / @liyafan82: @emkornfield You are right. According to the above stack trace, the exception was thrown when reading the schema.

asfimport commented 3 years ago

Dmitry Kravchuk: [~fan_li_ya] @emkornfield thanks for your replies!

Yes, I'm using Spark 2.4.4.

And yes, the main goal is to pass more than 2GB to pandas_udf functions.

I've used ARROW_PRE_0_15_IPC_FORMAT=1 in my spark-submit:

 


%sh
cd /home/zeppelin/code && \
export PYSPARK_DRIVER_PYTHON=/home/zeppelin/envs/env3/bin/python && \
export PYSPARK_PYTHON=./env3/bin/python && \
export ARROW_PRE_0_15_IPC_FORMAT=1 && \
spark-submit \
--master yarn \
--deploy-mode client \
--num-executors 5 \
--executor-cores 5 \
--driver-memory 8G \
--executor-memory 8G \
--conf spark.executor.memoryOverhead=4G \
--conf spark.driver.memoryOverhead=4G \
--archives /home/zeppelin/env3.tar.gz#env3 \
--jars "/opt/deltalake/delta-core_2.11-0.5.0.jar" \
--py-files jobs.zip,"/opt/deltalake/delta-core_2.11-0.5.0.jar" main.py \
--job temp

 

With pyarrow version 0.15.0 and a 172 MB dataset in pandas_udf, I still got error number 2 (see the detailed log in my previous big message).

Any suggestions?

asfimport commented 3 years ago

Dmitry Kravchuk: I found out that this error is related to pyarrow versions after 0.14.1: https://stackoverflow.com/questions/58458415/pandas-scalar-udf-failing-illegalargumentexception

How can I fix it? Upgrade Spark?

The environment variable ARROW_PRE_0_15_IPC_FORMAT=1 didn't help either.

I've tried pyarrow version 2.0.0, but it still throws the java.lang.IllegalArgumentException.

UPD: I've worked around this as follows (note the udf function), using the 172 MB dataset:


import pyspark
from pyspark.sql import functions as F, types as T
import pandas as pd

def analyze(spark):

    pdf1 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
        columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
    )
    df1 = spark.createDataFrame(pd.concat([pdf1 for i in range(429)]).reset_index()).drop('index')

    pdf2 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", "abcdefghijklmno"]],
        columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
    )
    df2 = spark.createDataFrame(pd.concat([pdf2 for i in range(4899)]).reset_index()).drop('index')
    df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')

    def myudf(df):
        import os
        os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        return df

    df4 = df3 \
        .withColumn('df1_c1', F.col('df1_c1').cast(T.IntegerType())) \
        .withColumn('df1_c2', F.col('df1_c2').cast(T.DoubleType())) \
        .withColumn('df1_c3', F.col('df1_c3').cast(T.StringType())) \
        .withColumn('df1_c4', F.col('df1_c4').cast(T.StringType())) \
        .withColumn('df2_c1', F.col('df2_c1').cast(T.IntegerType())) \
        .withColumn('df2_c2', F.col('df2_c2').cast(T.DoubleType())) \
        .withColumn('df2_c3', F.col('df2_c3').cast(T.StringType())) \
        .withColumn('df2_c4', F.col('df2_c4').cast(T.StringType())) \
        .withColumn('df2_c5', F.col('df2_c5').cast(T.StringType())) \
        .withColumn('df2_c6', F.col('df2_c6').cast(T.StringType()))
    print(df4.printSchema())

    udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)

    df5 = df4.groupBy('df1_c1').apply(udf)
    print('df5.count()', df5.count())
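
A hedged alternative to setting the variable inside the UDF is to propagate it to the driver and executors when the session is created. spark.executorEnv.* and spark.yarn.appMasterEnv.* are standard Spark settings, but whether they reach the Python workers in a given deployment is an assumption to verify, not a confirmed fix.

import os
import pyspark

# Driver-side Python workers pick the variable up from the driver environment
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

spark = (
    pyspark.sql.SparkSession.builder
    # assumed to propagate the variable to executor-side Python workers on YARN
    .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
    .config("spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
    .getOrCreate()
)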

 

asfimport commented 3 years ago

Dmitry Kravchuk: Finally, I've tried pandas_udf with a 1.72 GB dataset, pyarrow version 2.0.0, and the ARROW_PRE_0_15_IPC_FORMAT=1 setting; spark-submit returns the error "OSError: Invalid IPC message: negative bodyLength" (see detailed log number 4 in my previous messages).

https://github.com/apache/arrow/blob/f95781eca2c924a4c0882c5b619e18f521ae93d3/cpp/src/arrow/ipc/message.cc#L189

Any thoughts?

 


import pyspark
from pyspark.sql import functions as F, types as T
import pandas as pd

def analyze(spark):

    pdf1 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
        columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
    )
    df1 = spark.createDataFrame(pd.concat([pdf1 for i in range(429)]).reset_index()).drop('index')

    pdf2 = pd.DataFrame(
        [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", "abcdefghijklmno"]],
        columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
    )
    df2 = spark.createDataFrame(pd.concat([pdf2 for i in range(48993)]).reset_index()).drop('index')
    df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')

    def myudf(df):
        import os
        os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        return df

    df4 = df3 \
        .withColumn('df1_c1', F.col('df1_c1').cast(T.IntegerType())) \
        .withColumn('df1_c2', F.col('df1_c2').cast(T.DoubleType())) \
        .withColumn('df1_c3', F.col('df1_c3').cast(T.StringType())) \
        .withColumn('df1_c4', F.col('df1_c4').cast(T.StringType())) \
        .withColumn('df2_c1', F.col('df2_c1').cast(T.IntegerType())) \
        .withColumn('df2_c2', F.col('df2_c2').cast(T.DoubleType())) \
        .withColumn('df2_c3', F.col('df2_c3').cast(T.StringType())) \
        .withColumn('df2_c4', F.col('df2_c4').cast(T.StringType())) \
        .withColumn('df2_c5', F.col('df2_c5').cast(T.StringType())) \
        .withColumn('df2_c6', F.col('df2_c6').cast(T.StringType()))
    print(df4.printSchema())

    udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)

    df5 = df4.groupBy('df1_c1').apply(udf)
    print('df5.count()', df5.count())

 

asfimport commented 3 years ago

Liya Fan / @liyafan82: ARROW_PRE_0_15_IPC_FORMAT controls the write_legacy_ipc_format flag, which determines whether we write 0xffffffff in the message header. However, for the Java and C++ implementations, the message length is represented with a 32-bit integer regardless of the value of write_legacy_ipc_format, so the ARROW_PRE_0_15_IPC_FORMAT flag does not help with this problem.

To solve the problem, we need to either reduce the message size or remove the 2GB constraint on the message size (the latter may involve changing the specification, and the change to the implementations would also be large).
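
Until that happens, one hedged way to "reduce the message size" on the user side is to salt the group key so that no single group handed to the GROUPED_MAP UDF approaches the cap. This only works when the per-group logic can operate on partial groups; the sketch below reuses df4, df1_c1, and the schema from the repro above, and n_salt is an illustrative parameter.

from pyspark.sql import functions as F

n_salt = 16  # illustrative; choose so each (key, salt) slice stays well under 2GB

def myudf_salted(pdf):
    # identity transform on one slice of a group; drop the helper column
    # so the output matches the declared schema (df4.schema)
    return pdf.drop(columns=["salt"])

udf_salted = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf_salted)

df_salted = df4.withColumn("salt", (F.rand() * n_salt).cast("int"))
df5 = df_salted.groupBy("df1_c1", "salt").apply(udf_salted)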

asfimport commented 3 years ago

Dmitry Kravchuk: [~fan_li_ya] do I need to create new issue for all this stuff?

asfimport commented 3 years ago

Liya Fan / @liyafan82: I think some discussions are needed.

asfimport commented 3 years ago

Dmitry Kravchuk: [~fan_li_ya] okay.

https://issues.apache.org/jira/browse/ARROW-10957

asfimport commented 3 years ago

Micah Kornfield / @emkornfield: This should be fixed as of Spark 3.1, which upgrades the Java Arrow dependency to 2.0.