apache / arrow-java

Official Java implementation of Apache Arrow
https://arrow.apache.org/

[Java] Spark job fails due to arrow buf limitation #342

Open asfimport opened 2 years ago

asfimport commented 2 years ago

 

Hello,

groupBy + applyInPandas results in the following error. We need a parameter to tune the buffer size.

 


Caused by: java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))
    at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)
    at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)
    at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)
    at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)
    at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)
    at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:287)
    at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:151)
    at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:105)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:122)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:478)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)

Reporter: Shubham Chhabra

Note: This issue was originally created as ARROW-15983. Please see the migration documentation for further details.
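
For readers trying to interpret the exception message in the first stack trace, the following is a minimal sketch that pulls the numbers out of it. The helper name `describe_bounds_error` is hypothetical and not part of Arrow or Spark, and reading `range(0, 0)` as the buffer's valid byte range (so a capacity of 0) is an assumption about the message format rather than something the thread confirms.

```python
import re

# Pattern for the message format seen in the trace:
# "index: 0, length: 1073741824 (expected: range(0, 0))"
_BOUNDS_RE = re.compile(
    r"index: (\d+), length: (\d+) \(expected: range\((\d+), (\d+)\)\)"
)


def describe_bounds_error(message):
    """Parse a bounds-check message into its numeric parts (hypothetical helper)."""
    match = _BOUNDS_RE.search(message)
    if match is None:
        raise ValueError("unrecognized bounds-check message")
    index, length, low, high = (int(g) for g in match.groups())
    return {
        "index": index,
        "length": length,
        "length_gib": length / 2**30,  # requested length expressed in GiB
        "capacity": high - low,        # assumed: width of the valid range
    }


info = describe_bounds_error(
    "java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 "
    "(expected: range(0, 0))"
)
print(info)
```

Under that reading, the write asked for exactly 2^30 bytes (1 GiB) starting at index 0 on a buffer whose valid range was empty.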

tklinchik commented 1 year ago

Any updates or a workaround for this? I'm unable to read any Arrow files produced by Python due to this issue.


```
    at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:701)
    at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:955)
    at org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:451)
    at org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:732)
    at org.apache.arrow.vector.VectorSchemaRoot.setRowCount(VectorSchemaRoot.java:240)
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:86)
    at org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:220)
    at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:166)
    at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:197)
```

westonpace commented 1 year ago

I've relabeled this as Java and not Python. Though I suppose it is an integration. It looks like the Java reader is not able to load a file produced by Python? Can you supply an example file that fails with this error?