databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Task failed while writing rows. #600

Closed xifan987 closed 1 year ago

xifan987 commented 2 years ago

My code looks like this:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .getOrCreate()
df = spark.read.format('xml').options(rowTag='page').load(xml_file, schema=xml_schema)

It works when I load most XML files (20 GB~40 GB), but on a few files a few tasks fail with the trace below (a rough per-file isolation sketch follows it):

org.apache.spark.SparkException: Task failed while writing rows.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:484)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$37(FileFormatWriter.scala:360)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:130)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:476)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1514)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:479)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
    at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:810)
    at org.apache.hadoop.io.Text.encode(Text.java:455)
    at org.apache.hadoop.io.Text.set(Text.java:198)
    at com.databricks.spark.xml.XmlRecordReader.next(XmlInputFormat.scala:184)
    at com.databricks.spark.xml.XmlRecordReader.nextKeyValue(XmlInputFormat.scala:165)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:251)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:741)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:156)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:148)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:295)
    at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:607)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:383)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2069)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:218)
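
For anyone trying to reproduce this: one rough way to narrow down which input file triggers it is to force a full scan of each file separately. This is a hypothetical sketch, not anything from spark-xml: xml_files is assumed to be a list of the input paths, and spark / xml_schema are as in the snippet above.

for path in xml_files:
    try:
        n = (spark.read.format('xml')
             .options(rowTag='page')
             .load(path, schema=xml_schema)
             .count())  # force a full scan so the read error surfaces here
        print(path, 'OK,', n, 'rows')
    except Exception as e:  # Py4J re-raises the Java exception in Python
        print(path, 'FAILED:', e)

count() forces evaluation, so the failing file is reported on its own instead of killing a job over the whole batch.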
xifan987 commented 2 years ago

It seems the content of some single rows is too big: it exceeds 2 GB, which can't be read into a Java string.
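
If that's right, the trace fits: Hadoop's Text.set re-encodes the record into a ByteBuffer, and Java buffer capacities are signed 32-bit ints, so a size computed from a row of more than Integer.MAX_VALUE bytes would wrap negative, which is exactly what ByteBuffer.allocate rejects with IllegalArgumentException. A toy sketch of that wraparound (the row size here is made up):

INT_MAX = 2**31 - 1                # Integer.MAX_VALUE = 2147483647
row_bytes = 2_200_000_000          # an assumed ~2.2 GB encoded row
wrapped = row_bytes & 0xFFFFFFFF   # keep only the low 32 bits, like Java
if wrapped > INT_MAX:
    wrapped -= 2**32               # two's complement: the value goes negative
print(wrapped)                     # -2094967296, a negative buffer capacity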

srowen commented 2 years ago

That's my guess too - one single row exceeds 2 GB. Do you have huge single rows, or possibly an unclosed row tag? Either way this wouldn't work; it's just not something that's reasonable to handle.
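
A quick way to check both possibilities without Spark (a hedged sketch, not part of spark-xml; the tag literals match rowTag='page' and the function name is made up) is to stream the raw file and measure the byte span between each opening row tag and its matching close:

def scan_row_spans(path, open_tag=b'<page', close_tag=b'</page>', chunk=1 << 20):
    """Return (largest row span in bytes, offset of any unclosed open tag).

    Matching b'<page' is approximate (it would also hit a tag like <pages>),
    but it is good enough for a rough check.
    """
    largest = 0
    open_pos = None   # absolute offset of the current unmatched open tag
    offset = 0        # absolute offset of buf[0] within the file
    tail = b''
    with open(path, 'rb') as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            buf = tail + data
            i = 0
            while True:
                if open_pos is None:
                    j = buf.find(open_tag, i)
                    if j < 0:
                        break
                    open_pos = offset + j
                    i = j + len(open_tag)
                else:
                    j = buf.find(close_tag, i)
                    if j < 0:
                        break
                    largest = max(largest, offset + j + len(close_tag) - open_pos)
                    open_pos = None
                    i = j + len(close_tag)
            # keep a small tail so tags split across chunk boundaries still match
            keep = max(len(open_tag), len(close_tag)) - 1
            tail = buf[-keep:]
            offset += len(buf) - len(tail)
    return largest, open_pos

On the failing file, a largest value approaching 2**31 - 1 would point at a huge single row, while a non-None open_pos at end of file would point at an unclosed row tag.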