databricks / spark-xml

XML data source for Spark SQL and DataFrames

Is there any option to ignore invalid characters when writing XML? #497

Closed. eubnara closed this issue 3 years ago.

eubnara commented 3 years ago

Hello, I'm struggling to write XML that contains some invisible characters.

I read data from MySQL through JDBC and write it as XML on HDFS.

But I hit `Caused by: com.ctc.wstx.exc.WstxIOException: Invalid white space character (0x2) in text to output (in xml 1.1, could output as a character entity)`.

Error logs:

```
org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:100)
  at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopDataset$1(PairRDDFunctions.scala:1090)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1088)
  at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$4(PairRDDFunctions.scala:1061)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1026)
  at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$3(PairRDDFunctions.scala:1008)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1007)
  at org.apache.spark.rdd.PairRDDFunctions.$anonfun$saveAsHadoopFile$2(PairRDDFunctions.scala:964)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
  at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:962)
  at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$2(RDD.scala:1552)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1552)
  at org.apache.spark.rdd.RDD.$anonfun$saveAsTextFile$1(RDD.scala:1538)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
  at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1538)
  at com.databricks.spark.xml.util.XmlFile$.saveAsXmlFile(XmlFile.scala:127)
  at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:102)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
  at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
  ... 40 elided
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 4 times, most recent failure: Lost task 0.3 in stage 11.0 (TID 29, ac3f7x2012.bdp.bdata.ai, executor 1): org.apache.spark.SparkException: Task failed while writing rows
  at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:157)
  at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:83)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:127)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Caused by: com.ctc.wstx.exc.WstxIOException: Invalid white space character (0x2) in text to output (in xml 1.1, could output as a character entity)
  at com.ctc.wstx.sw.BaseStreamWriter.writeCharacters(BaseStreamWriter.java:480)
  at com.sun.xml.txw2.output.DelegatingXMLStreamWriter.writeCharacters(DelegatingXMLStreamWriter.java:116)
  at com.sun.xml.txw2.output.IndentingXMLStreamWriter.writeCharacters(IndentingXMLStreamWriter.java:158)
  at com.databricks.spark.xml.parsers.StaxXmlGenerator$.writeElement$1(StaxXmlGenerator.scala:78)
  at com.databricks.spark.xml.parsers.StaxXmlGenerator$.writeChildElement$1(StaxXmlGenerator.scala:50)
  at com.databricks.spark.xml.parsers.StaxXmlGenerator$.writeChild$1(StaxXmlGenerator.scala:72)
  at com.databricks.spark.xml.parsers.StaxXmlGenerator$.$anonfun$apply$7(StaxXmlGenerator.scala:117)
  at com.databricks.spark.xml.parsers.StaxXmlGenerator$.$anonfun$apply$7$adapted(StaxXmlGenerator.scala:115)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at com.databricks.spark.xml.parsers.StaxXmlGenerator$.writeElement$1(StaxXmlGenerator.scala:115)
  at com.databricks.spark.xml.parsers.StaxXmlGenerator$.apply(StaxXmlGenerator.scala:142)
  at com.databricks.spark.xml.util.XmlFile$$anon$1.next(XmlFile.scala:103)
  at com.databricks.spark.xml.util.XmlFile$$anon$1.next(XmlFile.scala:87)
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
  at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$executeTask$1(SparkHadoopWriter.scala:131)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
  at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:129)
  ... 9 more
Caused by: java.io.IOException: Invalid white space character (0x2) in text to output (in xml 1.1, could output as a character entity)
  at com.ctc.wstx.api.InvalidCharHandler$FailingHandler.convertInvalidChar(InvalidCharHandler.java:56)
  at com.ctc.wstx.sw.XmlWriter.handleInvalidChar(XmlWriter.java:629)
  at com.ctc.wstx.sw.BufferingXmlWriter.writeCharacters(BufferingXmlWriter.java:494)
  at com.ctc.wstx.sw.BaseStreamWriter.writeCharacters(BaseStreamWriter.java:478)
  ... 27 more
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
  at scala.Option.foreach(Option.scala:407)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2152)
  at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
  ... 99 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows
  at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:157)
  at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:83)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:127)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
  ... 3 more
Caused by: com.ctc.wstx.exc.WstxIOException: Invalid white space character (0x2) in text to output (in xml 1.1, could output as a character entity)
  at com.ctc.wstx.sw.BaseStreamWriter.writeCharacters(BaseStreamWriter.java:480)
  at com.sun.xml.txw2.output.DelegatingXMLStreamWriter.writeCharacters(DelegatingXMLStreamWriter.java:116)
  at com.sun.xml.txw2.output.IndentingXMLStreamWriter.writeCharacters(IndentingXMLStreamWriter.java:158)
  at com.databricks.spark.xml.parsers.StaxXmlGenerator$.writeElement$1(StaxXmlGenerator.scala:78)
  at com.databricks.spark.xml.parsers.StaxXmlGenerator$.writeChildElement$1(StaxXmlGenerator.scala:50)
  at com.databricks.spark.xml.parsers.StaxXmlGenerator$.writeChild$1(StaxXmlGenerator.scala:72)
  at com.databricks.spark.xml.parsers.StaxXmlGenerator$.$anonfun$apply$7(StaxXmlGenerator.scala:117)
  at com.databricks.spark.xml.parsers.StaxXmlGenerator$.$anonfun$apply$7$adapted(StaxXmlGenerator.scala:115)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at com.databricks.spark.xml.parsers.StaxXmlGenerator$.writeElement$1(StaxXmlGenerator.scala:115)
  at com.databricks.spark.xml.parsers.StaxXmlGenerator$.apply(StaxXmlGenerator.scala:142)
  at com.databricks.spark.xml.util.XmlFile$$anon$1.next(XmlFile.scala:103)
  at com.databricks.spark.xml.util.XmlFile$$anon$1.next(XmlFile.scala:87)
  at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
  at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$executeTask$1(SparkHadoopWriter.scala:131)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
  at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:129)
  ... 9 more
Caused by: java.io.IOException: Invalid white space character (0x2) in text to output (in xml 1.1, could output as a character entity)
  at com.ctc.wstx.api.InvalidCharHandler$FailingHandler.convertInvalidChar(InvalidCharHandler.java:56)
  at com.ctc.wstx.sw.XmlWriter.handleInvalidChar(XmlWriter.java:629)
  at com.ctc.wstx.sw.BufferingXmlWriter.writeCharacters(BufferingXmlWriter.java:494)
  at com.ctc.wstx.sw.BaseStreamWriter.writeCharacters(BaseStreamWriter.java:478)
  ... 27 more
```

I worked around it as follows:

```scala
// http://www.regular-expressions.info/unicode.html
import org.apache.spark.sql.functions.{col, regexp_replace}

// Replace Unicode control characters (\p{Cc}) with '?' before writing XML.
val df = spark.table("people")
  .select(
    regexp_replace(col("problematic_col1"), "[\\p{Cc}]", "?").as("col1"),
    regexp_replace(col("problematic_col2"), "[\\p{Cc}]", "?").as("col2")
  )
```

Is there any better way than this?

srowen commented 3 years ago

Weird, do you know what kind of characters are in the values you are writing? They look like control codes, and it sounds like they are not allowed in XML 1.0 even as entities. So that behavior is probably 'right', in that there is no correct way to represent what you are outputting. Woodstox can be configured to replace invalid characters with something else, but I don't know whether that helps users in general, as you probably don't want to throw them away; and if you do, you can do that manually.
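
For reference, a minimal sketch of the Woodstox knob mentioned above, assuming Woodstox 5.x is the StAX provider on the classpath (as it is for spark-xml). spark-xml builds its stream writer internally and does not appear to expose this property, which is essentially what this issue is asking for:

```scala
import javax.xml.stream.XMLOutputFactory
import com.ctc.wstx.api.{InvalidCharHandler, WstxOutputProperties}

// Woodstox installs InvalidCharHandler.FailingHandler by default, which is
// what raises the WstxIOException in the logs above. A ReplacingHandler
// substitutes a fixed character instead of failing.
val factory = XMLOutputFactory.newInstance() // Woodstox, if on the classpath
factory.setProperty(
  WstxOutputProperties.P_OUTPUT_INVALID_CHAR_HANDLER,
  new InvalidCharHandler.ReplacingHandler('?'))
```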

eubnara commented 3 years ago

@srowen Thanks for the reply. In hex, it was just 00. If I write the same data as JSON, there is no such error. If I can't tell which row or column has the invalid characters, I have to check every row and column, which is inconvenient.

I think it would be helpful if there were an option to ignore or replace invalid characters.
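
As an aside, narrowing down the offending rows does not require eyeballing everything; here is a sketch using only the DataFrame API (the helper name and the `people` table are hypothetical):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

// Hypothetical helper: keep only rows where some string column contains a
// character that XML 1.0 forbids (C0 controls other than TAB, LF, CR).
def rowsWithInvalidXmlChars(df: DataFrame): DataFrame = {
  val hasInvalid = df.schema.fields.collect {
    case f if f.dataType == StringType =>
      col(f.name).rlike("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]")
  }
  if (hasInvalid.isEmpty) df.limit(0) else df.filter(hasInvalid.reduce(_ || _))
}

rowsWithInvalidXmlChars(spark.table("people")).show()
```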

srowen commented 3 years ago

XML is not JSON and has different rules. There might be a way to tell the writer to use XML 1.1 rules, which apparently has encodings for these characters. Or maybe a CDATA element is appropriate here, not sure. But what are you writing that has odd control codes in it, and are you sure that's expected? Just checking that you're not, say, trying to write some binary data here.
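
To make the XML 1.1 point concrete: the error message itself says that under XML 1.1 the 0x2 character "could output as a character entity". A hedged sketch with a bare StAX writer, assuming Woodstox honors the 1.1 declaration as that message suggests; spark-xml does not expose this, and NUL (0x0) remains invalid even in XML 1.1, so it would not help the data described below:

```scala
import java.io.StringWriter
import javax.xml.stream.XMLOutputFactory

// Sketch only: with Woodstox as the StAX provider, declaring the document
// as XML 1.1 should let 0x2 be emitted as the character entity &#2;.
val out = new StringWriter()
val writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out)
writer.writeStartDocument("UTF-8", "1.1") // XML 1.1 declaration
writer.writeStartElement("value")
writer.writeCharacters("a\u0002b") // expected to come out as a&#2;b
writer.writeEndElement()
writer.writeEndDocument()
writer.close()
println(out)
```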

eubnara commented 3 years ago

@srowen I read the data from MySQL through JDBC. There are some invalid characters in it, but I don't have permission to change the source data; I can only read it.

MySQL data -> DataFrame in Spark -> XML on HDFS

I'm writing the XML through the spark-xml library.

The invalid characters are not expected, and I just want to ignore them.

srowen commented 3 years ago

Are you sure it's text? I just want to make sure you're not trying to encode binary data, which will fail for other reasons. If you can't write these characters in XML, it sounds like you have to drop them, which is what you're doing. I think it's not wrong to say: do it in your own code, because that's a pretty unusual situation if you're correct.
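
Putting that advice into code, a sketch that drops XML-1.0-invalid characters from every string column before writing with spark-xml; the helper name, table, row tag, and output path are all placeholders:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.StringType

// Hypothetical helper: strip characters XML 1.0 cannot represent
// (C0 controls other than TAB, LF, CR) from all string columns.
def stripInvalidXmlChars(df: DataFrame): DataFrame =
  df.schema.fields.filter(_.dataType == StringType).foldLeft(df) { (d, f) =>
    d.withColumn(f.name,
      regexp_replace(col(f.name), "[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]", ""))
  }

stripInvalidXmlChars(spark.table("people"))
  .write
  .format("xml")              // spark-xml's short name
  .option("rowTag", "person") // placeholder row tag
  .save("hdfs:///tmp/people_xml")
```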

eubnara commented 3 years ago

@srowen I got it, thanks for the reply. It was not valid text; in hex it is 00, a NUL byte. (I don't know why the data producer generates this wrongly.) I just want my Spark application not to fail on invalid characters, the same as when writing other formats (e.g. plain text, JSON, or Parquet). If that would violate the XML standard, I understand and will just keep my workaround.

Again, thanks for your replies and advice.