databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

ignoreCorruptFiles and GZIP corrupted xml files #639

Closed: slavokx closed this issue 1 year ago

slavokx commented 1 year ago

Hello all, recently I have come across an issue when dealing with corrupted gzip files. It seems that it is not possible to skip or ignore them.

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
df = spark.read.format("com.databricks.spark.xml").option("rowTag", "data") \
        .option("mode", "PERMISSIVE")\
        .option("badRecordsPath", "/tmp/badRecordsPath").load("/input/data-with-corrupted-files/*.gz")
df.show()

always throws the error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 12) (x.x.x.x executor 0): java.io.IOException: incorrect header check

The files in question are indeed broken and the corruption is detected correctly; however, it seems it is not possible to ignore such files.

Is it somehow possible to ignore problematic files? If not, where would be the proper place to implement support for the expected behavior? XmlInputFormat.scala?
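
One possible workaround, not something spark-xml provides, is to validate the gzip files outside of the reader first and hand only the readable paths to spark-xml. A minimal sketch, assuming the input directory is also reachable from the driver under a hypothetical /dbfs/input/... mount:

import glob
import gzip
import zlib

def is_valid_gzip(path, chunk_size=1 << 20):
    """Return True only if the whole file decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False

# Driver-visible mirror of the input directory (hypothetical path).
all_files = glob.glob("/dbfs/input/data-with-corrupted-files/*.gz")
good_files = [p for p in all_files if is_valid_gzip(p)]

# load() also accepts a list of paths, so only the readable files get read.
df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "data")
      .load([p.replace("/dbfs", "", 1) for p in good_files]))

Checking millions of files serially on the driver would be slow, so in practice the integrity check itself could be distributed, e.g. by parallelizing the path list and collecting only the paths that pass.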

srowen commented 1 year ago

No, I don't think it's possible. That error happens at a pretty low level, and I suppose the idea is that it is definitely a fatal error, not just a few bad inputs, so it has to be fixed.
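
As a rough illustration, not from the thread, of how low-level the failure is: the "incorrect header check" message comes straight from zlib when the compressed stream's header is invalid, i.e. from the decompression codec underneath spark-xml's record reader rather than from the XML parsing itself.

import zlib

# Feeding non-compressed bytes to the inflater reproduces the same zlib
# message that the Hadoop gzip codec reports for a corrupted .gz file.
try:
    zlib.decompress(b"this is not compressed data")
except zlib.error as e:
    print(e)  # Error -3 while decompressing data: incorrect header check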

slavokx commented 1 year ago

Thanks for the quick response!

Personally, I would not consider it a fatal error. An example from my use case: in the source directory there are 1 million xml.gz files delivered from 1 million source systems, and one of them is corrupted. The whole processing run fails because of one problematic, malfunctioning source system.

My understanding is that the idea of ignoreCorruptFiles is to handle exactly such problems: if a Parquet file is broken, for example, it can be ignored. I was hoping for the same possibility with spark-xml and its XmlInputFormat.

Ideally the exception would be caught and reported via _corrupt_record (e.g. in the case where one record equals one file).
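
For reference, a rough sketch, not from the thread, of the Parquet behaviour mentioned above; it assumes a local-mode session so that driver-side Python file I/O sees the same /tmp directory Spark writes to, and the paths and file names are made up:

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# Write a small Parquet dataset, then drop a garbage file into the same directory.
spark.range(10).write.mode("overwrite").parquet("/tmp/ignore-corrupt-demo")
with open("/tmp/ignore-corrupt-demo/part-99999-broken.parquet", "wb") as f:
    f.write(b"definitely not parquet")

# With an explicit schema (so no footer-based inference), the built-in Parquet
# source logs a warning for the unreadable file and skips it.
print(spark.read.schema("id LONG").parquet("/tmp/ignore-corrupt-demo").count())  # 10

One likely reason the same setting has no effect here, though it is not confirmed in the thread, is that spark-xml reads files through its own XmlInputFormat/Hadoop RDD path rather than the file-source scan path where spark.sql.files.ignoreCorruptFiles is applied.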

srowen commented 1 year ago

Fair enough. I think the change, or the underlying issue, is deeper down in Spark; I don't know how to address it.