databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

parse XML without the default AttributePrefix "_" in PySpark #673

Closed schneifejan closed 8 months ago

schneifejan commented 8 months ago

Hi, I'm trying to create a DataFrame without the default attributePrefix "_".

Given the following XML sample data:

<?xml version="1.0" encoding="ISO-8859-1"?>

<OUT foo="bar">
    <bar>
        <derp>2017-12-14T09:13:13</derp>
        <example>myvalue</example>
        <tag>tag</tag>
    </bar>
</OUT>

If I create a Spark DataFrame without setting the attributePrefix option:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[2]") \
    .appName("test-spark-xml") \
    .config("spark.jars", ".spark-xml_2.12-0.17.0.jar") \
    .getOrCreate()

df = spark.read \
    .format("xml") \
    .option("rowTag", "OUT") \
    .option("charset", "iso-8859-1") \
    .load(path)

df.show(1, False)

the XML sample data is parsed as expected:

+----+-----------------------------------+
|_foo|bar                                |
+----+-----------------------------------+
|bar |{2017-12-14 10:13:13, myvalue, tag}|
+----+-----------------------------------+

However, if I create a Spark DataFrame with the attributePrefix option set to "":

# set the attribute prefix to an empty string
df = spark.read \
    .format("xml") \
    .option("rowTag", "OUT") \
    .option("AttributePrefix", "") \
    .option("charset", "iso-8859-1") \
    .load(path)

df.show(1, False)

the child tags of bar are all NULL:

+------------------+---+
|bar               |foo|
+------------------+---+
|{NULL, NULL, NULL}|bar|
+------------------+---+

Does anybody have an idea what I'm doing wrong? Any help is highly appreciated 👍

Kind regards, Felix

Versions

Python: 3.9
spark-xml: 0.17.0
pyspark: 3.5.0

srowen commented 8 months ago

I think attributePrefix == "" only works if the row has attributes and no child elements. With an empty prefix, the schema becomes ambiguous when the parser goes back to distinguish attributes from children. I would just leave this option unset and rename the attribute fields as you see fit.

It's probably hard but possible to 'fix' the behavior in the code, but this library is not developed anymore now that it's in Spark. I'd look at a PR, but otherwise I'm not going to change the behavior.
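
The workaround described above (keep the default prefix, then rename the attribute columns) could be sketched roughly as follows. The helper name strip_attribute_prefix is hypothetical and not part of spark-xml or PySpark. It relies only on the columns attribute and the withColumnRenamed method that PySpark DataFrames provide, and it renames top-level columns only, leaving nested struct fields untouched. Note that it treats every column starting with the prefix as an attribute, so a column such as the default value tag _VALUE would be renamed as well.

```python
def strip_attribute_prefix(df, prefix="_"):
    """Rename top-level columns that start with `prefix`, dropping the prefix.

    Hypothetical helper: works on any object exposing `columns` and
    `withColumnRenamed`, such as a PySpark DataFrame.
    """
    taken = set(df.columns)
    for name in df.columns:
        new_name = name[len(prefix):]
        # Skip columns without the prefix, names that would become empty,
        # and names that would collide with an existing column.
        if not name.startswith(prefix) or not new_name or new_name in taken:
            continue
        df = df.withColumnRenamed(name, new_name)
        taken.discard(name)
        taken.add(new_name)
    return df


# Usage with the DataFrame from the first example (default prefix "_"):
# df = strip_attribute_prefix(df)   # column _foo becomes foo, bar is unchanged
```

Skipping a rename on a name collision keeps the prefixed column instead of silently producing two columns with the same name.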

schneifejan commented 8 months ago

Thanks @srowen for explaining the behavior and the suggested workaround. As the library is not developed anymore, I'd refrain from working on a PR. I'll close the issue then :)