parse XML without the default AttributePrefix "_" in PySpark

schneifejan commented 8 months ago

Hi, I'm trying to create a DataFrame without the default AttributePrefix "_".

Given the following XML sample data:

<?xml version="1.0" encoding="ISO-8859-1"?>

<OUT foo="bar">
    <bar>
        <derp>2017-12-14T09:13:13</derp>
        <example>myvalue</example>
        <tag>tag</tag>
    </bar>
</OUT>

If I create a Spark DataFrame without the AttributePrefix option:

from pyspark.sql import SparkSession

# local session with the spark-xml jar on the classpath
spark = SparkSession.builder \
    .master("local[2]") \
    .appName("test-spark-xml") \
    .config("spark.jars", ".spark-xml_2.12-0.17.0.jar") \
    .getOrCreate()

df = spark.read \
    .format("xml") \
    .option("rowTag", "OUT") \
    .option("charset", "iso-8859-1") \
    .load(path)

df.show(1, False)

the XML sample data is parsed as expected:

+----+-----------------------------------+
|_foo|bar                                |
+----+-----------------------------------+
|bar |{2017-12-14 10:13:13, myvalue, tag}|
+----+-----------------------------------+

However, if I create a Spark DataFrame with the AttributePrefix option set to "":

# set the attribute prefix to an empty string
df = spark.read \
    .format("xml") \
    .option("rowTag", "OUT") \
    .option("AttributePrefix", "") \
    .option("charset", "iso-8859-1") \
    .load(path)

df.show(1, False)

the child tags of bar are all NULL:

+------------------+---+
|bar               |foo|
+------------------+---+
|{NULL, NULL, NULL}|bar|
+------------------+---+

Does anybody have an idea what I'm doing wrong? Any help is highly appreciated 👍

Kind regards Felix

versions

Python: 3.9
spark-xml: 0.17.0
pyspark: 3.5.0

srowen commented 8 months ago

I think attributePrefix == "" only works if the row has only attributes and no children. With an empty prefix, the schema becomes ambiguous when the reader goes back to tell attributes from children. I would just not set this option, and rename attribute fields as you see fit.

It's probably hard but possible to 'fix' the behavior in the code, but this library is not developed anymore now that it's in Spark. I'd look at a PR, but otherwise I'm not going to change the behavior.
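
For reference, the renaming workaround suggested above could look roughly like the sketch below. It assumes the DataFrame was read with the default "_" prefix and that only top-level columns need renaming; attributes nested inside struct columns are not touched. The helper names unprefixed_names and rename_attribute_columns are made up for illustration and are not part of spark-xml or PySpark.

```python
def unprefixed_names(columns, prefix="_"):
    """Return the column names with `prefix` stripped wherever that is safe."""
    existing = set(columns)
    new_names = []
    for name in columns:
        stripped = name[len(prefix):] if prefix and name.startswith(prefix) else name
        # Keep the original name if stripping would give an empty name
        # or clash with a column that already exists (e.g. "_bar" next to "bar").
        if stripped and stripped not in existing:
            new_names.append(stripped)
        else:
            new_names.append(name)
    return new_names


def rename_attribute_columns(df, prefix="_"):
    """Rename top-level columns of a DataFrame (hypothetical helper)."""
    # DataFrame.toDF(*names) returns a new DataFrame with columns renamed by position.
    return df.toDF(*unprefixed_names(df.columns, prefix))
```

With the sample data above, rename_attribute_columns(df) would turn the columns _foo, bar into foo, bar while leaving the parsed child values of bar intact. A name is left prefixed whenever stripping it would collide with an existing column, which is the same ambiguity that makes the empty prefix problematic in the first place.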

schneifejan commented 8 months ago

Thanks @srowen for explaining the behavior and suggesting the workaround. As the library is not developed anymore, I'll refrain from working on a PR. I'll close the issue then :)

databricks / spark-xml

parse XML without the default AttributePrefix "_" in PySpark #673
