databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

ignoreSurroundingSpaces not working - Pyspark #670

Closed. DeemoONeill closed this 9 months ago

DeemoONeill commented 9 months ago

I've given Spark-XML the ignoreSurroundingSpaces option, and the deprecated withIgnoreSurroundingSpaces option, and both fail to strip whitespace.

df = (
spark.read.format("xml")
.options(rowTag="ns:Activity", ignoreNamespace=True, withIgnoreSurroundingSpaces=True)
.load("file.xml")
)

with the XML looking like:

<?xml version="1.0" encoding="utf-8"?>
<ns:Extract>
  <ns:Activity>
    <ns:OrgId>   Org    </ns:OrgId>
    <ns:Dep>   DEP  </ns:Dep>
  </ns:Activity>
</ns:Extract>

giving the DataFrame:

+--------+----------+
|     Dep|     OrgId|
+--------+----------+
|   DEP  |   Org    |
+--------+----------+

with spaces still in it.

I can run a trim on this, but I do have nested fields that I would need to do an explode > trim > collectList on (sketched below).
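
A minimal sketch of that manual workaround, for reference. The nested array column Codes is hypothetical (the example file above only has the flat OrgId and Dep fields), so a small toy DataFrame stands in for the parsed XML:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy frame standing in for the parsed XML: two flat string columns plus a nested array column.
df = spark.createDataFrame(
    [("   Org    ", "   DEP  ", ["  a ", " b  "])],
    ["OrgId", "Dep", "Codes"],
)

# Flat columns: a plain trim is enough.
df = df.withColumn("OrgId", F.trim("OrgId")).withColumn("Dep", F.trim("Dep"))

# Nested array column, option 1: explode > trim > collect_list, regrouping on the flat keys.
regrouped = (
    df.withColumn("Code", F.explode("Codes"))
    .withColumn("Code", F.trim("Code"))
    .groupBy("OrgId", "Dep")
    .agg(F.collect_list("Code").alias("Codes"))
)

# Option 2 (Spark 3.1+): trim each array element in place with a higher-order function,
# avoiding the explode/regroup round trip.
transformed = df.withColumn("Codes", F.transform("Codes", lambda c: F.trim(c)))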

Looking at the source code, it looks like this should trim each cell, so I'm not sure what's going on here. Is it an interaction between ignoreNamespace and rowTag, perhaps?

DeemoONeill commented 9 months ago

Versions com.databricks:spark-xml_2.12:0.16.0 and io.delta:delta-core_2.12:1.2.1, for reference.

srowen commented 9 months ago

The parameter is called ignoreSurroundingSpaces, not withIgnoreSurroundingSpaces.

DeemoONeill commented 9 months ago

I've done both. ignoreSurroundingSpaces doesn't work either.

srowen commented 9 months ago

I can't reproduce that; it works as expected for me on your example. Double-check and show your actual code that definitely uses ignoreSurroundingSpaces.

DeemoONeill commented 9 months ago

from pyspark.sql import SparkSession

with open("file.xml", "w") as f:
    f.write("""<?xml version="1.0" encoding="utf-8"?>
<ns:Extract>
  <ns:Activity>
    <ns:OrgId>   Org    </ns:OrgId>
    <ns:Dep>   DEP  </ns:Dep>
  </ns:Activity>
</ns:Extract>""")

spark = SparkSession.builder.config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.16.0").getOrCreate()

df = spark.read.format("xml").options(rowTag="ns:Activity", ignoreNamespace=True, ignoreSurroundingSpaces=True).load("file.xml")

df.show()
+--------+----------+
|     Dep|     OrgId|
+--------+----------+
|   DEP  |   Org    |
+--------+----------+

DeemoONeill commented 9 months ago

Ahh, just seen https://github.com/databricks/spark-xml/pull/637. Are all the ignoreSurroundingSpaces options broken in 0.16.0?

srowen commented 9 months ago

Oh, this is 0.16.0. This was already fixed in https://github.com/databricks/spark-xml/issues/636. Try 0.17.0; that is working.
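
For reference, a sketch of picking up the fix: swap the package coordinate in the repro above (assuming the platform allows changing spark.jars.packages):

from pyspark.sql import SparkSession

# Same repro as above, but pulling spark-xml 0.17.0, which includes the ignoreSurroundingSpaces fix.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0")
    .getOrCreate()
)

df = (
    spark.read.format("xml")
    .options(rowTag="ns:Activity", ignoreNamespace=True, ignoreSurroundingSpaces=True)
    .load("file.xml")
)
df.show()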

DeemoONeill commented 9 months ago

Ahh, I don't think I can upgrade the version on our platform. Looks like I'll have to keep doing the manual trim until they pull their finger out.
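
A minimal sketch of that manual trim while stuck on 0.16.0, applied to the df from the 0.16.0 repro above and assuming only top-level string columns need cleaning: walk the schema and trim every StringType column.

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Trim every top-level StringType column; other types pass through unchanged.
trimmed = df.select(
    [
        F.trim(F.col(f.name)).alias(f.name)
        if isinstance(f.dataType, StringType)
        else F.col(f.name)
        for f in df.schema.fields
    ]
)
trimmed.show()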