databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

attributes are ignored when specifying a schema #375

Closed Jeongmin-Lee closed 4 years ago

Jeongmin-Lee commented 5 years ago

I want to get nested XML as a String, but attributes are ignored. Here is a sample XML:

<record>
  <title>hihihihi &amp; tttt  </title>
  <info type="text" word="apps">
    <test1>test_</test1>
    <test2>tttt</test2>
    <test3 ab="test" xvf="sample"><aa>sample</aa></test3>
  </info>
</record>

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Array(
  StructField("title", StringType),
  StructField("info", StringType)
))

val sample = spark.read.format("com.databricks.spark.xml").option("rowTag", "record").schema(schema).load("temp/sample.xml")

sample.show(false)

Result:

+-----------------+---------------------------------------------------------------------+
|title            |info                                                                 |
+-----------------+---------------------------------------------------------------------+
|hihihihi & tttt  |<test1>test_</test1><test2>tttt</test2><test3><aa>sample</aa></test3>|
+-----------------+---------------------------------------------------------------------+

Expected value of info:

<test1>test_</test1><test2>tttt</test2><test3 ab="test" xvf="sample"><aa>sample</aa></test3>

How can I get the complete nested XML text, including the attributes?

srowen commented 5 years ago

Duplicate of #340, or related, I think.

srowen commented 5 years ago

Oh, aren't you missing the attributes from the schema you provided? The result looks correct given the schema.

Jeongmin-Lee commented 5 years ago

Specifying the full schema is not a problem. However, my goal is to extract the nested XML text. To do this, I specified a schema as shown below.

...
StructField("info", StringType)
...

But when extracting the XML text nested in the info tag, the attributes of the test3 tag are not visible.

Expected:

<test3 ab="test" xvf="sample"><aa>sample</aa></test3>

Actual result:

<test3><aa>sample</aa></test3>

srowen commented 5 years ago

Try inferring the schema to see what schema would correspond to extracting the attribute values. You don't have schema elements for those attributes and I think that's the issue.
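As a rough illustration of the suggestion above, schema inference could be run on the sample file like this. It is a sketch only: it assumes the spark-xml package is on the classpath, reuses the `temp/sample.xml` path and `record` row tag from the original post, and the object name, app name, and `local[*]` master are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object InferSchemaSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumption, not from the thread).
    val spark = SparkSession.builder()
      .appName("infer-xml-schema")
      .master("local[*]")
      .getOrCreate()

    // No .schema(...) call here, so spark-xml infers the schema from the data.
    val inferred = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")
      .load("temp/sample.xml")

    // Attributes appear as fields carrying the attribute prefix ("_" by default).
    inferred.printSchema()

    spark.stop()
  }
}
```

The printed schema shows which struct fields a hand-written schema would need in order to capture the attribute values.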

Magudeswaran-R commented 5 years ago

Hi team, I am also facing the same issue. The behavior @Jeongmin-Lee expected worked in the older version. It would be very helpful if there is any other way to fix it.

Update:

sample

The 0.4.1 release gave the above result when the XML is parsed.

tolomaus commented 4 years ago

Hi,

I encountered the same problem after migrating from 0.4.1 to 0.9.0.

@srowen just to clarify the issue: my xml files contain "dynamic" tags like zipCode_1000:

<address>
  <zipCode_1000 ab="test">1000</zipCode_1000>
</address>

So instead of adding a line to the schema for each possible zip code, I specify the parent tag address as a string:

{
  "type": "struct",
  "fields": [
    {
      "name": "address",
      "type": "string",
      "nullable": true,
      "metadata": {}
    }
  ]
}

Then I parse the string that contains the xml structure manually.
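For readers wondering what such manual parsing might look like, here is a minimal sketch using only the JDK's DOM parser. It assumes the string column holds the children of address without the enclosing tag, as in the examples in this thread. `AddressParser` and `Entry` are hypothetical names, not part of spark-xml.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element
import org.xml.sax.InputSource

object AddressParser {
  /** One dynamic child of <address>: tag name, attributes, and text content. */
  final case class Entry(tag: String, attributes: Map[String, String], text: String)

  def parse(addressXml: String): Seq[Entry] = {
    // The column holds a fragment (possibly several siblings), so wrap it in a root.
    val wrapped = s"<root>$addressXml</root>"
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(new InputSource(new StringReader(wrapped)))
    val children = doc.getDocumentElement.getChildNodes

    (0 until children.getLength)
      .map(i => children.item(i))
      .collect { case e: Element =>
        // Copy every attribute of the element into an immutable map.
        val attrs = e.getAttributes
        val attrMap = (0 until attrs.getLength).map { i =>
          val a = attrs.item(i)
          a.getNodeName -> a.getNodeValue
        }.toMap
        Entry(e.getTagName, attrMap, e.getTextContent)
      }
  }
}
```

This only works as intended if the string still carries the attributes, which is exactly what changed between the two versions.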

The problem after migrating is that the attributes of the zipCode_1000 tag are lost in the string value, so the address field now contains the string <zipCode_1000>1000</zipCode_1000> instead of <zipCode_1000 ab="test">1000</zipCode_1000>.

srowen commented 4 years ago

I don't understand that. You would have to have a struct called zipCode_1000 with fields ab and _VALUE, not what you describe here. Try inferring the schema on your example to see it.
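For reference, the struct described above might be written as follows in Scala. This is a sketch that assumes spark-xml's default `attributePrefix` of `_` (so the attribute `ab` becomes the field `_ab`) and the default `valueTag` of `_VALUE`. The object and value names are illustrative only.

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ZipCodeSchemaSketch {
  // One struct per dynamic tag: the attribute and the element text become sibling fields.
  val zipCode1000: StructType = StructType(Seq(
    StructField("_ab", StringType, nullable = true),    // attribute ab="test"
    StructField("_VALUE", StringType, nullable = true)  // element text "1000"
  ))

  // The address field is a struct with one field per zipCode_* tag.
  val schema: StructType = StructType(Seq(
    StructField("address", StructType(Seq(
      StructField("zipCode_1000", zipCode1000, nullable = true)
    )), nullable = true)
  ))
}
```

A schema of this shape needs one such entry for every distinct tag name, which becomes unwieldy when the tag names are generated from data.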

tolomaus commented 4 years ago

In theory you are correct.

But to avoid the need for a schema with thousands of entries (zipCode_1000, zipCode_1001, etc.), I simply used 'string' instead of 'struct' as the type of the address tag. Kind of hacky, I admit, but it worked fine in 0.4.1: the whole XML structure inside the address tag, including the attributes, was returned as a string. In 0.9.0 the attributes are missing.

srowen commented 4 years ago

OK I get it, you want the XML back as a string. Now I see what the issue is, it's not what I thought, and I see what's happening. The parser kind of has to traverse the XML DOM, and when it realizes that you want some of it back as a string it reconstructs it. It misses the attributes in this case. I can fix that.
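To make the failure mode concrete, here is a small sketch of rebuilding an element's inner XML from StAX events using only the JDK. It is not spark-xml's actual code, and `InnerXmlSketch` is a hypothetical name. It ignores namespace prefixes, comments, and any text directly under the root.

```scala
import java.io.StringReader
import javax.xml.stream.XMLInputFactory
import javax.xml.stream.events.{Attribute, StartElement}

object InnerXmlSketch {
  private def escape(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace("\"", "&quot;")

  private def startTag(el: StartElement): String = {
    val sb = new StringBuilder("<").append(el.getName.getLocalPart)
    // Leaving out this loop reproduces the symptom: tags come back without attributes.
    val attrs = el.getAttributes
    while (attrs.hasNext) {
      val a = attrs.next().asInstanceOf[Attribute]
      sb.append(' ').append(a.getName.getLocalPart)
        .append("=\"").append(escape(a.getValue)).append('"')
    }
    sb.append('>').toString
  }

  /** Re-serializes the child elements of the root element of `xml`. */
  def innerXml(xml: String): String = {
    val reader = XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml))
    val out = new StringBuilder
    var depth = 0
    while (reader.hasNext) {
      val event = reader.nextEvent()
      if (event.isStartElement) {
        depth += 1
        // depth 1 is the root itself, which is not part of the inner XML.
        if (depth > 1) out.append(startTag(event.asStartElement))
      } else if (event.isEndElement) {
        if (depth > 1) out.append("</").append(event.asEndElement.getName.getLocalPart).append('>')
        depth -= 1
      } else if (event.isCharacters && depth > 1) {
        // Text directly under the root (here only indentation) is skipped.
        out.append(escape(event.asCharacters.getData))
      }
    }
    out.toString
  }
}
```

With the attribute loop in place, calling `InnerXmlSketch.innerXml` on the address example from this thread should return the zipCode_1000 element with its `ab` attribute intact.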

srowen commented 4 years ago

See https://github.com/databricks/spark-xml/pull/469