databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0
500 stars 226 forks source link

Failure to Parse xml file with Error "Found duplicate column(s) in the data schema: `_value`" #606

Closed brcopeland closed 1 year ago

brcopeland commented 1 year ago

I am attempting to parse data using pyspark from UniProt (small example file available here) which has a schema here.

My command with error stack trace is:

p53 = spark.read.format("xml").option("rowValidationXSDPath", "uniprot.xsd").option("rowTag", "entry").load("P04637.xml")
---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-29-bc4ea9d5bfce> in <module>
----> 1 p53 = spark.read.format("xml").option("rowValidationXSDPath", "uniprot.xsd").option("rowTag", "entry").load("P04637.xml")

~/miniconda3/lib/python3.9/site-packages/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    202         self.options(**options)
    203         if isinstance(path, str):
--> 204             return self._df(self._jreader.load(path))
    205         elif path is not None:
    206             if type(path) != list:

~/miniconda3/lib/python3.9/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306

~/miniconda3/lib/python3.9/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
    115                 # Hide where the exception came from that shows a non-Pythonic
    116                 # JVM exception message.
--> 117                 raise converted from None
    118             else:
    119                 raise

AnalysisException: Found duplicate column(s) in the data schema: `_value`

Any suggestions on what I can do to either fix or debug this?

srowen commented 1 year ago

You'll have to show the XML!

brcopeland commented 1 year ago

I used the files in the links I provided; the first is the XML and the second is the XSD.

srowen commented 1 year ago

Ah sorry I read right past that. My initial guess is that you are getting a field called "_value" in two ways. One, that is the default column name used to represent the value inside elements with no child XML, but a string inside, and also containing attributes. The second is from 'value=' attributes, which map to a col called "_value".

To test, try just setting valueTag to something else like _tag_value or whatever you want, see if that disambiguates

brcopeland commented 1 year ago

Thank you for your help; I thought it might have something to do with that but had been confused (still am, new to XML). Here's what I get now:

p53 = spark.read.format("xml").option("rowValidationXSDPath", "uniprot.xsd").option("rowTag", "entry").option("valueTag", "_tag_value").load("P04637.xml")

p53.printSchema()
root
 |-- _created: date (nullable = true)
 |-- _dataset: string (nullable = true)
 |-- _modified: date (nullable = true)
 |-- _version: long (nullable = true)
 |-- _xmlns: string (nullable = true)
 |-- accession: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- comment: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _evidence: string (nullable = true)
 |    |    |-- _name: string (nullable = true)
 |    |    |-- _type: string (nullable = true)
 |    |    |-- cofactor: struct (nullable = true)
 |    |    |    |-- _evidence: string (nullable = true)
 |    |    |    |-- dbReference: struct (nullable = true)
 |    |    |    |    |-- _id: string (nullable = true)
 |    |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |    |    |-- _type: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |-- disease: struct (nullable = true)
 |    |    |    |-- _id: string (nullable = true)
 |    |    |    |-- acronym: string (nullable = true)
 |    |    |    |-- dbReference: struct (nullable = true)
 |    |    |    |    |-- _id: long (nullable = true)
 |    |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |    |    |-- _type: string (nullable = true)
 |    |    |    |-- description: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |-- event: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |    |    |-- _type: string (nullable = true)
 |    |    |-- experiments: long (nullable = true)
 |    |    |-- interactant: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- _intactId: string (nullable = true)
 |    |    |    |    |-- dbReference: struct (nullable = true)
 |    |    |    |    |    |-- _id: string (nullable = true)
 |    |    |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |    |    |    |-- _type: string (nullable = true)
 |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |-- label: string (nullable = true)
 |    |    |-- isoform: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |-- name: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- sequence: struct (nullable = true)
 |    |    |    |    |    |-- _ref: string (nullable = true)
 |    |    |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |    |    |    |-- _type: string (nullable = true)
 |    |    |-- link: struct (nullable = true)
 |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |    |-- _uri: string (nullable = true)
 |    |    |-- molecule: string (nullable = true)
 |    |    |-- organismsDiffer: boolean (nullable = true)
 |    |    |-- subcellularLocation: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- location: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- _evidence: string (nullable = true)
 |    |    |    |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |-- text: struct (nullable = true)
 |    |    |    |-- _evidence: string (nullable = true)
 |    |    |    |-- _tag_value: string (nullable = true)
 |-- dbReference: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- _tag_value: string (nullable = true)
 |    |    |-- _type: string (nullable = true)
 |    |    |-- molecule: struct (nullable = true)
 |    |    |    |-- _id: string (nullable = true)
 |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |-- property: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |    |    |-- _type: string (nullable = true)
 |    |    |    |    |-- _value: string (nullable = true)
 |-- evidence: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _key: long (nullable = true)
 |    |    |-- _tag_value: string (nullable = true)
 |    |    |-- _type: string (nullable = true)
 |    |    |-- source: struct (nullable = true)
 |    |    |    |-- _ref: long (nullable = true)
 |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |    |-- dbReference: struct (nullable = true)
 |    |    |    |    |-- _id: string (nullable = true)
 |    |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |    |    |-- _type: string (nullable = true)
 |-- feature: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _description: string (nullable = true)
 |    |    |-- _evidence: string (nullable = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- _type: string (nullable = true)
 |    |    |-- ligand: struct (nullable = true)
 |    |    |    |-- dbReference: struct (nullable = true)
 |    |    |    |    |-- _id: string (nullable = true)
 |    |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |    |    |-- _type: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |-- location: struct (nullable = true)
 |    |    |    |-- begin: struct (nullable = true)
 |    |    |    |    |-- _position: long (nullable = true)
 |    |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |    |-- end: struct (nullable = true)
 |    |    |    |    |-- _position: long (nullable = true)
 |    |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |    |-- position: struct (nullable = true)
 |    |    |    |    |-- _position: long (nullable = true)
 |    |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |-- original: string (nullable = true)
 |    |    |-- variation: string (nullable = true)
 |-- gene: struct (nullable = true)
 |    |-- name: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |    |-- _type: string (nullable = true)
 |-- keyword: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- _tag_value: string (nullable = true)
 |-- name: string (nullable = true)
 |-- organism: struct (nullable = true)
 |    |-- dbReference: struct (nullable = true)
 |    |    |-- _id: long (nullable = true)
 |    |    |-- _tag_value: string (nullable = true)
 |    |    |-- _type: string (nullable = true)
 |    |-- lineage: struct (nullable = true)
 |    |    |-- taxon: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |-- name: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |    |-- _type: string (nullable = true)
 |-- protein: struct (nullable = true)
 |    |-- alternativeName: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- fullName: string (nullable = true)
 |    |-- recommendedName: struct (nullable = true)
 |    |    |-- fullName: string (nullable = true)
 |-- proteinExistence: struct (nullable = true)
 |    |-- _tag_value: string (nullable = true)
 |    |-- _type: string (nullable = true)
 |-- reference: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _evidence: long (nullable = true)
 |    |    |-- _key: long (nullable = true)
 |    |    |-- citation: struct (nullable = true)
 |    |    |    |-- _date: string (nullable = true)
 |    |    |    |-- _db: string (nullable = true)
 |    |    |    |-- _first: string (nullable = true)
 |    |    |    |-- _last: string (nullable = true)
 |    |    |    |-- _name: string (nullable = true)
 |    |    |    |-- _type: string (nullable = true)
 |    |    |    |-- _volume: long (nullable = true)
 |    |    |    |-- authorList: struct (nullable = true)
 |    |    |    |    |-- consortium: struct (nullable = true)
 |    |    |    |    |    |-- _name: string (nullable = true)
 |    |    |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |    |    |-- person: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- _name: string (nullable = true)
 |    |    |    |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |    |-- dbReference: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- _id: string (nullable = true)
 |    |    |    |    |    |-- _tag_value: string (nullable = true)
 |    |    |    |    |    |-- _type: string (nullable = true)
 |    |    |    |-- title: string (nullable = true)
 |    |    |-- scope: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- source: struct (nullable = true)
 |    |    |    |-- tissue: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |-- sequence: struct (nullable = true)
 |    |-- _checksum: string (nullable = true)
 |    |-- _length: long (nullable = true)
 |    |-- _mass: long (nullable = true)
 |    |-- _modified: date (nullable = true)
 |    |-- _tag_value: string (nullable = true)
 |    |-- _version: long (nullable = true)

So the problem appears to be resolved; I see that dbReference.property.element has both _tag_value and _value. Just trying to understand the nature of the problem; there are e.g. dbReference records in the XML like:

<dbReference type="PDB" id="7RM4">
    <property type="method" value="X-ray"/>
    <property type="resolution" value="3.33 A"/>
    <property type="chains" value="C/H/M/R=168-176"/>
  </dbReference>

so that from the tag value="X-ray" this gets assigned to column _value with value "X-ray" and because it doesn't have a child its value, the empty string, it also attempts to assign the empty string to _value?

srowen commented 1 year ago

Yes it's the <property ... value=...> that triggers this. It looks like every property tag is empty, so the _tag_value will always be null, I'd guess, not the empty string. Arguably it should realize that the tag is always empty, at least in the input it infers on, and then not try to put a _value col in the schema that will never get filled in. I could go either way on it; the case where it causes a problem (i.e. here) is rare, and can be worked around. But in this case the col is superfluous.

brcopeland commented 1 year ago

Ok thanks again for your help that clarifies things. Now onward to actually using the data.