Closed brcopeland closed 1 year ago
You'll have to show the XML!
I used the files in the links I provided; the first is the XML and the second is the XSD.
Ah sorry I read right past that. My initial guess is that you are getting a field called "_value" in two ways. One, that is the default column name used to represent the value inside elements with no child XML, but a string inside, and also containing attributes. The second is from 'value=' attributes, which map to a col called "_value".
To test, try just setting valueTag
to something else like _tag_value
or whatever you want, see if that disambiguates
Thank you for your help; I thought it might have something to do with that but had been confused (still am, new to XML). Here's what I get now:
p53 = spark.read.format("xml").option("rowValidationXSDPath", "uniprot.xsd").option("rowTag", "entry").option("valueTag", "_tag_value").load("P04637.xml")
p53.printSchema()
root
|-- _created: date (nullable = true)
|-- _dataset: string (nullable = true)
|-- _modified: date (nullable = true)
|-- _version: long (nullable = true)
|-- _xmlns: string (nullable = true)
|-- accession: array (nullable = true)
| |-- element: string (containsNull = true)
|-- comment: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _evidence: string (nullable = true)
| | |-- _name: string (nullable = true)
| | |-- _type: string (nullable = true)
| | |-- cofactor: struct (nullable = true)
| | | |-- _evidence: string (nullable = true)
| | | |-- dbReference: struct (nullable = true)
| | | | |-- _id: string (nullable = true)
| | | | |-- _tag_value: string (nullable = true)
| | | | |-- _type: string (nullable = true)
| | | |-- name: string (nullable = true)
| | |-- disease: struct (nullable = true)
| | | |-- _id: string (nullable = true)
| | | |-- acronym: string (nullable = true)
| | | |-- dbReference: struct (nullable = true)
| | | | |-- _id: long (nullable = true)
| | | | |-- _tag_value: string (nullable = true)
| | | | |-- _type: string (nullable = true)
| | | |-- description: string (nullable = true)
| | | |-- name: string (nullable = true)
| | |-- event: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- _tag_value: string (nullable = true)
| | | | |-- _type: string (nullable = true)
| | |-- experiments: long (nullable = true)
| | |-- interactant: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- _intactId: string (nullable = true)
| | | | |-- dbReference: struct (nullable = true)
| | | | | |-- _id: string (nullable = true)
| | | | | |-- _tag_value: string (nullable = true)
| | | | | |-- _type: string (nullable = true)
| | | | |-- id: string (nullable = true)
| | | | |-- label: string (nullable = true)
| | |-- isoform: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- id: string (nullable = true)
| | | | |-- name: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | | |-- sequence: struct (nullable = true)
| | | | | |-- _ref: string (nullable = true)
| | | | | |-- _tag_value: string (nullable = true)
| | | | | |-- _type: string (nullable = true)
| | |-- link: struct (nullable = true)
| | | |-- _tag_value: string (nullable = true)
| | | |-- _uri: string (nullable = true)
| | |-- molecule: string (nullable = true)
| | |-- organismsDiffer: boolean (nullable = true)
| | |-- subcellularLocation: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- location: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- _evidence: string (nullable = true)
| | | | | | |-- _tag_value: string (nullable = true)
| | |-- text: struct (nullable = true)
| | | |-- _evidence: string (nullable = true)
| | | |-- _tag_value: string (nullable = true)
|-- dbReference: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _id: string (nullable = true)
| | |-- _tag_value: string (nullable = true)
| | |-- _type: string (nullable = true)
| | |-- molecule: struct (nullable = true)
| | | |-- _id: string (nullable = true)
| | | |-- _tag_value: string (nullable = true)
| | |-- property: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- _tag_value: string (nullable = true)
| | | | |-- _type: string (nullable = true)
| | | | |-- _value: string (nullable = true)
|-- evidence: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _key: long (nullable = true)
| | |-- _tag_value: string (nullable = true)
| | |-- _type: string (nullable = true)
| | |-- source: struct (nullable = true)
| | | |-- _ref: long (nullable = true)
| | | |-- _tag_value: string (nullable = true)
| | | |-- dbReference: struct (nullable = true)
| | | | |-- _id: string (nullable = true)
| | | | |-- _tag_value: string (nullable = true)
| | | | |-- _type: string (nullable = true)
|-- feature: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _description: string (nullable = true)
| | |-- _evidence: string (nullable = true)
| | |-- _id: string (nullable = true)
| | |-- _type: string (nullable = true)
| | |-- ligand: struct (nullable = true)
| | | |-- dbReference: struct (nullable = true)
| | | | |-- _id: string (nullable = true)
| | | | |-- _tag_value: string (nullable = true)
| | | | |-- _type: string (nullable = true)
| | | |-- name: string (nullable = true)
| | |-- location: struct (nullable = true)
| | | |-- begin: struct (nullable = true)
| | | | |-- _position: long (nullable = true)
| | | | |-- _tag_value: string (nullable = true)
| | | |-- end: struct (nullable = true)
| | | | |-- _position: long (nullable = true)
| | | | |-- _tag_value: string (nullable = true)
| | | |-- position: struct (nullable = true)
| | | | |-- _position: long (nullable = true)
| | | | |-- _tag_value: string (nullable = true)
| | |-- original: string (nullable = true)
| | |-- variation: string (nullable = true)
|-- gene: struct (nullable = true)
| |-- name: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- _tag_value: string (nullable = true)
| | | |-- _type: string (nullable = true)
|-- keyword: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _id: string (nullable = true)
| | |-- _tag_value: string (nullable = true)
|-- name: string (nullable = true)
|-- organism: struct (nullable = true)
| |-- dbReference: struct (nullable = true)
| | |-- _id: long (nullable = true)
| | |-- _tag_value: string (nullable = true)
| | |-- _type: string (nullable = true)
| |-- lineage: struct (nullable = true)
| | |-- taxon: array (nullable = true)
| | | |-- element: string (containsNull = true)
| |-- name: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- _tag_value: string (nullable = true)
| | | |-- _type: string (nullable = true)
|-- protein: struct (nullable = true)
| |-- alternativeName: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- fullName: string (nullable = true)
| |-- recommendedName: struct (nullable = true)
| | |-- fullName: string (nullable = true)
|-- proteinExistence: struct (nullable = true)
| |-- _tag_value: string (nullable = true)
| |-- _type: string (nullable = true)
|-- reference: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _evidence: long (nullable = true)
| | |-- _key: long (nullable = true)
| | |-- citation: struct (nullable = true)
| | | |-- _date: string (nullable = true)
| | | |-- _db: string (nullable = true)
| | | |-- _first: string (nullable = true)
| | | |-- _last: string (nullable = true)
| | | |-- _name: string (nullable = true)
| | | |-- _type: string (nullable = true)
| | | |-- _volume: long (nullable = true)
| | | |-- authorList: struct (nullable = true)
| | | | |-- consortium: struct (nullable = true)
| | | | | |-- _name: string (nullable = true)
| | | | | |-- _tag_value: string (nullable = true)
| | | | |-- person: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- _name: string (nullable = true)
| | | | | | |-- _tag_value: string (nullable = true)
| | | |-- dbReference: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- _id: string (nullable = true)
| | | | | |-- _tag_value: string (nullable = true)
| | | | | |-- _type: string (nullable = true)
| | | |-- title: string (nullable = true)
| | |-- scope: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- source: struct (nullable = true)
| | | |-- tissue: array (nullable = true)
| | | | |-- element: string (containsNull = true)
|-- sequence: struct (nullable = true)
| |-- _checksum: string (nullable = true)
| |-- _length: long (nullable = true)
| |-- _mass: long (nullable = true)
| |-- _modified: date (nullable = true)
| |-- _tag_value: string (nullable = true)
| |-- _version: long (nullable = true)
So the problem appears to be resolved; I see that dbReference.property.element
has both _tag_value
and _value
. Just trying to understand the nature of the problem; there are e.g. dbReference
records in the XML like:
<dbReference type="PDB" id="7RM4">
<property type="method" value="X-ray"/>
<property type="resolution" value="3.33 A"/>
<property type="chains" value="C/H/M/R=168-176"/>
</dbReference>
so that from the tag value="X-ray"
this gets assigned to column _value
with value "X-ray"
and because it doesn't have a child its value, the empty string, it also attempts to assign the empty string to _value
?
Yes it's the <property ... value=...>
that triggers this. It looks like every property
tag is empty, so the _tag_value
will always be null, I'd guess, not the empty string. Arguably it should realize that the tag is always empty, at least in the input it infers on, and then not try to put a _value
col in the schema that will never get filled in. I could go either way on it; the case where it causes a problem (i.e. here) is rare, and can be worked around. But in this case the col is superfluous.
Ok thanks again for your help that clarifies things. Now onward to actually using the data.
I am attempting to parse data using pyspark from UniProt (small example file available here) which has a schema here.
My command with error stack trace is:
Any suggestions on what I can do to either fix or debug this?