Closed theferrit32 closed 11 months ago
Based on discussion with @larrybabb, we are reliant on the $ syntax created by the Scala codebase, so we need to make sure our serialized JSON form of XML documents is in the same format.
The scala clinvar-ingest codebase is using this XML library:
xmltodict docs relevant to this: https://xmltodict.readthedocs.io/en/stable/README/#roundtripping
They refer to this document and appear to follow it: https://www.xml.com/pub/a/2006/05/31/converting-between-xml-and-json.html
Key detail from that doc:
<e>text</e> -> "e": "text"
<e name="value">text</e> -> "e": { "@name": "value", "#text": "text" }
Weirder case I tested out, test.xml:
<e E="e-val">
something
<a A="a-val">text</a>
<a>text</a>
something else
</e>
parse:
import json
import xmltodict

with open("test.xml") as f:
    contents = f.read()
d = xmltodict.parse(contents)
print(json.dumps(d, indent=2))
Output:
{
  "e": {
    "@E": "e-val",
    "a": [
      {
        "@A": "a-val",
        "#text": "text"
      },
      "text"
    ],
    "#text": "something\n \n \n something else"
  }
}
Closed by adding this as the postprocessor argument in xmltodict.parse
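The postprocessor itself is not reproduced in this thread. As a rough sketch of what one could look like, assuming the only change needed is renaming the #text key to $: xmltodict calls a postprocessor with (path, key, value) and uses the returned (key, value) pair instead. The function name rename_text_key is hypothetical, not taken from the actual fix.

```python
def rename_text_key(path, key, value):
    """xmltodict-style postprocessor: emit inner text under "$" instead of "#text".

    path is the list of (element name, attributes) pairs leading to the
    current item; it is unused in this sketch.
    """
    if key == "#text":
        return "$", value
    # Attributes ("@name") and child elements pass through unchanged.
    return key, value


# Direct calls, mimicking the arguments the parser would pass in.
renamed = rename_text_key([("e", None)], "#text", "text")
unchanged = rename_text_key([("e", None)], "@name", "value")
print(renamed)
print(unchanged)
```

It would be passed as xmltodict.parse(contents, postprocessor=rename_text_key). This only renames the key where xmltodict already emits #text; with default settings, elements without attributes still come back as bare strings.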
The XML parser DSP used appears to put inner text as $, while xmltodict in Python puts it as #text if there are attributes, and as the literal field value if there are no attributes (inconsistent). E.g.
Input XML:
DSP Scala output:
xmltodict: