clingen-data-model / clinvar-ingest

Determine how to handle inner text of textual elements #34

Closed theferrit32 closed 11 months ago

theferrit32 commented 11 months ago

The XML parser DSP used appears to put an element's inner text under the key $, while xmltodict in Python puts it under #text if the element has attributes, and as the literal field value if it has none (inconsistent).

e.g.

Input XML:

<Gene Symbol="CYP2C19" FullName="cytochrome P450 family 2 subfamily C member 19" GeneID="1557" HGNC_ID="HGNC:2621" Source="submitted" RelationshipType="within single gene">
    <Location>
        <CytogeneticLocation>10q23.33</CytogeneticLocation>
        <SequenceLocation Assembly="GRCh38"
            AssemblyAccessionVersion="GCF_000001405.38" AssemblyStatus="current"
            Chr="10" Accession="NC_000010.11"
            start="94762681" stop="94855547"
            display_start="94762681" display_stop="94855547" Strand="+"/>
        <SequenceLocation Assembly="GRCh37"
            AssemblyAccessionVersion="GCF_000001405.25" AssemblyStatus="previous"
            Chr="10" Accession="NC_000010.10"
            start="96522462" stop="96612670"
            display_start="96522462" display_stop="96612670" Strand="+"/>
    </Location>
    <OMIM fakeattr="fakeattrval">124020</OMIM>
</Gene>

DSP Scala output:

{
    "Location": {
        "CytogeneticLocation": {
            "$": "10q23.33"
        },
        "SequenceLocation": [
            {
                "@Accession": "NC_000010.11",
                "@Assembly": "GRCh38",
                "@AssemblyAccessionVersion": "GCF_000001405.38",
                "@AssemblyStatus": "current",
                "@Chr": "10",
                "@Strand": "+",
                "@display_start": "94762681",
                "@display_stop": "94855547",
                "@start": "94762681",
                "@stop": "94855547"
            },
            {
                "@Accession": "NC_000010.10",
                "@Assembly": "GRCh37",
                "@AssemblyAccessionVersion": "GCF_000001405.25",
                "@AssemblyStatus": "previous",
                "@Chr": "10",
                "@Strand": "+",
                "@display_start": "96522462",
                "@display_stop": "96612670",
                "@start": "96522462",
                "@stop": "96612670"
            }
        ]
    },
    "OMIM": {
        "$": "124020",
        "@fakeattr": "fakeattrval" // example only, I added this
    }
}

xmltodict:

{
  "Gene": {
    "@Symbol": "CYP2C19",
    "@FullName": "cytochrome P450 family 2 subfamily C member 19",
    "@GeneID": "1557",
    "@HGNC_ID": "HGNC:2621",
    "@Source": "submitted",
    "@RelationshipType": "within single gene",
    "Location": {
      "CytogeneticLocation": "10q23.33",
      "SequenceLocation": [
        {
          "@Assembly": "GRCh38",
          "@AssemblyAccessionVersion": "GCF_000001405.38",
          "@AssemblyStatus": "current",
          "@Chr": "10",
          "@Accession": "NC_000010.11",
          "@start": "94762681",
          "@stop": "94855547",
          "@display_start": "94762681",
          "@display_stop": "94855547",
          "@Strand": "+"
        },
        {
          "@Assembly": "GRCh37",
          "@AssemblyAccessionVersion": "GCF_000001405.25",
          "@AssemblyStatus": "previous",
          "@Chr": "10",
          "@Accession": "NC_000010.10",
          "@start": "96522462",
          "@stop": "96612670",
          "@display_start": "96522462",
          "@display_stop": "96612670",
          "@Strand": "+"
        }
      ]
    },
    "OMIM": {
      "@fakeattr": "fakeattrval",
      "#text": "124020"
    }
  }
}
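
The xmltodict behavior shown above can be reproduced with a minimal sketch like the following. The two one-element documents are made up for illustration (the OMIM element is borrowed from the example, with and without the fake attribute); only the default xmltodict.parse call is used.

```python
import xmltodict

# Same inner text in both documents; only the second element has an attribute.
plain = xmltodict.parse("<OMIM>124020</OMIM>")
with_attr = xmltodict.parse('<OMIM fakeattr="fakeattrval">124020</OMIM>')

# Without attributes the text is the bare field value.
print(plain)
# With attributes the text moves under "#text", next to "@fakeattr".
print(with_attr)
```

So a consumer cannot know the shape of a textual field without first checking whether the element happened to carry attributes.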

theferrit32 commented 11 months ago

Based on discussion with @larrybabb, we are reliant on the $ syntax created by the Scala codebase, so we need to make sure our serialized JSON form of XML documents is in the same format.

The Scala clinvar-ingest codebase is using this library:

https://com-lihaoyi.github.io/upickle/
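
One way to get a $-keyed shape out of xmltodict is through its parser options. The sketch below uses the force_cdata and cdata_key arguments of xmltodict.parse on a trimmed-down version of the Gene example. It is only an illustration: whether this matches the Scala output in every case (empty elements, mixed content) has not been checked here.

```python
import xmltodict

# Trimmed-down version of the Gene example from this issue.
xml = """
<Gene Symbol="CYP2C19">
    <Location>
        <CytogeneticLocation>10q23.33</CytogeneticLocation>
    </Location>
    <OMIM fakeattr="fakeattrval">124020</OMIM>
</Gene>
""".strip()

# force_cdata=True keeps inner text in a nested dict even when the element
# has no attributes; cdata_key="$" renames the text key from "#text" to "$".
doc = xmltodict.parse(xml, force_cdata=True, cdata_key="$")

print(doc["Gene"]["Location"]["CytogeneticLocation"])
print(doc["Gene"]["OMIM"])
```

With these options both textual elements have the same shape: a dict whose "$" entry holds the text.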

theferrit32 commented 11 months ago

xmltodict docs relevant to this: https://xmltodict.readthedocs.io/en/stable/README/#roundtripping

They refer to this document and appear to follow it: https://www.xml.com/pub/a/2006/05/31/converting-between-xml-and-json.html

Key detail from that doc:

<e>text</e>  ->  "e": "text"

<e name="value">text</e>  ->  "e": { "@name": "value", "#text": "text" }

Weirder case I tested out:

test.xml:

<e E="e-val">
    something
    <a A="a-val">text</a>
    <a>text</a>
    something else
</e>

parse:

import json
import xmltodict

with open("test.xml") as f:
    contents = f.read()
    d = xmltodict.parse(contents)
    print(json.dumps(d, indent=2))

Output:

{
  "e": {
    "@E": "e-val",
    "a": [
      {
        "@A": "a-val",
        "#text": "text"
      },
      "text"
    ],
    "#text": "something\n    \n    \n    something else"
  }
}

theferrit32 commented 11 months ago

Closed by passing this function as the postprocessor argument to xmltodict.parse:

https://github.com/clingen-data-model/clinvar-ingest/blob/bb288af5f083bea7fda30540d658552d411f8dd3/clinvar_ingest/reader.py#L73-L86
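
For readers who do not follow the link, here is a minimal sketch of what a postprocessor for this purpose could look like. It is not the code at the linked reader.py lines; the function name dollar_text and the trimmed-down XML are made up for illustration. xmltodict calls the postprocessor with (path, key, value) for each attribute, text node, and element, and uses the (key, value) pair it returns.

```python
import xmltodict


def dollar_text(path, key, value):
    """Hypothetical postprocessor: always expose inner text under "$"."""
    if key == "#text":
        # Text of an element that also has attributes: rename the key.
        return "$", value
    if isinstance(value, str) and not key.startswith("@"):
        # Text-only element: wrap the bare string so it matches the shape above.
        return key, {"$": value}
    # Attributes, nested dicts, and empty elements pass through unchanged.
    return key, value


xml = """
<Gene Symbol="CYP2C19">
    <Location>
        <CytogeneticLocation>10q23.33</CytogeneticLocation>
    </Location>
    <OMIM fakeattr="fakeattrval">124020</OMIM>
</Gene>
""".strip()

doc = xmltodict.parse(xml, postprocessor=dollar_text)
print(doc["Gene"]["Location"]["CytogeneticLocation"])
print(doc["Gene"]["OMIM"])
```

Under these assumptions both textual elements come out as dicts with a "$" entry, which is the shape shown in the DSP Scala output earlier in this issue.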