databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Empty value depending on order #577

Closed tooptoop4 closed 2 years ago

tooptoop4 commented 2 years ago

XML file one shows all values as expected

<?xml version='1.0' ?>
<!DOCTYPE datasets SYSTEM "http://www.cs.washington.edu/research/projects/xmltk/xmldata/data/nasa/dataset_053.dtd">
<datasets>
 <dataset subject="astronomy" xmlns:xlink="http://www.w3.org/XML/XLink/0.9">
  <isdentifier>I_5.xml</isdentifier>
  <title>Proper Motions of Stars in the Zone Catalogue -40 to -52 degrees
of 20843 Stars for 1900</title>
  <altname type="ADC">1005</altname>
  <altname type="CDS">I/5</altname>
  <altname type="brief">Proper Motions in Cape Zone Catalogue -40/-52</altname>
  <reference>
   <source>
    <other>
     <title>Proper Motions of Stars in the Zone Catalogue -40 to -52 degrees
of 20843 Stars for 1900</title>
     <author>
      <initial>J</initial>
      <initial>H</initial>
      <lastName>Spencer</lastName>
     </author>
     <author>
      <initial>J</initial>
      <lastName>Jackson</lastName>
     </author>
     <name>His Majesty's Stationery Office, London</name>
     <publisher>???</publisher>
     <city>???</city>
     <date>
      <year>1936</year>
     </date>
    </other>
   </source>
  </reference>
  <keywords parentListURL="http://messier.gsfc.nasa.gov/xml/keywordlists/adc_keywords.html">
   <keyword xlink:href="Positional_data.html">Positional data</keyword>
   <keyword xlink:href="Proper_motions.html">Proper motions</keyword>
  </keywords>
  <descriptions>
   <description>
    <para>This catalog, listing the proper motions of 20,843 stars
    from the Cape Astrographic Zones, was compiled from three series of
    photographic plates. The plates were taken at the Royal Observatory,
    Cape of Good Hope, in the following years: 1892-1896, 1897-1910,
    1923-1928. Data given include centennial proper motion, photographic
    and visual magnitude, Harvard spectral type, Cape Photographic
    Durchmusterung (CPD) identification, epoch, right ascension and
    declination for 1900.</para>
   </description>
   <details/>
  </descriptions>
  <tableHead>
   <tableLinks>
    <tableLink xlink:href="czc.dat">
     <title>The catalogue</title>
    </tableLink>
   </tableLinks>
   <fields>
    <field>
     <name>---</name>
     <definition>Number 5</definition>
     <units>---</units>
    </field>
    <field>
     <name>CZC</name>
     <definition>Catalogue Identification Number</definition>
     <units>---</units>
    </field>
    <field>
     <name>Vmag</name>
     <definition>Visual Magnitude</definition>
     <units>mag</units>
    </field>
    <field>
     <name>RAh</name>
     <definition>Right Ascension for 1900 hours</definition>
     <units>h</units>
    </field>
    <field>
     <name>RAm</name>
     <definition>Right Ascension for 1900 minutes</definition>
     <units>min</units>
    </field>
    <field>
     <name>RAcs</name>
     <definition>Right Ascension seconds in 0.01sec 1900</definition>
     <units>0.01s</units>
    </field>
    <field>
     <name>DE-</name>
     <definition>Declination Sign</definition>
     <units>---</units>
    </field>
    <field>
     <name>DEd</name>
     <definition>Declination for 1900 degrees</definition>
     <units>deg</units>
    </field>
    <field>
     <name>DEm</name>
     <definition>Declination for 1900 arcminutes</definition>
     <units>arcmin</units>
    </field>
    <field>
     <name>DEds</name>
     <definition>Declination for 1900 arcseconds</definition>
     <units>0.1arcsec</units>
    </field>
    <field>
     <name>Ep-1900</name>
     <definition>Epoch -1900</definition>
     <units>cyr</units>
    </field>
    <field>
     <name>CPDZone</name>
     <definition>Cape Photographic
                                        Durchmusterung Zone</definition>
     <units>---</units>
    </field>
    <field>
     <name>CPDNo</name>
     <definition>Cape Photographic Durchmusterung Number</definition>
     <units>---</units>
    </field>
    <field>
     <name>Pmag</name>
     <definition>Photographic Magnitude</definition>
     <units>mag</units>
    </field>
    <field>
     <name>Sp</name>
     <definition>HD Spectral Type</definition>
     <units>---</units>
    </field>
    <field>
     <name>pmRAs</name>
     <definition>Proper Motion in RA
      <footnote>
       <para>the relation is   pmRA = 15 * pmRAs * cos(DE)
    if pmRAs is expressed in s/yr and pmRA in arcsec/yr</para>
      </footnote>
     </definition>
     <units>0.1ms/yr</units>
    </field>
    <field>
     <name>pmRA</name>
     <definition>Proper Motion in RA</definition>
     <units>mas/yr</units>
    </field>
    <field>
     <name>pmDE</name>
     <definition>Proper Motion in Dec</definition>
     <units>mas/yr</units>
    </field>
   </fields>
  </tableHead>
 </dataset>
</datasets>

XML file two shows isdentifier value as null

<?xml version='1.0' ?>
<!DOCTYPE datasets SYSTEM "http://www.cs.washington.edu/research/projects/xmltk/xmldata/data/nasa/dataset_053.dtd">
<datasets>
 <dataset subject="astronomy" xmlns:xlink="http://www.w3.org/XML/XLink/0.9">
  <title>Proper Motions of Stars in the Zone Catalogue -40 to -52 degrees
of 20843 Stars for 1900</title>
  <altname type="ADC">1005</altname>
  <altname type="CDS">I/5</altname>
  <altname type="brief">Proper Motions in Cape Zone Catalogue -40/-52</altname>
  <reference>
   <source>
    <other>
     <title>Proper Motions of Stars in the Zone Catalogue -40 to -52 degrees
of 20843 Stars for 1900</title>
     <author>
      <initial>J</initial>
      <initial>H</initial>
      <lastName>Spencer</lastName>
     </author>
     <author>
      <initial>J</initial>
      <lastName>Jackson</lastName>
     </author>
     <name>His Majesty's Stationery Office, London</name>
     <publisher>???</publisher>
     <city>???</city>
     <date>
      <year>1936</year>
     </date>
    </other>
   </source>
  </reference>
  <keywords parentListURL="http://messier.gsfc.nasa.gov/xml/keywordlists/adc_keywords.html">
   <keyword xlink:href="Positional_data.html">Positional data</keyword>
   <keyword xlink:href="Proper_motions.html">Proper motions</keyword>
  </keywords>
  <descriptions>
   <description>
    <para>This catalog, listing the proper motions of 20,843 stars
    from the Cape Astrographic Zones, was compiled from three series of
    photographic plates. The plates were taken at the Royal Observatory,
    Cape of Good Hope, in the following years: 1892-1896, 1897-1910,
    1923-1928. Data given include centennial proper motion, photographic
    and visual magnitude, Harvard spectral type, Cape Photographic
    Durchmusterung (CPD) identification, epoch, right ascension and
    declination for 1900.</para>
   </description>
   <details/>
  </descriptions>
  <tableHead>
   <tableLinks>
    <tableLink xlink:href="czc.dat">
     <title>The catalogue</title>
    </tableLink>
   </tableLinks>
   <fields>
    <field>
     <name>---</name>
     <definition>Number 5</definition>
     <units>---</units>
    </field>
    <field>
     <name>CZC</name>
     <definition>Catalogue Identification Number</definition>
     <units>---</units>
    </field>
    <field>
     <name>Vmag</name>
     <definition>Visual Magnitude</definition>
     <units>mag</units>
    </field>
    <field>
     <name>RAh</name>
     <definition>Right Ascension for 1900 hours</definition>
     <units>h</units>
    </field>
    <field>
     <name>RAm</name>
     <definition>Right Ascension for 1900 minutes</definition>
     <units>min</units>
    </field>
    <field>
     <name>RAcs</name>
     <definition>Right Ascension seconds in 0.01sec 1900</definition>
     <units>0.01s</units>
    </field>
    <field>
     <name>DE-</name>
     <definition>Declination Sign</definition>
     <units>---</units>
    </field>
    <field>
     <name>DEd</name>
     <definition>Declination for 1900 degrees</definition>
     <units>deg</units>
    </field>
    <field>
     <name>DEm</name>
     <definition>Declination for 1900 arcminutes</definition>
     <units>arcmin</units>
    </field>
    <field>
     <name>DEds</name>
     <definition>Declination for 1900 arcseconds</definition>
     <units>0.1arcsec</units>
    </field>
    <field>
     <name>Ep-1900</name>
     <definition>Epoch -1900</definition>
     <units>cyr</units>
    </field>
    <field>
     <name>CPDZone</name>
     <definition>Cape Photographic
                                        Durchmusterung Zone</definition>
     <units>---</units>
    </field>
    <field>
     <name>CPDNo</name>
     <definition>Cape Photographic Durchmusterung Number</definition>
     <units>---</units>
    </field>
    <field>
     <name>Pmag</name>
     <definition>Photographic Magnitude</definition>
     <units>mag</units>
    </field>
    <field>
     <name>Sp</name>
     <definition>HD Spectral Type</definition>
     <units>---</units>
    </field>
    <field>
     <name>pmRAs</name>
     <definition>Proper Motion in RA
      <footnote>
       <para>the relation is   pmRA = 15 * pmRAs * cos(DE)
    if pmRAs is expressed in s/yr and pmRA in arcsec/yr</para>
      </footnote>
     </definition>
     <units>0.1ms/yr</units>
    </field>
    <field>
     <name>pmRA</name>
     <definition>Proper Motion in RA</definition>
     <units>mas/yr</units>
    </field>
    <field>
     <name>pmDE</name>
     <definition>Proper Motion in Dec</definition>
     <units>mas/yr</units>
    </field>
   </fields>
  </tableHead>
  <isdentifier>I_5.xml</isdentifier>
 </dataset>
</datasets>

code used:

scala> val df = spark.read.format("xml").option("rowTag","dataset").load("sii.txt")
df: org.apache.spark.sql.DataFrame = [_subject: string, _xmlns:xlink: string ... 7 more fields]

scala> df.show()
+---------+--------------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+--------------------+
| _subject|        _xmlns:xlink|             altname|        descriptions|isdentifier|            keywords|           reference|           tableHead|               title|
+---------+--------------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+--------------------+
|astronomy|http://www.w3.org...|[{1005, ADC}, {I/...|{{This catalog, l...|    I_5.xml|{http://messier.g...|{{{[{[J, H], Spen...|{{[{Number 5, ---...|Proper Motions of...|
+---------+--------------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+--------------------+

scala> val df = spark.read.format("xml").option("rowTag","dataset").load("sii.txt")
df: org.apache.spark.sql.DataFrame = [_subject: string, _xmlns:xlink: string ... 7 more fields]

scala> df.show()
+---------+--------------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+--------------------+
| _subject|        _xmlns:xlink|             altname|        descriptions|isdentifier|            keywords|           reference|           tableHead|               title|
+---------+--------------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+--------------------+
|astronomy|http://www.w3.org...|[{1005, ADC}, {I/...|{{This catalog, l...|       null|{http://messier.g...|{{{[{[J, H], Spen...|{{[{Number 5, ---...|Proper Motions of...|
+---------+--------------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+--------------------+

scala>

cc @srowen

srowen commented 2 years ago

I can't make out what you're trying to demonstrate here. Describe it, and simplify?

tooptoop4 commented 2 years ago

isdentifier value appears as null in 2nd example @srowen

srowen commented 2 years ago

It doesn't appear in any of your output - please boil down and format the output. The two code snippets look identical too.

tooptoop4 commented 2 years ago

it does appear in 1st output, see I_5.xml

srowen commented 2 years ago

I see it now, yeah. The problem is this:

                    <definition>Proper Motion in RA
                        <footnote>
                            <para>the relation is   pmRA = 15 * pmRAs * cos(DE)
                                if pmRAs is expressed in s/yr and pmRA in arcsec/yr</para>
                        </footnote>
                    </definition>

Is that XML in the body of definition meant to be escaped? Regardless, it's a 'bug' that this ends up throwing off the parser, and it ultimately comes down to not supporting mixed content (an element that contains text and also child tags). That's why I wonder whether that content is intended, as it would be unusual for "tabular"-like XML representations (i.e. what is the desired type of definition?).
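
To illustrate the mixed-content point, here is a minimal sketch that lists elements containing both non-blank text and child tags, so such spots can be found in a file before it is loaded. It is an assumption-laden helper, not part of spark-xml: the object name MixedContentCheck, the method mixedElements and the trimmed-down sample are hypothetical, and it uses only the JDK's DOM parser. The sample leaves out the DOCTYPE line from the files above, since the JDK parser may otherwise try to fetch the external DTD.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.{Element, Node}

// Hypothetical helper (not a spark-xml API): finds mixed-content elements.
object MixedContentCheck {

  // Trimmed-down version of the <field> that srowen pointed at.
  val sample: String =
    """<field>
      |  <name>pmRAs</name>
      |  <definition>Proper Motion in RA
      |    <footnote><para>the relation is pmRA = 15 * pmRAs * cos(DE)</para></footnote>
      |  </definition>
      |  <units>0.1ms/yr</units>
      |</field>""".stripMargin

  /** Tag names of elements that have both non-blank text and child elements. */
  def mixedElements(xml: String): List[String] = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))

    def walk(el: Element): List[String] = {
      val kids = el.getChildNodes
      val nodes = (0 until kids.getLength).map(i => kids.item(i)).toList
      val childElems = nodes.collect { case e: Element => e }
      // Whitespace-only text between tags is indentation, not mixed content.
      val hasText = nodes.exists { n =>
        n.getNodeType == Node.TEXT_NODE && n.getNodeValue.trim.nonEmpty
      }
      val here = if (hasText && childElems.nonEmpty) List(el.getTagName) else Nil
      here ++ childElems.flatMap(walk)
    }

    walk(doc.getDocumentElement)
  }

  def main(args: Array[String]): Unit =
    println(mixedElements(sample)) // prints List(definition)
}
```

Run against the full files above (with the DOCTYPE removed), a check like this should flag only the definition of pmRAs, which is the element quoted in this comment. What to do with a flagged element (escape the inner markup, drop the footnote, or restructure the field) depends on the desired type of definition.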

tooptoop4 commented 2 years ago

not fixed

srowen commented 2 years ago

It's a duplicate, really, of other issues. Not fixed, yes.