b-cube / semantics-preprocessing

initial text preprocessors for the triplestore and feature classification

Add the nested dataset parsing for THREDDS catalogs #62

Closed. roomthily closed this 9 years ago

roomthily commented 9 years ago

THREDDS catalogs can express the same information with multiple structures.

A dataset with date, data size and access values (in multiple ways!)

<dataset name="3982-a.nc" ID="TSdata/PV_SHELF/3982-a.nc" urlPath="TSdata/PV_SHELF/3982-a.nc">
  <dataSize units="Mbytes">1.350</dataSize>
  <date type="modified">2014-12-02T21:12:29Z</date>
</dataset>

or

<thredds:dataset name="L3_ozavg_n7t_198904.txt" ID="/opendap/hyrax/Nimbus7_TOMS_Level3/TOMSN7L3mtoz.008/1989/L3_ozavg_n7t_198904.txt">
    <thredds:dataSize units="bytes">162783</thredds:dataSize>
    <thredds:date type="modified">2011-02-24T19:12:17</thredds:date>
    <thredds:access serviceName="file" urlPath="/Nimbus7_TOMS_Level3/TOMSN7L3mtoz.008/1989/L3_ozavg_n7t_198904.txt"/>
</thredds:dataset>

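where the difference that matters is not the namespace but how the access route is given: as a urlPath attribute on the dataset itself in the first case, and as a child access element in the second.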
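A minimal sketch of reading both dataset shapes with one function, assuming Python and the standard library `xml.etree.ElementTree`. The helper names (`local`, `child`, `parse_dataset`) and the returned dict layout are hypothetical and not taken from the repository. It matches elements by local name so the prefixed and unprefixed forms are handled the same way, and it prefers a child access element over the dataset's own urlPath attribute.

```python
import xml.etree.ElementTree as ET

# The THREDDS InvCatalog 1.0 namespace; only needed to build namespaced samples.
THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"


def local(tag):
    """Drop the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def child(elem, name):
    """Return the first direct child with the given local name, or None."""
    for c in elem:
        if local(c.tag) == name:
            return c
    return None


def parse_dataset(elem):
    """Pull name, ID, size, modified date and access path from a dataset."""
    size = child(elem, "dataSize")
    date = child(elem, "date")
    access = child(elem, "access")
    # Access route: a child <access urlPath="..."/> if present,
    # otherwise the urlPath attribute on the dataset element itself.
    if access is not None:
        url_path = access.get("urlPath")
    else:
        url_path = elem.get("urlPath")
    return {
        "name": elem.get("name"),
        "id": elem.get("ID"),
        "size": (float(size.text), size.get("units")) if size is not None else None,
        "modified": date.text if date is not None else None,
        "url_path": url_path,
    }
```

Matching on local names is a deliberate shortcut here: it ignores whether the document declares the THREDDS namespace at all, which fits catalogs that are not strictly valid.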

A dataset can have one metadata element

<metadata inherited="true">
  <serviceName>allServices</serviceName>
  <publisher>
    <name vocabulary="DIF">USGS/ER/WHCMSC/xxxxxxxxx</name>
    <contact url="http://www.xxxxxx.gov/" email="xxxxxx@xxxxx.gov" />
  </publisher>
</metadata>

and this is inherited from one of the service elements at the same level as the dataset that is the parent of the metadata. That set of service elements can also hold multiple references to some metadata service (ISO, etc.).

or

<metadata xlink:href="http://data.eol.ucar.edu/jedi/catalog/ucar.ncar.eol.dataset.106_359.metadata.xml" metadataType="THREDDS" inherited="true"/>

or some actual metadata endpoint: http://data.eol.ucar.edu/jedi/catalog/ucar.ncar.eol.dataset.106_277.metadata.xml
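The inline and linked metadata forms above could be told apart roughly like this. It is a sketch only: `describe_metadata` and its return keys are hypothetical names, and nothing is fetched from the linked URL. The one fixed fact it relies on is the standard XLink namespace for the href attribute.

```python
import xml.etree.ElementTree as ET

# xlink:href as ElementTree exposes it, using the standard XLink namespace.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def describe_metadata(elem):
    """Summarize a metadata element without following any link."""
    return {
        # inherited="true" means the content applies to nested datasets too.
        "inherited": elem.get("inherited") == "true",
        # Set when the metadata lives in another document, else None.
        "href": elem.get(XLINK_HREF),
        "metadata_type": elem.get("metadataType"),
        # Local names of inline children (serviceName, publisher, ...).
        "inline": [c.tag.rsplit("}", 1)[-1] for c in elem],
    }
```

A caller could then branch on `href`: parse the inline children when it is None, or queue the linked document for a separate request when it is set.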

And then we can have multiple dataset children or multiple catalogRef children under a dataset. And a catalogRef can have its own metadata child (probably also the nested dataset child). Wow THREDDS, let's just not do anything the same way twice.
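One way to cope with that nesting is a recursive walk that treats dataset and catalogRef alike and records each node's parent. The sketch below is an assumption about how this could be done, not the code in the commits referenced in this thread. The `walk` generator and its dict keys are hypothetical, and it assumes a catalogRef may carry its label in xlink:title when it has no name attribute.

```python
import xml.etree.ElementTree as ET

XLINK_TITLE = "{http://www.w3.org/1999/xlink}title"


def local(tag):
    """Drop the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def walk(elem, parent_id=None, depth=0):
    """Yield one record per dataset or catalogRef found under elem."""
    for c in elem:
        kind = local(c.tag)
        if kind not in ("dataset", "catalogRef"):
            continue
        node_id = c.get("ID")
        yield {
            "kind": kind,
            "id": node_id,
            "label": c.get("name") or c.get(XLINK_TITLE),
            "parent": parent_id,
            "depth": depth,
            # Either kind of node may carry its own metadata child.
            "has_metadata": any(local(g.tag) == "metadata" for g in c),
        }
        # Recurse into both kinds, since either may hold nested children.
        yield from walk(c, node_id, depth + 1)
```

Passing the parent's ID down the recursion keeps parentage explicit, which matters when a node has no ID of its own and `parent` ends up as None.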

roomthily commented 9 years ago

See 8751ffd

roomthily commented 9 years ago

See 9cc6d7d.

Still to do - add the excludes xpath list updates.

This should, knock on wood, handle most of the content even if the service isn't "valid".

roomthily commented 9 years ago

Note - I am not using an element value to generate a hash for the missing ID. The text is often pretty static across instances (it's always OPeNDAP for type and often '/opendap/hyrax' for the path) so there's not much to ensure a unique ID across graphs and we could get confusing parentage. We're not running some sort of 70s commune in metadata form here.
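To illustrate the concern, here is a small sketch of what a value-based hash would do with that static text. `value_hash` is a hypothetical helper, and SHA-1 over the joined values is just one assumed scheme, not something the project uses.

```python
import hashlib


def value_hash(*values):
    """Hash element values only, with no catalog-specific context."""
    joined = "|".join(values)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


# Two elements from two unrelated catalogs, both carrying the usual
# static values for type and path.
id_from_catalog_a = value_hash("OPeNDAP", "/opendap/hyrax")
id_from_catalog_b = value_hash("OPeNDAP", "/opendap/hyrax")

# Identical input gives an identical digest, so both elements would
# share one ID and appear to have the same parent across graphs.
collides = id_from_catalog_a == id_from_catalog_b
```

Since the digest depends only on the values, any two catalogs with the same type and path would collapse into one node.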

roomthily commented 9 years ago

See 5238229.