DataONEorg / object-formats

DataONE Object Formats controlled vocabulary
Apache License 2.0

Issues with WaterML entries #6

Closed twhiteaker closed 3 years ago

twhiteaker commented 3 years ago

The WaterML entries are:

  <objectFormat>
    <formatId>http://www.cuahsi.org/waterML/1.0/</formatId>
    <formatName>Water Markup Language, version 1.0</formatName>
    <formatType>METADATA</formatType>
    <mediaType name="text/xml"/>
    <extension>xml</extension>
  </objectFormat>
  <objectFormat>
    <formatId>http://www.cuahsi.org/waterML/1.1/</formatId>
    <formatName>Water Markup Language, version 1.0</formatName>
    <formatType>METADATA</formatType>
    <mediaType name="text/xml"/>
    <extension>xml</extension>
  </objectFormat>
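As a rough illustration of how the two entries above could be checked mechanically, here is a minimal Python sketch that compares the version at the end of each `formatId` with the version in its `formatName`. The wrapper element `objectFormatList`, the function name `version_mismatches`, and the regular expressions are assumptions made for this example and are not part of the repository's tooling; the real vocabulary file may use namespaces that this sketch ignores.

```python
import re
import xml.etree.ElementTree as ET

# The two entries quoted above, wrapped in a root element so they parse as
# one document. The wrapper name is an assumption for this sketch.
ENTRIES = """
<objectFormatList>
  <objectFormat>
    <formatId>http://www.cuahsi.org/waterML/1.0/</formatId>
    <formatName>Water Markup Language, version 1.0</formatName>
    <formatType>METADATA</formatType>
    <mediaType name="text/xml"/>
    <extension>xml</extension>
  </objectFormat>
  <objectFormat>
    <formatId>http://www.cuahsi.org/waterML/1.1/</formatId>
    <formatName>Water Markup Language, version 1.0</formatName>
    <formatType>METADATA</formatType>
    <mediaType name="text/xml"/>
    <extension>xml</extension>
  </objectFormat>
</objectFormatList>
"""


def version_mismatches(xml_text):
    """Return (formatId, formatName) pairs whose version numbers disagree."""
    root = ET.fromstring(xml_text)
    mismatches = []
    for fmt in root.iter("objectFormat"):
        format_id = fmt.findtext("formatId", default="")
        name = fmt.findtext("formatName", default="")
        # Version at the end of the identifier, e.g. ".../waterML/1.1/"
        id_version = re.search(r"/(\d+(?:\.\d+)+)/?$", format_id)
        # Version stated in the human-readable name, e.g. "version 1.0"
        name_version = re.search(r"version\s+(\d+(?:\.\d+)+)", name)
        if (id_version and name_version
                and id_version.group(1) != name_version.group(1)):
            mismatches.append((format_id, name))
    return mismatches


if __name__ == "__main__":
    for format_id, name in version_mismatches(ENTRIES):
        print(f"{format_id} is named {name!r}")
```

Run against the entries as quoted, this flags only the 1.1 entry, whose name still says version 1.0.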

Some possible issues I noticed:

mbjones commented 3 years ago

@twhiteaker Thanks for the detailed review, super helpful. My thoughts on your points:

datadavev commented 3 years ago

We don't currently index WaterML. My impression is that data contained in WaterML documents is more analogous to data that might be contained within EML documents than CF. The reason is that CF metadata terms tend to be more about dataset variables (and so the expression can be highly specific to a dataset), whereas WaterML builds largely on the ISO19000 standards and so generally includes terms more broadly consistent with those used for general discovery. My impression could be off though - the distinction between data and metadata is not a sharp edge.

twhiteaker commented 3 years ago

WaterML was designed to include both metadata and data. Here's an example, and you can see the bulk of the payload is the time series of streamflow values (the <value> elements). https://waterservices.usgs.gov/nwis/iv/?sites=08158000&parameterCd=00060&period=P1D&format=waterml

The part that was throwing me off was that, in the DataONE list, WaterML has type METADATA, when really a WaterML file has both metadata and data. If the format type is just used as a cue for indexing (which it sounds like it currently is not), perhaps METADATA is the best fit just to make sure the indexing happens. Otherwise, I would use DATA because WaterML was designed with data at the core of the thought process, and metadata built around it to properly describe the data.

mbjones commented 3 years ago

OK, I started a PR #10 on branch feature_6_waterml that fixes the typo in the name of WaterML 1.1, and changes METADATA to DATA. I think that represents all that can and should be fixed in this change request. Review appreciated. We can merge this to develop if there are no comments in the near future.

twhiteaker commented 3 years ago

@mbjones I was trying to find out more about why WaterML might have been tagged as METADATA. I searched in DataONE for the phrase "WaterML" and only this result came up, which doesn't appear to have anything to do with WaterML (it has an attribute water_ml, which represents water in milliliters). If the formats in the list were drawn from DataONE members, where are the WaterML examples?

I ask because with WaterML 1.0 and 1.1, it is possible to have a document that just lists observation locations (sites) and variable descriptions. I don't know why anyone would archive that instead of that+datavalues, but if they did, then maybe that's why WaterML was tagged as METADATA.

Is there a way to search DataONE by file format?

If we can't track down any examples, then I think DATA is the better fit, and what I see in PR #10 looks fine.

mbjones commented 3 years ago

It was probably listed as METADATA when CUAHSI was working on becoming a member repository, and they were publishing mainly in WaterML. We never ended up harvesting from their system, so I don't think DataONE has any WaterML in the catalog. Now, CUAHSI has talked more about using HydroShare, which exposes schema.org metadata, so WaterML may never arise as a necessary format. But it's registered. I don't see any harm in making it into a DATA format type for the time being, given that it is mainly used to encode data values. So the PR I proposed made that change.

And yes, you can search DataONE by formatId by including formatId in query or facet fields in a Solr query like this: https://cn.dataone.org/cn/v2/query/solr/?q=*:*&fl=identifier,formatId,replicaMN,dataUrl&facet=true&facet.field=formatId&rows=0&wt=json
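For anyone who wants to script that query, here is a minimal Python sketch that rebuilds the URL above from its parameters and turns a Solr facet response into a dictionary of counts per `formatId`. The helper names `build_format_query_url` and `facet_counts` are made up for this example, and the parsing assumes Solr's default JSON facet layout (a flat list alternating value and count), which may differ if the endpoint is configured otherwise.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL in the comment above.
SOLR_ENDPOINT = "https://cn.dataone.org/cn/v2/query/solr/"


def build_format_query_url(format_id=None, endpoint=SOLR_ENDPOINT):
    """Build a facet-by-formatId query URL.

    With no format_id this reproduces the query above (q=*:*). With a
    format_id, the query is restricted to that one format instead.
    """
    query = "*:*" if format_id is None else f'formatId:"{format_id}"'
    params = [
        ("q", query),
        ("fl", "identifier,formatId,replicaMN,dataUrl"),
        ("facet", "true"),
        ("facet.field", "formatId"),
        ("rows", "0"),  # facet counts only, no documents
        ("wt", "json"),
    ]
    return endpoint + "?" + urlencode(params)


def facet_counts(response, field="formatId"):
    """Convert Solr's flat [value, count, value, count, ...] list to a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    print(build_format_query_url())
    print(build_format_query_url("http://www.cuahsi.org/waterML/1.1/"))
```

Fetching is left out so the sketch runs offline; passing the URL to `urllib.request.urlopen` and the body to `json.load` should yield a dictionary that `facet_counts` can read, assuming the service is reachable and returns the default layout.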

twhiteaker commented 3 years ago

PR looks good to me!

mbjones commented 3 years ago

Merged to develop, will be released in 1.23.