GSA / datagov-wptheme

Data.gov WordPress Theme (obsolete)
https://www.data.gov

Online Linkage mandatory for Geospatial records? #559

Open torrin47 opened 9 years ago

torrin47 commented 9 years ago

Moving this issue from here for more traction and visibility: https://github.com/GSA/ckanext-geodatagov/issues/92

We've finally been able to start testing harvests of native geospatial records, and are finding that FGDC records that have no onlink parameter fail the harvest with an error message of "No resources invalid metadata". Since online linkage is a mandatory-if-applicable field, is there a reason this particular rule is being enforced? Could it possibly be relaxed?

We've confirmed that the same requirement exists for ISO records - records without a MD_DigitalTransferOptions linkage element are rejected as invalid. We appreciate that online linkages are at the core of what Project Open Data is all about, but object to inconsistent validation standards. Data.json files that lack a distribution section do pass validation.
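
For anyone trying to reproduce this, here is a rough sketch of how one might check a record for online linkages before submitting it to a harvest. It is not the validation code used by ckanext-geodatagov; the function names are hypothetical, and it assumes FGDC CSDGM records use un-namespaced `onlink` elements and ISO records use the standard `gmd` namespace.

```python
import xml.etree.ElementTree as ET

# Standard ISO 19139 gmd namespace.
GMD = "http://www.isotc211.org/2005/gmd"


def fgdc_online_links(xml_text):
    """Return the non-empty <onlink> values found anywhere in an FGDC record."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter("onlink")
        if el.text and el.text.strip()
    ]


def iso_online_links(xml_text):
    """Return non-empty gmd:URL values nested under MD_DigitalTransferOptions."""
    root = ET.fromstring(xml_text)
    links = []
    for options in root.iter(f"{{{GMD}}}MD_DigitalTransferOptions"):
        for url in options.iter(f"{{{GMD}}}URL"):
            if url.text and url.text.strip():
                links.append(url.text.strip())
    return links
```

An empty list from either function marks a record that, per the behavior described above, would likely be rejected by the harvest even if the rest of its metadata is complete.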

kvuppala commented 9 years ago

@torrin47 we will look into this suggestion and discuss the recommendation to relax the rule with the data.gov and geospatial teams.

JJediny commented 8 years ago

@torrin47 can you provide the service(s) you are trying to harvest or the software/standards being used? When you say "native" geospatial records, do you mean static datasets hosted on a WAF or FTP? Are you trying to harvest them individually? I'm assuming this isn't the case, so it'd be good to get a better idea of what records are being harvested and how.

torrin47 commented 8 years ago

The EPA has Geospatial records divided out into two WAFs by format (FGDC CSDGM and ISO) for harvesting by Data.gov CKAN: https://edg.epa.gov/WAFer_harvest/

Non-geospatial records are made available in a standalone data.json file: https://edg.epa.gov/data-nonspatial-harvest.json

We fully support the ideal of making all datasets available through downloads and open APIs, but there are plenty of valid reasons why a dataset might not currently be available in such a manner. It still serves the public interest to allow these records without online linkages to be listed in data.gov.

JJediny commented 8 years ago

It seems these records do include distribution information: direct access to downloadable URI(s), ESRI REST URL(s), and other online resources nested within gmd:onLine element(s). It would help to know why static metadata is being generated when it would be possible to harvest the ArcGIS REST services and/or the EPA EDG geoportal directly.

    {
      "name" : "EPA [Geodata]",
      "type" : "esri-mapServer-group",
      "url" : "http://geodata.epa.gov/ArcGIS/rest/services/"
    },
    {
      "name" : "EPA [Environmental Data Gateway]",
      "type" : "esri-mapServer-group",
      "url" : "https://edg.epa.gov/ArcGIS/rest/services/"
    },
    {
      "name" : "EPA [MyEnviroment]",
      "type" : "esri-mapServer-group",
      "url" : "http://map23.epa.gov/ArcGIS/rest/services/"
    },
    {
      "name" : "EPA [Office of Water]",
      "type" : "esri-mapServer-group",
      "url" : "http://watersgeo.epa.gov/ArcGIS/rest/Services/"
    },
    {
      "name" : "EPA [EnviroAtlas]",
      "type" : "esri-mapServer-group",
      "url" : "http://enviroatlas.epa.gov/arcgis/rest/services/"
    },
    {
      "name" : "EPA [Environmental Justice]",
      "type" : "esri-mapServer-group",
      "url" : "http://ejscreen.epa.gov/arcgis/rest/services/"
    },
    {
      "name" : "ESRI [Federal Data Services]",
      "type" : "esri-mapServer-group",
      "url" : "http://server.arcgisonline.com/ArcGIS/rest/services/"
    },

I can't say that I see the value in registering metadata alone without providing all the accessible resources/services. If it is a matter of registering these records as a WAF as is, then changes to the CKAN harvester would be needed to hard-code the mapping of gmd:onLine to Distribution, which doesn't seem practical. Perhaps importing the records into a standalone pyCSW instance could also resolve the issue, if it wouldn't be hard to test with a few records?
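
To make the gmd:onLine to Distribution mapping concrete, here is a minimal sketch of what such a step might look like. It is only an illustration, not the CKAN harvester's actual code: the function name is hypothetical, and the choice of `accessURL` and `title` as output keys assumes Project Open Data style distribution entries.

```python
import xml.etree.ElementTree as ET

# Standard ISO 19139 namespaces.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def online_resources_to_distribution(xml_text):
    """Build data.json-style distribution entries from gmd:onLine resources."""
    root = ET.fromstring(xml_text)
    distribution = []
    for res in root.findall(".//gmd:onLine/gmd:CI_OnlineResource", NS):
        url = res.findtext("gmd:linkage/gmd:URL", default="", namespaces=NS).strip()
        if not url:
            continue  # skip online resources that carry no URL
        entry = {"accessURL": url}
        name = res.findtext(
            "gmd:name/gco:CharacterString", default="", namespaces=NS
        ).strip()
        if name:
            entry["title"] = name
        distribution.append(entry)
    return distribution
```

A record with no gmd:onLine elements yields an empty list here, which is the case the original question is about.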

torrin47 commented 8 years ago

The referenced WAFs reflect the content in the EPA EDG GeoPortal server and are refreshed nightly. Harvesting via WAF proved to be more reliable and easier to debug than harvesting via CSW after much testing. Direct harvesting of ArcGIS Server services might be practical for certain use cases, but it provides compliance with neither FGDC nor data.gov minimal validation standards.

And again, we strongly support the principles of open data by making data resources available through multiple public online channels, but believe there are some legitimate cases where the data is not presently available at a public URL, but the public nonetheless benefits from having access to the complete metadata record (abstract, contact info, etc). Blocking records with no online resource that are otherwise complete is inconsistent with the principles or specifications of Project Open Data.