CSIRO-enviro-informatics / dpn-ontology

Data Provider Node ontology

Is there a missing class - :ServiceImplementation #6

Open dr-shorthair opened 7 years ago

dr-shorthair commented 7 years ago

e.g. GeoServer - which is an implementation of WFS + WMS

jyucsiro commented 7 years ago

Good question. Initially DPN-O modelled this as dpn:Service. See example of Geoserver in the Services module - https://github.com/CSIRO-LW-LD/dpn-ontology/blob/master/dpn-services.ttl

The dpns:Geoserver description declares the implemented interfaces as OWL restrictions, though we only did it for WFS.

sharon-tickell commented 7 years ago

I actually think the missing layer is 'catalog': Examples of these would be THREDDS (as in the existing eReefs DBL) or CSW (GeoNetwork, CKAN etc - pick one or more standards) or GeoServer's REST API (and I'm sure there are others). Catalogs are maintained by Sysadmins and/or data Managers on behalf of an organisation, and may have service-level agreements.

Each catalog can then be harvested to identify datasets. Datasets have descriptions, provenance, owners / custodians, licences, spatio-temporal bounds and semantic tags etc, and each one should have one or more HTTP(S)-accessible services/endpoints (I'm not sure of the best terminology there). These would implement OGC interfaces like WMS (which could be specialised as NC-WMS for thredds/erddap etc), WFS, or equivalent non-standard but well-understood data-retrieval APIs like the SensorCloud one.

In cases where you don't want to harvest a catalog, there's no reason why you couldn't have a dummy catalog that just contains hard-coded dataset definitions, accessible via SPARQL or whatever....

dr-shorthair commented 7 years ago

@jyucsiro - Am just trying to figure out the sub-classing vs instantiation logic. Currently GeoServer is a subclass of :Service and WFS is a subclass of :ServiceInterface. The 'GA geologic-unit service' might be an instance of GeoServer, but GeoServer might also be an instance of WFS. Suggest adding skos:example to each def to help document this.

@sharon-tickell - Catalog is an overloaded term :-) In some contexts a catalogue is the registry-interface. Do you mean 'service-content-summary' or 'data-discovery-interface' ?

Note that in OGC-land, 'interface' = 'set-of-operations', so a single service might implement multiple interfaces. In the case of THREDDS I think it implements both a discovery and (multiple) data-access interfaces.

sharon-tickell commented 7 years ago

The terminology overloading is tricky - particularly since we have similar-but-non-overlapping sets of jargon happening!

By 'catalog' I am meaning 'thing you can ask for a list of datasets', with a large helping of 'preferably via a standard interface like CSW or equivalent' though with exceptions for something like the XML Catalogs from THREDDS which are easy to parse and in heavy use. So I think that's the Data Discovery interface?

So for example:

GeoNetwork is an implementation of my sort of catalog: I can feed its CSW URL into a hypothetical harvesting script to find information about datasets and how they may be accessed. Usually, those datasets live elsewhere, but so long as they are accessible via standard interfaces/apis, the harvester doesn't care.

GeoServer implements both a catalog (its REST API - non-standard, but well enough documented that we should handle it) and data access APIs. I can point the harvester at the REST API URL and get back a list of all layer names and what WMS/WFS etc interfaces are enabled for each one. GeoServer also implements the data access interfaces, but from the Harvester's point of view that is incidental - it only cares that there are well understood access interfaces for each layer, not that they happen to share a base URL.

THREDDS is similar - it has a non-standard data-discovery catalog that can be harvested to identify available datasets, and a NCWMS, OPeNDAP etc address for each one. That THREDDS implements those data-access interfaces is not important: only the interface type. (this is exactly what the eReefs Harvester currently does).

I guess what I'm trying to get at in a very long-winded way is that the fact that something is a GeoServer or a THREDDS server is really unimportant from a harvesting perspective (though presumably the maintainer would need to know). What something like the DBL actually needs to know is how to get a list of datasets from a URL: so a working URL and a data-discovery API name and version.

If you have a real metadata catalog like GeoNetwork or CKAN or similar properly configured to hold all the metadata for your GeoServer or THREDDS, then you should really harvest from the CSW interface by preference anyway: the fact that the data is accessed from a GeoServer or THREDDS server is an implementation detail only.
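As a minimal sketch of the point above (a harvester needs only a working URL plus a data-discovery API name and version, and should prefer CSW when it is available), something like the following could work. The `HarvestSource` record, the `preferred_source` helper, and the example URLs and version strings are all hypothetical, not part of DPN-O or the DBL.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class HarvestSource:
    """Hypothetical record: what a harvester needs to know about one catalog."""
    url: str      # working endpoint URL
    api: str      # data-discovery API name, e.g. "CSW" or "THREDDS"
    version: str  # version of that API


def preferred_source(sources: List[HarvestSource]) -> Optional[HarvestSource]:
    """Return a CSW source if one is configured, else the first source, else None."""
    for source in sources:
        if source.api.upper() == "CSW":
            return source
    return sources[0] if sources else None


# Placeholder endpoints; note that the server software is never mentioned.
sources = [
    HarvestSource("https://example.org/thredds/catalog.xml", "THREDDS", "1.0"),
    HarvestSource("https://example.org/catalog/csw", "CSW", "2.0.2"),
]
chosen = preferred_source(sources)
print(chosen.api, chosen.url)
```

The record deliberately has no field for "GeoServer" or "THREDDS server" as software: only the discovery interface matters to the harvester.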

jyucsiro commented 7 years ago

@dr-shorthair I have a worked example for GeoServer here: https://github.com/CSIRO-LW-LD/dpn-ontology/wiki/Example:-data.gov.au-PSMA-Vic-State-Suburb-DPN

WFS does not need to be declared in the dpn:Service instance of dpns:GeoServer. I see your point about WFS - whether we model it as a class or instance. Could be either, but at the moment WFS is a class. What would an instance of the ServiceInterface WFS look like?

jyucsiro commented 7 years ago

@sharon-tickell during the ereefs project, I did start modelling dpn:Catalog but that got pretty hairy for the reasons Simon mentions above. So at the time, we left it as just a URL reference.

Would the use case here be about understanding more about the kind of catalog? whether in the context of THREDDS, CKAN, GeoNetwork, or the interfaces like CSW?

sharon-tickell commented 7 years ago

So attempting some slightly more formal definition of what I think a DBL setup needs to know (I don't think in RDF, so my apologies for the very non-standard format!)

concept 'HarvestCatalog' has:

dr-shorthair commented 7 years ago

@sharon-tickell -

> I guess what I'm trying to get at in a very long-winded way is that the fact that something is a GeoServer or a THREDDS server is really unimportant from a harvesting perspective (though presumably the maintainer would need to know). What something like the DBL actually needs to know is how to get a list of datasets from a URL: so a working URL and a data-discovery API name and version.

Yes. It is the (collection of) interfaces (each of which is a collection of operations ...) that is offered that matters. These are all just ways to lump the options into useful sized groupings that can be named. You need to know it is 'THREDDS' or 'GeoServer' if you are managing it, but it also provides a packaging of (usually unique) functionality (collection of interfaces). So it might still be useful to the end user to convey that.

sharon-tickell commented 7 years ago

@jyucsiro - yes, my current thinking on this is informed by my attempts to match the conceptual stuff with the DBL implementation. Doesn't mean it's necessarily the best way, but it matches how it currently works in my head :)

The Harvester needs to be able to work out how to get a list of datasets from a catalog. In eReefs, the Service concept ended up sort of standing in, because we had those set up as 'this one is THREDDS', and the DBL Harvest script knew how to parse THREDDS catalog XML documents. Once the list of datasets was extracted, it used plain WMS and OPeNDAP interfaces to extract additional metadata: the fact that those were implemented by a TDS instance is completely ignored (Standards FTW!)

I'd like to sort of conflate the two formally: If the DBL configuration has a list of Harvestable catalog URIs, and it knows to interpret this one as a THREDDS catalog and the next one as a CSW catalog, then they can be treated equally (and we could have a plugin architecture that lets sysadmins add support for new or custom catalog types, too). For GeoServers, it wouldn't be the GeoServer-ness that is important: only that we should use the GeoServer REST API: GeoServer happens to be the only thing I know of that implements that API, but there's nothing actually stopping something else from doing so!

The harvester doesn't need to know at all that there's a GeoNetwork behind the CSW interface, though - only that datasets can be retrieved from the URL via CSW. We could note the server type as descriptive data, but it shouldn't be required.
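The plugin idea above could be sketched roughly as follows: a registry maps a declared catalog type to a parser, and the harvester dispatches on that type alone. This is only an illustration under assumptions; the `register` decorator, the `harvest` function, the type names and the placeholder parsers are hypothetical and do not describe the actual DBL code.

```python
from typing import Callable, Dict, List

# A parser takes a catalog URL and returns dataset identifiers.
CatalogParser = Callable[[str], List[str]]

# Registry of catalog-type name -> parser; sysadmins could add entries.
PARSERS: Dict[str, CatalogParser] = {}


def register(catalog_type: str) -> Callable[[CatalogParser], CatalogParser]:
    """Decorator that registers a parser under a catalog-type name."""
    def wrap(parser: CatalogParser) -> CatalogParser:
        PARSERS[catalog_type.lower()] = parser
        return parser
    return wrap


@register("thredds")
def parse_thredds(url: str) -> List[str]:
    # Placeholder: a real parser would fetch and read the catalog XML.
    return [f"{url}#dataset-a"]


@register("csw")
def parse_csw(url: str) -> List[str]:
    # Placeholder: a real parser would issue CSW requests.
    return [f"{url}#record-1"]


def harvest(catalog_type: str, url: str) -> List[str]:
    """Dispatch on the declared catalog type only; server software is ignored."""
    try:
        parser = PARSERS[catalog_type.lower()]
    except KeyError:
        raise ValueError(f"no parser registered for catalog type {catalog_type!r}")
    return parser(url)


print(harvest("CSW", "https://example.org/csw"))
```

Supporting the GeoServer REST API, or any custom catalog, would then mean registering one more parser rather than changing the harvester itself.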

dr-shorthair commented 7 years ago

@jyucsiro - previous work in OGC uncovered that there are

(a) operations (e.g. getFeature), grouped into (b) interfaces (e.g. WFS), which can be classified as (c) service types (e.g. DataService).

And then there are (d) implementations (software) (e.g. GeoServer) which offer one or more interfaces, and (e) deployments (endpoints) (e.g. the GA Geologic Feature Service) of (implementations of) interfaces, usually (but not always) strongly bound to (f) datasets or layers.
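For readers who find the (a) to (f) layering easier to follow as a data structure, here is a rough sketch. The class names, the placeholder URL, the `GeologicUnit` layer name and the `PortrayalService` label are illustrative assumptions, not DPN-O terms; only the WFS/`DataService`/GeoServer/GA examples come from the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Interface:
    """(b) an interface: a named group of (a) operations, with (c) a service type."""
    name: str
    operations: Tuple[str, ...]
    service_type: str


@dataclass
class Implementation:
    """(d) a piece of software offering one or more interfaces."""
    name: str
    offers: List[Interface]


@dataclass
class Deployment:
    """(e) an endpoint, usually bound to (f) datasets or layers."""
    name: str
    url: str
    interfaces: List[Interface]
    implementation: Optional[Implementation] = None  # descriptive, not required
    datasets: List[str] = field(default_factory=list)


wfs = Interface("WFS", ("GetCapabilities", "DescribeFeatureType", "GetFeature"), "DataService")
# "PortrayalService" is an assumed label for illustration only.
wms = Interface("WMS", ("GetCapabilities", "GetMap", "GetFeatureInfo"), "PortrayalService")

geoserver = Implementation("GeoServer", [wfs, wms])

ga_service = Deployment(
    name="GA Geologic Feature Service",
    url="https://example.org/wfs",  # placeholder endpoint
    interfaces=[wfs],
    implementation=geoserver,
    datasets=["GeologicUnit"],      # hypothetical layer name
)

print([i.name for i in ga_service.interfaces])
```

Making `implementation` optional reflects the view in this thread that an endpoint is classified by its interfaces, with the software behind it as optional descriptive data.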

The key thing I was trying to disentangle was how each of these was envisioned in DPN - mostly the individual vs. subclass issue. If it is an individual, then rdf:type and other dpn:predicates (and skos:, dcat: etc) will do the semantic work. If we use sub-classing then it is rdfs:subClassOf/owl:Restriction that does the semantic work, which gives stronger reasoning, but makes it more difficult to add associative relationships (you probably know the theory better than me).

The attraction of individuals is that they can be listed in registers, and it is relatively easy to add more associations. The downside is that reasoning is bespoke.
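The trade-off can be illustrated with a toy triple set in plain Python (no RDF library). In the subclass style, a generic rdfs:subClassOf walk infers that the endpoint is a dpn:Service; in the individual style, the link to the implementation is an ordinary association that needs its own lookup. The `ex:` names, including `ex:ServiceImplementation`, `ex:implements` and `ex:implementedBy`, are hypothetical and not existing DPN-O terms.

```python
# Subclass style: GeoServer is a class, the endpoint is an instance of it.
CLASS_STYLE = {
    ("dpns:GeoServer", "rdfs:subClassOf", "dpn:Service"),
    ("ex:gaGeologicUnitService", "rdf:type", "dpns:GeoServer"),
}

# Individual style: GeoServer is an individual that could sit in a register.
INDIVIDUAL_STYLE = {
    ("ex:GeoServer", "rdf:type", "ex:ServiceImplementation"),
    ("ex:GeoServer", "ex:implements", "ex:WFS"),
    ("ex:gaGeologicUnitService", "rdf:type", "dpn:Service"),
    ("ex:gaGeologicUnitService", "ex:implementedBy", "ex:GeoServer"),
}


def types_of(node, triples):
    """Direct rdf:type values plus all superclasses reached via rdfs:subClassOf."""
    found = {o for s, p, o in triples if s == node and p == "rdf:type"}
    frontier = list(found)
    while frontier:
        cls = frontier.pop()
        for s, p, o in triples:
            if s == cls and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def objects(node, predicate, triples):
    """Bespoke lookup: follow one named association from a node."""
    return {o for s, p, o in triples if s == node and p == predicate}


# Generic reasoning does the work in the subclass style.
print(sorted(types_of("ex:gaGeologicUnitService", CLASS_STYLE)))

# In the individual style the implementation is found by a specific query.
print(sorted(objects("ex:gaGeologicUnitService", "ex:implementedBy", INDIVIDUAL_STYLE)))
```

Adding another association to `ex:GeoServer` in the individual style is just one more triple, whereas the subclass style would typically need a further restriction on the class.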

sharon-tickell commented 7 years ago

@dr-shorthair I take your point that users may like to know the particular server implementation, but the DBL doesn't. From a sysadmin's perspective, I can imagine situations in which that's a security risk (you may not want to advertise what software you use). Or alternatively, you might rather discover it than hard-code it: e.g. if you're migrating systems between, say CKAN and GeoNetwork... it's just one more thing to keep manually updated. If you have a valid use case from elsewhere then no worries though.

@jyucsiro - and to add on to my previous point.... I just discovered that GeoServer has a CSW extension. So we could conceivably require that for any GeoServers which don't have external metadata catalogs rather than needing to code support for the GeoServer REST API into the DBL :).

dr-shorthair commented 7 years ago

@sharon-tickell - yes, thanks. I've just updated my list above to say

> (e) deployments (endpoints) (e.g. the GA Geologic Feature Service) of (implementations of) interfaces

You are arguing that the endpoint must be classified by the interface, but not necessarily through the specific implementation. I agree in principle. Perhaps this is why I'm uneasy about seeing the implementations modelled as classes, rather than as instances of a (perhaps missing) class (in the title of this issue).

@jyucsiro - perhaps this discussion also relates to the role of dpn:Node and https://github.com/CSIRO-LW-LD/dpn-ontology/issues/4