geo4web-testbed / ldproxy-design

Design of the "linked data proxy"/"crawlifier proxy"

URI scheme of resources #1

Closed cportele closed 8 years ago

cportele commented 8 years ago

Current idea for the simple BAG WFS with potential groupings "woonplaats" and "postcode" is shown in the following overview:

[Image: screen shot 2015-12-08 at 10 36 43]

Known discussion items:

joostfarla commented 8 years ago

Looking at this from a RESTful perspective, filtering should be handled in the querystring. URLs would then look like this:

Collection of addresses

/adressen
/adressen?woonplaats=1664
/adressen?postcode=1234AB

Single address

/adressen/321

In the response, links to related objects can be described using a hypermedia format such as HAL or JSON API. Pagination can also be implemented this way, by providing links to the next and prev URLs.
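
As a rough illustration of the pagination links mentioned above, here is a minimal Python sketch that builds HAL-style `_links` for one page of a filtered collection. The function name `collection_links` and the `page` parameter name are assumptions made for this example; they are not taken from an existing implementation.

```python
from urllib.parse import urlencode


def collection_links(path, filters, page, last_page):
    """Build HAL-style `_links` for one page of a filtered collection."""

    def href(page_number):
        # Filters stay in the query string; the page number is appended to them.
        query = urlencode({**filters, "page": page_number})
        return {"href": f"{path}?{query}"}

    links = {"self": href(page)}
    if page > 1:
        links["prev"] = href(page - 1)
    if page < last_page:
        links["next"] = href(page + 1)
    return links


links = collection_links("/adressen", {"woonplaats": "1664"}, page=2, last_page=5)
print(links["next"]["href"])  # /adressen?woonplaats=1664&page=3
```

A client only follows `next` and `prev` and never has to construct page URLs itself.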

Some of the benefits of this approach:

If you would like to expose a list of all "woonplaatsen", this would have its own resource:

/woonplaatsen

To expose hierarchical links to all "adressen" within one "woonplaats", you have 2 options:

Provide a subresource to get a paginated list of all "adressen" in the "woonplaats":

/woonplaatsen/1664/adressen

The links in this collection's items should point to the /adressen/{id} path (i.e. there is no /woonplaatsen/{id}/adressen/{id}).

A probably more elegant way of doing this is to provide a hypermedia link in the list and/or detail response:

{
  "id": "1664",
  "name": "Amersfoort",
  "_links": {
    "adressen": {
      "href": "/adressen?woonplaats=1664"
    }
  }
}

Have a look at the Postcode API (which is also based on the BAG) for an example. Also the URI strategy of the official BAG is similar: http://bag.kadaster.nl/doc/woonplaats/1664

If yes, how to implement navigation for a human user?

I'm not sure what you mean by this question. Humans should navigate by following hyperlinks, right? Breadcrumbs (containing microdata) can be used to provide hierarchical links to the user or crawler.

Can you elaborate a bit more on the arguments to do the nested URL approach?

cportele commented 8 years ago

Thank you for the comments and your well-reasoned thoughts.

We will digest this a bit more, but let me add some initial responses to your question why we came to a different pattern.

A key requirement is that the proxy can determine its URIs automatically from any WFS with no, or only minimal, human interaction. Someone who designs an API by hand for a given dataset would very likely end up with a different design. This is a difference between topics 3 and 4, but of course the API determined by the proxy has to be useful, otherwise the approach won't work.

The logical pattern that we see would be

/
/{featuretype}
/{featuretype}/{gmlid}

Using the BAG WFS a feature would be:

/inspireadressen/inspireadressen.7078852

Assuming we configure rules that turn the names provided by the WFS into something friendlier, we might end up with something like

/adressen/7078852

In this case

/adressen

would include links to addresses.
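
The mapping from WFS names to friendlier paths could be sketched as follows. This is only an illustration: the `ALIASES` rule table and the `feature_path` function are hypothetical, and the sketch assumes that a gml:id has the form `<featuretype>.<local id>`, as in the BAG example.

```python
# Hypothetical rule table: WFS feature type name -> friendlier path segment.
ALIASES = {"inspireadressen": "adressen"}


def feature_path(feature_type, gml_id):
    """Derive an API path from a WFS feature type name and a gml:id."""
    segment = ALIASES.get(feature_type, feature_type)
    # Assumption: ids look like "<featuretype>.<local id>"; strip that prefix.
    prefix = feature_type + "."
    local_id = gml_id[len(prefix):] if gml_id.startswith(prefix) else gml_id
    return f"/{segment}/{local_id}"


print(feature_path("inspireadressen", "inspireadressen.7078852"))  # /adressen/7078852
```

Feature types without a configured alias keep their WFS name, so the proxy still works without any configuration.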

Since there are 8788868 of them, we need a paging approach. This is why we added a page query parameter:

/adressen?page={n}
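
To give a feeling for the numbers, here is a small Python sketch of the paging arithmetic. The page size of 100 is an assumption; the thread does not state one.

```python
TOTAL_FEATURES = 8788868  # number of addresses quoted above
PAGE_SIZE = 100  # assumption: no page size is given in the discussion


def page_count(total, page_size):
    """Number of pages needed, rounding up with integer arithmetic."""
    return (total + page_size - 1) // page_size


def page_url(path, n):
    """URL of page n, following the /adressen?page={n} pattern."""
    return f"{path}?page={n}"


last_page = page_count(TOTAL_FEATURES, PAGE_SIZE)
print(last_page)  # 87889
print(page_url("/adressen", last_page))  # /adressen?page=87889
```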

Since the requirement/expectation is to use JSON-LD, our current thought was to use the hydra pattern (see #2).

The other aspect is that we would also like to support humans (including developers), not just machines, in exploring a dataset in the browser. As no one will click through thousands of pages of addresses, we considered exposing subsets of all features in a feature type as persistent resources. This is why we ended up with the "nested URLs".

Having groupings on the top level like

/woonplaats/Valkenswaard

would not work as "woonplaats" might, for example, conflict with a feature type of the same name.

I agree that

/woonplaats/Valkenswaard/adressen

would be cleaner, but for the WFS proxy it is also important to find a straightforward mapping between how the data is structured in the WFS and the API, if possible. The proxy should also be simple to configure, at least enough for a short "time until the first successful use". The WFSs we are looking at in the testbed are simple and do not have many feature types, but there are others with tens or hundreds of them.

We will add additional thoughts later.

azahnen commented 8 years ago

Another point we did not consider yet is that a woonplaats or any other property might contain unsafe characters that need to be encoded. We should really avoid using these as a URL path segment.

We could assign numerical ids in the proxy if needed, but in general I would agree with @joostfarla that we should make the API more RESTful and provide the woonplaats via querystring.
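
The encoding issue can be illustrated with Python's standard `urllib.parse` module. The values below are only examples: "'s-Hertogenbosch" is a Dutch place name, and "Nieuw/Oud" is a made-up value used to show what a slash does.

```python
from urllib.parse import quote, urlencode

place = "'s-Hertogenbosch"
made_up = "Nieuw/Oud"  # hypothetical property value containing a slash

# As a query string value, encoding is handled in one well-defined place.
print(urlencode({"woonplaats": place}))  # woonplaats=%27s-Hertogenbosch

# As a path segment, the default quote() keeps "/" and silently
# creates an extra path level.
print(quote(made_up))  # Nieuw/Oud

# Every reserved character has to be encoded explicitly instead.
print(quote(made_up, safe=""))  # Nieuw%2FOud
```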

So the remaining question to me is how to get a list of all woonplaatsen, considering that woonplaats is not a top level object in the WFS, but just a property of address objects? At the moment I see three options:

  1. the initial proposal

    /adressen/woonplaats

  2. @joostfarla's proposal, which would mean making woonplaats an artificial top level object even though it is not one in the WFS

    /woonplaatsen

    Even when ignoring conflicts with feature type names that @cportele mentioned, which we could handle somehow, I think changing the role of woonplaats relative to the WFS dataset is not a good idea and would bring a bunch of new problems along with it.

  3. define query parameters that do exactly what we want, which is "give me a distinct list of values of a property of this type", e.g. something like this when using method names from Groovy collections

    /adressen?collect=woonplaats&unique=true

joostfarla commented 8 years ago

Another point we did not consider yet is that a woonplaats or any other property might contain unsafe characters that need to be encoded. We should really avoid using these as a URL path segment.

True! This also involves the risk of having duplicate IDs/names, which adds extra complexity on both sides.

So the remaining question to me is how to get a list of all woonplaatsen, considering that woonplaats is not a top level object in the WFS, but just a property of address objects?

Is it really necessary to provide a list of all "woonplaatsen", since the dataset is about "adressen"? What is the exact use case?

cportele commented 8 years ago

Is it really necessary to provide a list of all "woonplaatsen", since the dataset is about "adressen"? What is the exact use case?

To support navigation by humans through the dataset to end up at an address. We don't really have to support this, but it would be nice if the API also let users click through the dataset in a more meaningful way than going through endless pages.

azahnen commented 8 years ago

Besides us thinking it would be nice to have, this is also our approach to addressing the following point from the tender:

Indexes of spatial objects, and maybe metadata records, may also be needed based on other criteria (e.g. aggregations based on location or key attributes) in order to improve findability. The research should provide recommendations regarding the resource structure and the linking between the resources.

liekeverhelst commented 8 years ago

From the Linked Data perspective we need URIs. You are discussing URLs, right? How do you see URIs in this picture? BTW, you can have one URI that refers to a group (the resource will have a property that contains a list of the members of the group). URIs must be persistent and not dynamic, so parameters in the URI are not allowed.

azahnen commented 8 years ago

@liekeverhelst : Every URL is also a URI. URLs are just URIs that identify a resource via its network location. See e.g. http://www.w3.org/TR/uri-clarification/ .

It is also not clear to me why a URI would be less persistent when it contains parameters. That would be the case when a parameter carries state, like a session token, but not when it is used statelessly, as for filtering.

I think that is exactly how it is explained in the URI design principles at http://www.w3.org/TR/ld-bp/#HTTP-URIS:

A URI structure will not contain anything that could change

But that does not mean that it should not contain parameters at all.

azahnen commented 8 years ago

We chose option 3 for now, which means we use parameters for filtering and also to get a list of all woonplaatsen or any other property.

This seems like the least "dirty" approach. Actually partial responses are quite common and were established by Google using a "fields" parameter, see e.g.: http://googlecode.blogspot.de/2010/03/making-apis-faster-introducing-partial.html

Esri uses this as well and they also have a parameter "returnDistinctValues", which is exactly what I described in option 3: http://resources.arcgis.com/en/help/rest/apiref/query.html

The API looks like this now:

/
/inspireadressen
/inspireadressen/{id}
/inspireadressen?fields=woonplaats&distinctValues=true
/inspireadressen?woonplaats=Valkenswaard
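
How the `fields`, `distinctValues` and filter parameters could be interpreted is sketched below in Python. This is not the proxy's implementation: the function `query_features` and the sample features (including their postcodes) are made up for illustration, and a real proxy would translate the request into a WFS query instead of filtering in memory.

```python
from urllib.parse import parse_qs, urlsplit

# Made-up sample features standing in for a WFS response.
FEATURES = [
    {"id": "1", "woonplaats": "Valkenswaard", "postcode": "5551AA"},
    {"id": "2", "woonplaats": "Valkenswaard", "postcode": "5554BB"},
    {"id": "3", "woonplaats": "Amersfoort", "postcode": "3811CC"},
]


def query_features(url, features):
    """Apply equality filters, field selection and de-duplication."""
    params = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
    fields = params.pop("fields", None)
    distinct = params.pop("distinctValues", "false") == "true"

    # All remaining parameters are treated as equality filters on properties.
    rows = [f for f in features if all(f.get(k) == v for k, v in params.items())]

    if fields:
        names = fields.split(",")
        rows = [{name: f.get(name) for name in names} for f in rows]

    if distinct:
        unique = []
        for row in rows:
            if row not in unique:
                unique.append(row)
        rows = unique
    return rows


print(query_features("/inspireadressen?fields=woonplaats&distinctValues=true", FEATURES))
# [{'woonplaats': 'Valkenswaard'}, {'woonplaats': 'Amersfoort'}]
```

The same function answers `/inspireadressen?woonplaats=Valkenswaard` by returning only the two matching features.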

cportele commented 8 years ago

The following work / discussion in the W3C/OGC Spatial Data on the Web group is related.

https://github.com/w3c/sdw/blob/gh-pages/subsetting/index.md
https://lists.w3.org/Archives/Public/public-sdw-comments/2015Dec/0000.html

In our discussion there was a concern that putting woonplaats after adressen in a URI path would be misleading. This does not seem to be a concern in the linked data world.

liekeverhelst commented 8 years ago

@azahnen I missed your reply over the year-end. cc @cportele. The thing I want to stress is this: if you want to refer to an RDF resource, such as anything that we map to schema.org, then you need a unique identifier that is resolvable and dereferenceable, that is, a URL that does not change and does not depend on variables. You did this right in the proxy (which is now available) for the URLs of the HTML representations. The unique identifier of the resource in JSON-LD notation (which is currently not embedded in the HTML served by the proxy, although Microdata is) is the URL given in the @id keyword. So the web address used to access the resource (its URL) should be the same as the value of the resource's @id. Note: Microdata is not an RDF serialisation.
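
A minimal Python sketch of the point about @id: the JSON-LD document for a resource uses the URL under which it was requested as its identifier. The host `https://example.org`, the function `address_jsonld` and the street value are hypothetical; `PostalAddress` and `streetAddress` are schema.org terms.

```python
import json

BASE = "https://example.org"  # hypothetical host of the proxy


def address_jsonld(path, street):
    """Return a JSON-LD document whose @id equals the resource URL."""
    url = BASE + path
    return {
        "@context": "https://schema.org",
        "@type": "PostalAddress",
        "@id": url,  # same URL that a client dereferences
        "streetAddress": street,
    }


doc = address_jsonld("/inspireadressen/7078852", "Voorbeeldstraat 1")
print(json.dumps(doc, indent=2))
```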

dirkx commented 8 years ago

Much cleaner. Ignore my earlier email/comment.