Swirrl / datahost-prototypes

Eclipse Public License 1.0
0 stars 0 forks source link

Release property dh:hasRevision gives a list if there are multiple revisions but not with a single revision #348

Closed xdrcft8000 closed 7 months ago

xdrcft8000 commented 7 months ago

Our JSON syntax should be specific about the expected one or manyness of certain properties.

For dh:hasChange if a release has one revision you will get a single value, and if it has many you will get an array of them. If there are any then this should ideally always be an array (to reduce variance for consumers):

i.e. this:

curl -X 'GET' \
  'https://dluhc-pmd5-prototype.publishmydata.com/data/English-Indices-of-Deprivation/releases' \
  -H 'accept: application/ld+json'
...
    {
      "@type": "dh:Release",
      "dcterms:title": "2015",
      "dh:hasRevision": "https://ldapi-prototype.gss-data.org.uk/data/English-Indices-of-Deprivation/release/2015/revision/1" 
    }

...

should be this:

...
    {
      "@type": "dh:Release",
      "dcterms:title": "2015",
      "dh:hasRevision": ["https://ldapi-prototype.gss-data.org.uk/data/English-Indices-of-Deprivation/release/2015/revision/1"] 
    }

...
RickMoynihan commented 7 months ago

Suggested first step:

Write (or find) a hurl test, that creates a release with one revision in it, and another release with two revisions in it as a reproducible example of the problem.

kjoshi commented 7 months ago

Reproducible example

I've created a Hurl test for reproducing this issue here: https://github.com/Swirrl/datahost-prototypes/blob/fd800f2da939d22ac59d664e83754c124ba573aa/datahost-ld-openapi/hurl-scripts/issue-348.hurl

As expected the test currently fails, because when dh:hasRevision only contains a single item it comes through as a string rather than an array of length 1, so it can't be counted:

error: Filter Error
  --> issue-348.hurl:96:46
   |
96 | jsonpath "$.contents.[1].['dh:hasRevision']" count == 1
   |                                              ^^^^^ invalid filter input: string
   |

Relevant code

The handler function for the .../releases endpoint is here: https://github.com/Swirrl/datahost-prototypes/blob/dluhc-integration/datahost-ld-openapi/src/tpximpact/datahost/ldapi/handlers.clj#L223-L234

The triples->ld-resource-collection function uses matcha/build to turn the triples returned from db/get-releases into Clojure maps.

The difference between releases containing single and multiple revisions appears at this point. When there are multiple revisions matcha/build rolls them up into a set:

image

whereas when there is only a single revision it comes through by itself, as a value.

The sequence of matcha maps are then passed through to ->json-document (via json-ld/compact), which converts the maps into a string, using jsonista/write-value-as-string. write-value-as-string converts the set of revisions into an array, and leaves the single value as a string.

The stringified JSON document is then passed (along with the context) to the JsonDocument/of function of the Titanium JSON-LD library to convert it into a JSON-LD document (the JsonLd.compact function from that library is also used later on).

Possible solutions

  1. Force matcha/build to always put the dh:hasRevision stuff into a set? I've not used matcha before, but from a quick look this doesn't seem to be possible.

  2. Do something with jsonista It might be possible to use a custom mapper or encoder to make sure that the stringified JSON has the correct shape. Or it might be that we could replace write-value-as-string with our own map->json-string function.

  3. Use JSON-LD Framing Titanium JSON-LD appears to support framing. So it may be possible to re-shape the JSON-LD documents once they have been created.

  4. Do it ourselves Presumably it's possible to go from the Clojure document and context maps straight to a JSON-LD document, instead of doing map -> string -> JSON-LD. But I don't know enough about the JSON-LD spec to know how difficult that would be.

Summary

Option 1 doesn't seem possible. Option 2 seems like too much of a hack. Option 3 seems like more of a 'proper' way of doing it, providing we can come up with the correct frame. Option 4 might be preferable long-term, but feels like it'd be a lot more work than option 3.

RickMoynihan commented 7 months ago

Thanks @kjoshi,

I think option 3 is what I was thinking, particularly as we are using the JSON/LD libraries to build those responses.

My suspicion is based on looking at the code and some quick research in the JSON/LD playground is that we might be able to tweak this bit of the context:

https://github.com/Swirrl/datahost-prototypes/blob/8270e6c402a88696874d6d43c2d689d56cc5578c/datahost-ld-openapi/src/tpximpact/datahost/ldapi/json_ld.clj#L18-L22

To be:

{:contents {"@id" "dh:collection-contents",
                     "@container" "@set"}
 "dh:hasChange" {"@id" "dh:hasChange",
                               "@container" "@set"}}

But also I think that context file should be represented as JSON/LD as a file on disk, on the resource path, and we should just merge anything dynamic (regarding base URI) into that. There's a place holder for that file here which we should use. i.e. I'd port the statements above into that context, and merge the dynamic stuff into that data with code

My guess is, if we do that, the existing compaction/framing machinery will follow those directives and ensure that if there is ever a single value for this property it is always put into an array.