Islandora / documentation

Contains islandora's documentation and main issue queue.
MIT License
103 stars 71 forks source link

Use canonical urls for entity references when serializing jsonld #887

Closed dannylamb closed 5 years ago

dannylamb commented 6 years ago

Right now, entity references have URLs pointing to their jsonld counterparts, but we should probably be writing out the canonical URL. The jsonld is an alternate representation of the node/media/term, but the triples are describing the canonical resource, independent of any other serialization.

In other words, with triples edited for brevity,

{
   "@graph":[
      {
         "@id":"http://localhost:8000/node/1?_format=jsonld",
         "@type":[
            "http://pcdm.org/models#Object"
         ],
         ...
         "http://schema.org/author":[
            {
               "@id":"http://localhost:8000/user/1?_format=jsonld"
            }
         ],
         "http://schema.org/additionalType":[
            {
               "@id":"http://localhost:8000/taxonomy/term/8?_format=jsonld"
            },
            {
               "@id":"http://localhost:8000/taxonomy/term/1?_format=jsonld"
            }
         ]
         ...
      },
   ]
}

should become

{
   "@graph":[
      {
         "@id":"http://localhost:8000/node/1",
         "@type":[
            "http://pcdm.org/models#Object"
         ],
         ...
         "http://schema.org/author":[
            {
               "@id":"http://localhost:8000/user/1"
            }
         ],
         "http://schema.org/additionalType":[
            {
               "@id":"http://localhost:8000/taxonomy/term/8"
            },
            {
               "@id":"http://localhost:8000/taxonomy/term/1"
            }
         ]
         ...
      },
   ]
}

The change is easy enough, just delete the setRouteParameter bit in https://github.com/Islandora-CLAW/jsonld/blob/c23337950a0d310f6ae032c27557e04c171a8daf/src/Normalizer/ContentEntityNormalizer.php#L258. I'm proposing we do it there because I think this should apply to everything, and is really independent of the whole "Islandora flushing to Fedora" process.

There are downstream ramifications, though. We'll need to update Milliner to be aware of the difference. It looks for the _format=jsonld url in a handful of places, and it also uses the _format=jsonld url when indexing the resource in Gemini.

ajs6f commented 6 years ago

:+1:

On Jul 30, 2018, at 10:55 AM, dannylamb notifications@github.com wrote:

Right now, entity references have URLs pointing to their jsonld counterparts, but we should probably be writing out the canonical URL. The jsonld is an alternate representation of the node/media/term, but the triples are describing the canonical resource, independent of any other serialization.

In other words, with triples edited for brevity,

http://localhost:8080/fcrepo/rest/4f/eb/ac/c6/4febacc6-20c9-4e50-8077-30bf0e8b6da5

   ...
    schema:author          

http://localhost:8000/user/1?_format=jsonld ; schema:additionalType
http://localhost:8000/taxonomy/term/8?_format=jsonld ; schema:additionalType
http://localhost:8000/taxonomy/term/1?_format=jsonld ; ...

should become

http://localhost:8080/fcrepo/rest/4f/eb/ac/c6/4febacc6-20c9-4e50-8077-30bf0e8b6da5

   ...
    schema:author          

http://localhost:8000/user/1 ; schema:additionalType
http://localhost:8000/taxonomy/term/8 ; schema:additionalType
http://localhost:8000/taxonomy/term/1 ; ...

The change is easy enough, just delete the setRouteParameter bit in https://github.com/Islandora-CLAW/jsonld/blob/c23337950a0d310f6ae032c27557e04c171a8daf/src/Normalizer/ContentEntityNormalizer.php#L258. I'm proposing we do it there because I think this should apply to everything, and is really independent of the whole "Islandora flushing to Fedora" process.

There are downstream ramifications, though. We'll need to update Milliner to be aware of the difference. It looks for the _format=jsonld url in a handful of places, and it also uses the _format=jsonld url when indexing the resource in Gemini.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

DiegoPino commented 6 years ago

@dannylamb maybe missing a piece there so have a few questions:

So currently $url = $entity->toUrl('canonical', ['absolute' => TRUE]); gives you the canonical URL PHP Object for any entity that has one(not always the case).

The following line, the one your propose to remove, only adds the json_ld argument to the URL because it is supposed to reference the entity "as seen" in JSON-LD format too, not the webpage (with formatting, theming, etc) of the entity, it is like a wikipage v/s wikiitem thing. Not the same even if speaking about the same thing. It feels to me, semantically it is correct to use the JSON-LD representation's URL when dealing with triples between JSON-LD documents. So what is the need or why is this particular change needed?

If the difference is just the argument after the base URL, also, would it not easier, if you really need to remove it, to do that as a post process when parsing it on milliner? I can only see good reasons on keeping it as it right now, but i can be mistaken. Maybe i just need to be convinced.

dannylamb commented 6 years ago

@DiegoPino At a practical level, I don't think the data should be indexed in Gemini, the triplestore, and Fedora with ?_format=jsonld at the end of it. That can certainly be done as postprocessing, either in Milliner and the triplestore indexer. Or even as an alter.

But for me, semantically, my thinking is that the jsonld representation contains triples about the web resource, not triples about the jsonld representation of the web resource. Am I range 14'ing it here a bit?

DiegoPino commented 6 years ago

@dannylamb not sure what you mean with at a practical level, but interpret this/ guess the GET argument interferes with some preconceptions of what a valid URL should look like? Or with some processing needs? We can discuss it the URL is opaque or even semantically significant as an URL (true!) but at a w3c, rdf specs level with or without ?_format=jsonld both are totally valid IRI/URIs.

The issue i see with removing the ?_format=jsonld piece is that, not only web-semantically but also human-semantically, a published web resource, a.k.a an HTML rendered node can contain way more stuff that the entity/node described by the original json-ld. A Node can contain embedded (on display) the related media,viewers, tables of contents of other stuff, forms, comments, images, some blocks, etc.

So the real issue here is the lack of content negotiation that does not allow the same URL to output different serialization v/s display eyecandy depending on how you request the URL without having to use the argument. And that leads for me to a community question that we need to resolve: WE need a canonical web resource URL, not the one drupal provides but our own, unique, stable,consistent one.

Happens that incremental node ids (node/1) are an issue for any repository. Drupal 8 right now does not expose existing entity UUIDs as a autoload argument on its routing system (core), but i remember we got this working in CLAW on one of the first iterations some 2 years ago, so it is possible!

Since content negotiation is not build into by default but any EventSubscriberInterface derived class that is subscribed to Symfony HTTP Kernel events (like onRequest) can have access and set to the HEADER_CONTENT_TYPE and set a new response, it can (fun and easy!) generate a RedirectResponse() with a new destination URL!

So if we provide a real canonical in terms of reliable, consistent URI for our islandora resources (if there is such thing actually) then we could use content negotiation to redirect internally and generated the correct output solving this issue.

Can we open this discussion to other folks? I kinda find it fundamental to define what our canonical will be. Thanks!

seth-shaw-unlv commented 6 years ago

The way I conceive it is that the node URL is the 'thing' and should be mapped in Gemini and the triplestore; it 'happens' to return a default serialization of HTML (cause that is what people/browsers expects).

Yeah, the _format query parameter is a poor substitution for content negotiation (side-long glance at Drupal and Symphony). In addition to @DiegoPino's suggestion of generating a RedirectResponse, supposedly we could use an Apache rewrite condition to translate accept headers into the _format query parameter to plug this hole although I haven't tested it. There is a small chance Drupal will eventually support content negotiation again. Since there are a number of options that can be built on top of the base URL I wouldn't want to rely on the _format parameter for our URLs.

whikloj commented 6 years ago

Jumping in here to agree with both sides of the discussion, but to note that Drupal's use of _format is waaay hard coded in or at least a couple Drupal versions ago when I tried to work around it I had no fun and found no way to avoid it.

But Drupal != semantic datastore so perhaps we need to accept this is a lighter-weight repository and if you want a full (LDP|graph|whatever) implementation you could expose yours to the web?

dannylamb commented 6 years ago

FWIW, our REST api allows an informed client to at least find the _format routes from the canonical route. I'm not suggesting it as a general replacement for, or touting that it's better than, actual conneg, but all the available formats are advertised as "alternate" link headers on all GET and HEAD responses.

vagrant@claw:~$ curl -I http://localhost:8000/node/5
HTTP/1.1 200 OK
Date: Tue, 07 Aug 2018 14:47:22 GMT
Server: Apache/2.4.18 (Ubuntu)
X-Powered-By: PHP/7.0.30-0ubuntu0.16.04.1
Cache-Control: must-revalidate, no-cache, private
Link: <http://purl.org/coar/resource_type/c_c513>; rel="tag"; title="Image"
Link: <http://localhost:8000/media/13>; rel="related"; title="Preservation Master"
Link: <http://localhost:8000/media/14>; rel="related"; title="Service File"
Link: <http://localhost:8000/media/15>; rel="related"; title="Thumbnail"
Link: <http://localhost:8000/node/5?_format=jsonld>; rel="alternate"; type="application/ld+json"
Link: <http://localhost:8000/node/5?_format=json>; rel="alternate"; type="application/json"
Link: <http://localhost:8000/node/5>; rel="alternate"; hreflang="en"
Link: </node/5>; rel="canonical"
Link: </node/5>; rel="shortlink"
Link: </node/5>; rel="revision"
Link: </node?node=5>; rel="create"
X-Drupal-Dynamic-Cache: MISS
X-UA-Compatible: IE=edge
Content-language: en
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Vary: 
X-Generator: Drupal 8 (https://www.drupal.org)
X-Drupal-Cache: MISS
Content-Type: text/html; charset=UTF-8

And I'll throw another option onto the table. We can embed the JSONLD in the html, so that they are one in the same and don't need separate urls. Then Google would actually be reading your metadata at least.

DiegoPino commented 6 years ago

JSON-LD in the html seems like and overkill. RDF is embeded in the html as rdfa already so solved for seo/discoverability purpouses i guess. So to be honest if you all really need to get rid of the format argument for whatever purpose that is needed i’m ok with it. Would just love if that would be done as an alter of the jsonld serialization from within islandora(thay thing you wrote) and not at the serializer level, at least until drupal has content negotiation

El El mar, 7 de ago. de 2018 a las 11:07, dannylamb < notifications@github.com> escribió:

FWIW, our REST api allows an informed client to at least find the _format routes from the canonical route. I'm not suggesting it as a general replacement for, or touting that it's better than, actual conneg, but all the available formats are advertised as "alternate" link headers on all GET and HEAD responses.

vagrant@claw:~$ curl -I http://localhost:8000/node/5 HTTP/1.1 http://localhost:8000/node/5HTTP/1.1 200 OK Date: Tue, 07 Aug 2018 14:47:22 GMT Server: Apache/2.4.18 (Ubuntu) X-Powered-By: PHP/7.0.30-0ubuntu0.16.04.1 Cache-Control: must-revalidate, no-cache, private Link: http://purl.org/coar/resource_type/c_c513; rel="tag"; title="Image" Link: http://localhost:8000/media/13; rel="related"; title="Preservation Master" Link: http://localhost:8000/media/14; rel="related"; title="Service File" Link: http://localhost:8000/media/15; rel="related"; title="Thumbnail" Link: http://localhost:8000/node/5?_format=jsonld; rel="alternate"; type="application/ld+json" Link: http://localhost:8000/node/5?_format=json; rel="alternate"; type="application/json" Link: http://localhost:8000/node/5; rel="alternate"; hreflang="en" Link: </node/5>; rel="canonical" Link: </node/5>; rel="shortlink" Link: </node/5>; rel="revision" Link: </node?node=5>; rel="create" X-Drupal-Dynamic-Cache: MISS X-UA-Compatible: IE=edge Content-language: en X-Content-Type-Options: nosniff X-Frame-Options: SAMEORIGIN Expires: Sun, 19 Nov 1978 05:00:00 GMT Vary: X-Generator: Drupal 8 (https://www.drupal.org) X-Drupal-Cache: MISS Content-Type: text/html; charset=UTF-8

And I'll throw another option onto the table. We can embed the JSONLD in the html, so that they are one in the same and don't need separate urls. Then Google would actually be reading your metadata at least.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/Islandora-CLAW/CLAW/issues/887#issuecomment-411090127, or mute the thread https://github.com/notifications/unsubscribe-auth/AGn8536NI88MiM_UkzcCd0Ypn9MTcVG4ks5uOa0XgaJpZM4VmkRu .

-- Diego Pino Navarro Digital Repositories Developer Metropolitan New York Library Council (METRO)

rosiel commented 5 years ago

A reminder that this will need to be reflected in documentation of the object model/data flow.

dannylamb commented 5 years ago

Testing Instructions

Setup

Confirm Prior Behaviour

Configure new behaviour

Confirm new behaviour

dannylamb commented 5 years ago

We'll follow all of this up with a claw playbook PR to make this configuration default. That way, folks can just get the new code and not be affected and configure it if they want to. But for new users, claw-playbook is just going to do this for them.

But these PRs can get merged in without the claw-playbook PR after we confirm both before and after behaviour.

Documentation PR pending as well...

dannylamb commented 5 years ago

@rosiel doc PR is up, too.

Just waiting on a successful vagrant build to make the claw-playbook PR...

dannylamb commented 5 years ago

https://github.com/Islandora-Devops/claw-playbook/pull/106 is up and ready to test. I had to move a bunch of branches over to the foundation repos so that I could get at them with ansible. I've also moved all the PRs over to point at those branches so we're merging what we're testing in case I push up more changes.

New Testing Instructions (only confirms new behaviour, not old)

After installing from https://github.com/Islandora-Devops/claw-playbook/pull/106:

dannylamb commented 5 years ago

OK @Islandora-CLAW/committers

https://github.com/Islandora-Devops/claw-playbook/pull/106 is ready to go. I've successfully installed using it and have a box with no more ?_format=jsonld.

whikloj commented 5 years ago

I have reviewed the Crayfish, Islandora, claw-playbook and ansible-role-crayfish PRs. Someone else needs to review the JsonLD one.

seth-shaw-unlv commented 5 years ago

I just approved the JSON-LD one. Are these getting merged in any particular order?

whikloj commented 5 years ago

@seth-shaw-unlv JSON-LD needs to happen first, then a version of it should be tagged, then the Islandora PR needs to be updated to reference that new version of JSON-LD. Crayfish is in, so that PR needs to be updated to reference the new version I tagged. (0.2.0 I think) then it can go in, then a new version of ansible-role-crayfish needs to be tagged and that needs to be referenced in claw-playbook.

seth-shaw-unlv commented 5 years ago

Alrighty then, I give you JSON-LD 0.2.0.

dannylamb commented 5 years ago

@seth-shaw-unlv @whikloj I'll update the islandora pull to reference it.

dannylamb commented 5 years ago

well... and all the other stuff too.

dannylamb commented 5 years ago

https://github.com/Islandora-Devops/ansible-role-crayfish/pull/23 and https://github.com/Islandora-CLAW/islandora/pull/134 have been updated to use 0.2.0 for both jsonld and crayfish.

When those are in, I'll update https://github.com/Islandora-Devops/claw-playbook/pull/106

dannylamb commented 5 years ago

Resolved via https://github.com/Islandora-Devops/claw-playbook/commit/0036464fa6ea2c54e1bcdbb5810179658410b8f7