assemblee-virtuelle / semapps

A toolbox to create semantic web applications
https://semapps.org
Apache License 2.0

Dynamically generate JSON-LD context files based on used ontologies #1205

Closed: srosset81 closed this 11 months ago

srosset81 commented 1 year ago

Current usage of default JSON-LD context

Currently, the default JSON-LD context (passed to the LDP and ActivityPub services) is used to format (or more precisely, "frame") the raw results returned by Jena Fuseki whenever no JsonLdContext header is passed.

Developer convenience

Having properly formatted JSON-LD is a convenience for developers when they browse through an LDP container.

Instead of full URIs, they see prefixed names (this also applies to the Turtle format).

Instead of an @id entry for every URI, they see something more readable.

If unformatted JSON-LD were returned, browser extensions like Header Editor, combined with the new JsonLdContext header, could still help developers see proper formatting.
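To make the "prefixes instead of full URIs" point concrete, here is a minimal sketch of what shortening a URI with a prefix map looks like. The helper `compactIri` and the `examplePrefixes` map are hypothetical and written for illustration only; actual compaction in SemApps is done by a JSON-LD processor against the configured context, not by this function.

```typescript
// Hypothetical helper (not a SemApps API): shortens a full URI using a prefix map.
type PrefixMap = Record<string, string>;

function compactIri(iri: string, prefixMap: PrefixMap): string {
  let bestPrefix = "";
  let bestBaseUri = "";
  for (const prefix of Object.keys(prefixMap)) {
    const baseUri = prefixMap[prefix];
    // Prefer the longest matching base URI so overlapping namespaces resolve predictably.
    if (iri.indexOf(baseUri) === 0 && baseUri.length > bestBaseUri.length) {
      bestPrefix = prefix;
      bestBaseUri = baseUri;
    }
  }
  // Unknown namespaces are returned unchanged as full URIs.
  return bestBaseUri ? `${bestPrefix}:${iri.slice(bestBaseUri.length)}` : iri;
}

const examplePrefixes: PrefixMap = {
  ldp: "http://www.w3.org/ns/ldp#",
  as: "https://www.w3.org/ns/activitystreams#",
};

const shortened = compactIri("http://www.w3.org/ns/ldp#contains", examplePrefixes);
const untouched = compactIri("http://example.org/unknown", examplePrefixes);
```

With the map above, `http://www.w3.org/ns/ldp#contains` becomes `ldp:contains`, while a URI from an unregistered namespace stays as it is.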

Moleculer services

It is also useful to pass formatted data between other Moleculer services. Moleculer services can use the jsonContext parameter of the ldp.resource.get action if they want to get the results framed according to the context they want. But for Moleculer events (like ldp.resource.created), we are dependent on what the LDP service emitted.

If we used raw Fuseki results, there would at least be some consistency. Or better yet, expanded results, so that we do not depend on the formatting of a particular triple store.

More generally, in all Moleculer services, we should not treat data as JSON but as RDF, and find a library to properly process data, no matter the context used.

ActivityPub federation

That's the real problem: most ActivityPub-compatible servers treat data as JSON and don't reformat it. They generally tolerate the addition of other contexts (this is considered the proper way to create extensions), but if you pass raw JSON-LD data, they will most likely not reframe it.

This is a problem not only for activities sent between federated servers, but also potentially for resources ("objects" in ActivityPub vocabulary) that are retrieved from the LDP server. In the ActivityPub spec, it is indicated that "Implementers SHOULD include the ActivityPub context in their object definitions. Implementers MAY include additional context as appropriate.".

One solution could be to include the ActivityStreams context in the default JSON-LD context (especially when the ActivityPub service is activated) and to ignore other contexts. Or to provide a context which fits with core ontologies (like LDP), and ignore app-specific contexts.

Proposed solution

srosset81 commented 12 months ago

Finally, allowing ontologies to be registered dynamically will be needed because, for ActivityPods 2.0, we want containers to be automatically created with the prefix of the ontology. So instead of having hard-coded core ontologies, we will create a new OntologiesService that will be available to any service for registering ontologies. They will be persisted in the settings dataset. It will be possible to pass a list of ontologies to this service to register them on start. Here are the actions that will be available:

The jsonldContext passed to the register function can be a URL or an object/array. An object or array will be JSON-stringified on persistence. The register function should fail if the JSON-LD context conflicts with existing contexts.
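The two rules above (stringify objects on persistence, reject conflicting contexts) could be sketched roughly as follows. This is not the OntologiesService implementation: `serializeContext`, `collectTerms` and `registerContext` are hypothetical names, and the sketch assumes that a "conflict" means the same term being defined differently in two inline contexts. Remote contexts given as URLs are not fetched here.

```typescript
// Hypothetical sketch: none of these names come from SemApps.
type JsonLdContext = string | object;

// Value as it would be persisted: URLs unchanged, objects and arrays JSON-stringified.
function serializeContext(context: JsonLdContext): string {
  return typeof context === "string" ? context : JSON.stringify(context);
}

// Flatten inline context objects into term -> serialized definition.
// URL strings are skipped; a real check would need to fetch and inspect them.
function collectTerms(context: JsonLdContext): Record<string, string> {
  const terms: Record<string, string> = {};
  const parts: unknown[] = Array.isArray(context) ? context : [context];
  for (const part of parts) {
    if (typeof part !== "object" || part === null) continue;
    const definitions = part as Record<string, unknown>;
    for (const term of Object.keys(definitions)) {
      // Limitation: comparing serialized definitions is sensitive to key order.
      terms[term] = JSON.stringify(definitions[term]);
    }
  }
  return terms;
}

// Assumption: a conflict is the same term with a different definition.
function registerContext(known: Record<string, string>, context: JsonLdContext): string {
  const incoming = collectTerms(context);
  // Validate everything first so a rejected registration leaves `known` untouched.
  for (const term of Object.keys(incoming)) {
    const alreadyKnown = Object.prototype.hasOwnProperty.call(known, term);
    if (alreadyKnown && known[term] !== incoming[term]) {
      throw new Error(`Term "${term}" conflicts with an already registered context`);
    }
  }
  for (const term of Object.keys(incoming)) known[term] = incoming[term];
  return serializeContext(context);
}
```

Checking all terms before storing any of them keeps the registry unchanged when a registration is rejected.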

On ActivityPods, ontologies that are registered dynamically by external applications will use the prefix from prefix.cc and will not pass any OWL file or JSON-LD context, as we can do without them.

Laurin-W commented 12 months ago

Use a library like ldo to avoid being dependent on the context passed.

I like that idea - it has the positive side effect of adding typing support which reduces bugs and improves developer experience.

Warning: some contexts like https://www.w3.org/ns/activitystreams.jsonld include multiple ontologies. It is up to the developer to ensure there is no conflict between them (maybe we could do a check on startup).

I'm not sure I understand that correctly. When a term is defined several times, JSON-LD lets the later definition override the earlier ones, unless the @protected keyword is set (https://www.w3.org/TR/json-ld/#protected-term-definitions). So putting the ActivityStreams or ActivityPods context last should be fine with regard to framing AS properties. Is that what you mean?

Maybe I don't quite understand the use case of the issue yet. So the idea is that if no JsonLdContext header is passed with an LDP request, the service method getJsonLdContext() is called to generate the context? Would that require an endpoint to be generated for each iteration of the JSON-LD context (e.g. https://mypod.store/ontologies/context-XYZ.jsonld)?

And could we add a default context value for a specific resource or container which is used if no JsonLdContext header is passed? E.g. this would be convenient for AP collections and objects.

srosset81 commented 12 months ago

I'm not sure I understand that correctly. When a term is defined several times, JSON-LD lets the later definition override the earlier ones, unless the @protected keyword is set (https://www.w3.org/TR/json-ld/#protected-term-definitions). So putting the ActivityStreams or ActivityPods context last should be fine with regard to framing AS properties. Is that what you mean?

I've been writing tests for that today, and indeed it seems validation fails only when the @protected keyword is used. But I'm pretty sure I came across other kinds of conflicts when compacting JSON-LD data. I need to dig deeper into this.
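The override behaviour discussed here can be modelled with a deliberately simplified sketch. It only handles flat term definitions and a context-level or term-level `@protected` flag; it ignores remote contexts, scoped contexts and the rest of the JSON-LD 1.1 context processing algorithm, so it is an approximation rather than a conforming implementation. `mergeContexts` and the type names are hypothetical.

```typescript
// Simplified model: a term maps to an IRI string or to { "@id", "@protected" }.
type TermDefinition = string | { "@id": string; "@protected"?: boolean };
type LocalContext = { [term: string]: TermDefinition | boolean };

interface ActiveTerm {
  iri: string;
  isProtected: boolean;
}

// Processes contexts in array order: later definitions win unless the term is protected.
function mergeContexts(contexts: LocalContext[]): Record<string, string> {
  const active: Record<string, ActiveTerm> = Object.create(null);
  for (const context of contexts) {
    const contextProtected = context["@protected"] === true;
    for (const term of Object.keys(context)) {
      const raw = context[term];
      // Skip keywords such as "@protected"; they are not term definitions.
      if (term.charAt(0) === "@" || typeof raw === "boolean") continue;
      const iri = typeof raw === "string" ? raw : raw["@id"];
      const isProtected =
        typeof raw === "string" ? contextProtected : (raw["@protected"] ?? contextProtected);
      const previous = active[term];
      // Redefining a protected term with an identical IRI is tolerated; a different IRI is an error.
      if (previous && previous.isProtected && previous.iri !== iri) {
        throw new Error(`Cannot redefine protected term "${term}"`);
      }
      active[term] = { iri, isProtected: isProtected || (previous ? previous.isProtected : false) };
    }
  }
  const result: Record<string, string> = {};
  for (const term of Object.keys(active)) result[term] = active[term].iri;
  return result;
}
```

Under this model, placing the ActivityStreams context last makes its definitions win for any term that earlier contexts also define, as long as none of those earlier definitions is protected.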

Maybe I don't quite understand the use case of the issue yet. So the idea is that if no JsonLdContext header is passed with an LDP request, the service method getJsonLdContext() is called to generate the context?

Yes, exactly! In ActivityPods, this will replace the https://activitypods.org/context.json context, since that approach is not scalable.

Would that require an endpoint to be generated for each iteration of the JSON-LD context (e.g. https://mypod.store/ontologies/context-XYZ.jsonld)?

It could be interesting to provide such an endpoint (mostly for frontend apps). Not sure what path to use though. This makes me realize that every Pod should, in theory, have its own JSON-LD context, since it depends on the applications that were installed... But that's not how I went with the implementation so far (the ontologies are saved in the general settings dataset, not on the Pod). This will require some thought 🤔

And could we add a default context value for a specific resource or container which is used if no JsonLdContext header is passed? E.g. this would be convenient for AP collections and objects.

ActivityStreams will necessarily be in the core ontologies, so its context will always be included. We will use an array of contexts instead of putting everything together like we do now in the ActivityPods context file. Something like this:

  "@context": [
    "https://www.w3.org/ns/activitystreams",
    {
      "ldp": "http://www.w3.org/ns/ldp#",
      ...
    }
  ]
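As a rough illustration of how such an array could be assembled from registered ontologies, consider the sketch below. The `RegisteredOntology` shape, the `contextUrl` field and `buildContext` are assumptions made for this example; the sketch also assumes that an ontology with a remote context is referenced by URL only, while the others contribute a prefix entry to a single trailing object.

```typescript
// Hypothetical shape of a registered ontology; field names are assumptions.
interface RegisteredOntology {
  prefix: string;
  namespace: string;
  contextUrl?: string; // remote JSON-LD context, when the ontology publishes one
}

type ContextEntry = string | Record<string, string>;

function buildContext(ontologies: RegisteredOntology[]): ContextEntry[] {
  const urls: string[] = [];
  const prefixObject: Record<string, string> = {};
  for (const ontology of ontologies) {
    if (ontology.contextUrl) {
      // Reference remote contexts by URL, without duplicates.
      if (urls.indexOf(ontology.contextUrl) === -1) urls.push(ontology.contextUrl);
    } else {
      prefixObject[ontology.prefix] = ontology.namespace;
    }
  }
  const result: ContextEntry[] = [...urls];
  // Only append the inline object when it actually contains prefixes.
  if (Object.keys(prefixObject).length > 0) result.push(prefixObject);
  return result;
}

const generatedContext = buildContext([
  {
    prefix: "as",
    namespace: "https://www.w3.org/ns/activitystreams#",
    contextUrl: "https://www.w3.org/ns/activitystreams",
  },
  { prefix: "ldp", namespace: "http://www.w3.org/ns/ldp#" },
]);
```

For the two ontologies in the example, the result is the ActivityStreams URL followed by an object holding the `ldp` prefix.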

Laurin-W commented 12 months ago

Thanks for the remarks!

For a moment I was wondering whether we could get around creating custom contexts, and whether it would be a good idea for each resource to have some kind of ex:defaultJsonContext value, set from the @context field when a resource is created by a POST with content type ld+json. But this value would be unset if, for example, the resource was created with content type turtle, which brings us back to the question of which context to use.

srosset81 commented 12 months ago

[...] every Pod should, in theory, have its own JSON-LD context, since it depends on the applications that were installed... But that's not how I went with the implementation so far (the ontologies are saved in the general settings dataset, not on the Pod). This will require some thought 🤔

I see several options for storing this information:

In the last option, we should avoid persisting core ontologies and instead use the array passed to the LdpOntologiesService. This could be a good idea for the other options as well, so that we don't need to store (and maintain/migrate) triples that would be replicated in all datasets.

Laurin-W commented 12 months ago

I see several options for storing this information

From what you describe, options 2 and 3 seem the most convincing to me, since they appear to be more "transparent" from the outside about what's happening, and they are a bit more generalizable and closer to the specs.

srosset81 commented 12 months ago

I think I'm mixing too many problems. Application-defined ontologies are really needed at the moment only for the LDP containers path generation, and we don't need to have something perfect because this is not standard and we don't know if we will keep this in the long run.

The choice of the prefix is really an internal implementation matter that has little impact on the functioning of the Pod. Other implementations could use LOV or custom prefix databases (the general philosophy is that the container paths are not a problem, and we don't really care about them). However, what we need is consistency, so that if two applications use the same ontology, the same prefix will be used for their containers. That's why we need persistence, but it doesn't matter if this is all persisted in the same dataset (the settings dataset).
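The consistency requirement can be sketched as "look up once, then always reuse the persisted answer". In the sketch below, `PrefixStore` is a hypothetical in-memory stand-in for the settings dataset, and the `lookup` callback stands in for whatever source provides prefixes (prefix.cc, LOV or a custom database); no real lookup service is called.

```typescript
// Hypothetical callback: returns a prefix for an ontology URI, or undefined if none is known.
type PrefixLookup = (namespaceUri: string) => string | undefined;

// In-memory stand-in for the settings dataset (assumption, for illustration only).
class PrefixStore {
  private persisted: Record<string, string> = {};
  private lookup: PrefixLookup;

  constructor(lookup: PrefixLookup) {
    this.lookup = lookup;
  }

  prefixFor(namespaceUri: string): string {
    // A persisted prefix always wins, so every application sees the same one.
    const stored = this.persisted[namespaceUri];
    if (stored !== undefined) return stored;
    const found = this.lookup(namespaceUri);
    if (found === undefined) throw new Error(`No prefix found for ${namespaceUri}`);
    this.persisted[namespaceUri] = found;
    return found;
  }
}
```

Even if the external source later returns a different prefix for the same ontology, the second application registering it gets the prefix that was persisted first, so both end up with the same container path.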

What we also want is clean contexts that explicitly include the ActivityStreams context. If we put all the properties directly in the context, it will add a big ugly header and increase the response size. So a solution could be to put all these custom context properties in a pod-provider-level context file (accessible via GET), like we do on other SemApps instances with the /context.json file, except that it will be dynamically generated.

I'm also in the process of splitting the ontologies service and adding a new jsonld.context service, so the result will be a bit different from the proposal above.