Dynamically generate JSON-LD context files based on used ontologies

Current usage of default JSON-LD context

Currently, the default JSON-LD context (passed to the LDP and ActivityPub services) is used to format (or more precisely, "frame") the raw results returned by Jena Fuseki whenever no JsonLdContext header is passed.

Developer convenience

Having properly formatted JSON-LD is a convenience for developers when they browse through an LDP container.

Instead of full URIs, they see prefixes (this also applies to the Turtle format).

Instead of an @id object for every URI, they see something more readable.

If unformatted JSON-LD were returned, browser extensions like Header Editor, combined with the new JsonLdContext header, could still help developers see properly formatted results.

Moleculer services

It is also useful to pass formatted data between other Moleculer services. Moleculer services can use the jsonContext parameter of the ldp.resource.get action if they want to get the results framed according to the context they want. But for Moleculer events (like ldp.resource.created), we are dependent on what the LDP service emitted.

If we used raw Fuseki results, there would also be some consistency. Or better yet, expanded results, so that we do not depend on the formatting of a particular triple store.

More generally, in all Moleculer services, we should not treat data as JSON but as RDF, and find a library to properly process data, no matter the context used.

ActivityPub federation

That's the real problem: most ActivityPub-compatible servers treat data as JSON and don't reformat it. They generally tolerate the addition of other contexts (this is considered the proper way to create extensions), but if you pass raw JSON-LD data, they will most likely not reframe it.

This is a problem not only for activities sent between federated servers, but also potentially for resources ("objects" in ActivityPub vocabulary) that are retrieved from the LDP server. The ActivityPub spec states that "Implementers SHOULD include the ActivityPub context in their object definitions. Implementers MAY include additional context as appropriate."

One solution could be to include the ActivityStreams context in the default JSON-LD context (especially when the ActivityPub service is activated) and to ignore other contexts. Or to provide a context which fits with core ontologies (like LDP), and ignore app-specific contexts.

Proposed solution

- Allow defining core ontologies
  - For ActivityPods, this will not include app-specific ontologies like PAIR
  - Applications will be able to use the JsonLdContext header to get the format they need.
- Add a jsonldContext field to the ontologies definition
  - Must it be a URL, or can it also be an object?
  - Warning: some contexts, like https://www.w3.org/ns/activitystreams.jsonld, include multiple ontologies. It is up to the developer to ensure there is no conflict between them (maybe we could do a check on startup)
- Build the default context from the provided ontologies (no more need to pass a jsonContext param to LdpService)
- Split ontologies per file (put them in a package?)
- Use a library like ldo to avoid depending on the context passed.
- Preload core ontologies in the JsonLdService
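
To make the "build the default context from the provided ontologies" step more concrete, here is a minimal sketch. It assumes each ontology definition carries a `prefix`, a `url` and an optional `jsonldContext` (only that last field name comes from the proposal). The `buildDefaultContext` helper and the sample data are hypothetical, not existing SemApps code.

```javascript
// Hypothetical ontology definitions. Only the `jsonldContext` field name comes
// from the proposal; `prefix`, `url` and the sample values are assumptions.
const coreOntologies = [
  {
    prefix: 'as',
    url: 'https://www.w3.org/ns/activitystreams#',
    jsonldContext: 'https://www.w3.org/ns/activitystreams'
  },
  { prefix: 'ldp', url: 'http://www.w3.org/ns/ldp#' },
  {
    prefix: 'foaf',
    url: 'http://xmlns.com/foaf/0.1/',
    jsonldContext: {
      foaf: 'http://xmlns.com/foaf/0.1/',
      knows: { '@id': 'foaf:knows', '@type': '@id' }
    }
  }
];

// Builds an array-style JSON-LD context: context URLs first, then a single
// object holding prefixes and inline term definitions.
// Note: grouping URLs before the inline object changes the processing order
// compared to the registration order, which is a simplification.
function buildDefaultContext(ontologies) {
  const remoteContexts = [];
  const localContext = {};
  for (const ontology of ontologies) {
    const ctx = ontology.jsonldContext;
    if (ctx === undefined || ctx === null) {
      // No JSON-LD context available: fall back to the bare prefix.
      localContext[ontology.prefix] = ontology.url;
      continue;
    }
    // A context may be a URL, an object, or an array mixing both.
    for (const part of Array.isArray(ctx) ? ctx : [ctx]) {
      if (typeof part === 'string') remoteContexts.push(part);
      else Object.assign(localContext, part);
    }
  }
  return Object.keys(localContext).length > 0
    ? [...remoteContexts, localContext]
    : remoteContexts;
}

const defaultContext = buildDefaultContext(coreOntologies);
console.log(JSON.stringify(defaultContext, null, 2));
```

Keeping context URLs as separate array entries avoids inlining large remote contexts such as ActivityStreams into every response.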

Finally, registering ontologies dynamically will be needed because, for ActivityPods 2.0, we want containers to be automatically created with the prefix of the ontology. So instead of having hard-coded core ontologies, we will create a new OntologiesService that will be available to any service that needs to register ontologies. They will be persisted in the settings dataset. It will be possible to pass a list of ontologies to this service to register them on start. Here are the actions that will be available:

- register(prefix, url, owl, jsonldContext, overwrite=false): Registers a new ontology. On start, this action will be called for all core ontologies. If overwrite is false, the action will return an error if an ontology with this prefix/URL already exists; otherwise it will overwrite it.
- findPrefix(url): Returns the prefix, based on the prefix.cc API (see https://github.com/assemblee-virtuelle/activitypods/issues/128). If not found on prefix.cc, returns nothing.
- list(): Returns an array of registered ontologies (cached)
- get(prefix): Returns a single ontology based on the prefix, or nothing if no ontology matches (cached)
- getRdfPrefixes(): Returns the list of ontologies to be used in SPARQL queries (cached)
- getJsonLdContext(): Returns the JSON-LD context. It puts together the jsonldContext of the ontologies when available, and otherwise just adds the prefix of the ontology. (cached)
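
The register/get/list/getRdfPrefixes actions could behave roughly as in the sketch below. This is a plain in-memory class, not a Moleculer service, with no persistence or caching. The class name `OntologyRegistry` is hypothetical, and rendering `getRdfPrefixes()` as SPARQL `PREFIX` lines is an assumption about what "to be used in SPARQL queries" means.

```javascript
// Hypothetical in-memory stand-in for the proposed OntologiesService.
class OntologyRegistry {
  constructor() {
    this.ontologies = new Map(); // prefix -> ontology definition
  }

  // Fails on an existing prefix/URL unless `overwrite` is true.
  register({ prefix, url, owl, jsonldContext, overwrite = false }) {
    const clashes = this.list().filter(o => o.prefix === prefix || o.url === url);
    if (clashes.length > 0 && !overwrite) {
      throw new Error(`An ontology with prefix "${prefix}" or URL ${url} already exists`);
    }
    // With overwrite=true, drop every clashing entry before storing the new one.
    for (const clash of clashes) this.ontologies.delete(clash.prefix);
    this.ontologies.set(prefix, { prefix, url, owl, jsonldContext });
  }

  // Returns a single ontology, or undefined if none matches.
  get(prefix) {
    return this.ontologies.get(prefix);
  }

  // Returns all registered ontologies, in registration order.
  list() {
    return [...this.ontologies.values()];
  }

  // Assumption: prefixes are rendered as a SPARQL prologue.
  getRdfPrefixes() {
    return this.list()
      .map(o => `PREFIX ${o.prefix}: <${o.url}>`)
      .join('\n');
  }
}

const registry = new OntologyRegistry();
registry.register({ prefix: 'ldp', url: 'http://www.w3.org/ns/ldp#' });
registry.register({ prefix: 'foaf', url: 'http://xmlns.com/foaf/0.1/' });
console.log(registry.getRdfPrefixes());
```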

The jsonldContext passed to the register function can be a URL or an object/array. An object/array will be JSON-stringified on persistence. The register function should fail if the JSON-LD context is in conflict with existing contexts.
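
As a rough sketch of these two rules, the helpers below stringify non-URL contexts and list the terms that two inline contexts define differently. Both function names are hypothetical. The conflict check only compares plain object contexts: URL contexts would have to be fetched first, and the comparison is sensitive to key order.

```javascript
// URLs are stored as-is; objects and arrays are JSON-stringified.
function toPersistedValue(jsonldContext) {
  return typeof jsonldContext === 'string'
    ? jsonldContext
    : JSON.stringify(jsonldContext);
}

// Returns the terms defined in both contexts with different definitions.
// Simplification: definitions are compared through their JSON serialization,
// so { a: 1, b: 2 } and { b: 2, a: 1 } would count as different.
function findContextConflicts(existingContext, newContext) {
  return Object.keys(newContext).filter(
    term =>
      Object.prototype.hasOwnProperty.call(existingContext, term) &&
      JSON.stringify(existingContext[term]) !== JSON.stringify(newContext[term])
  );
}

// Sample data (assumptions, for illustration only).
const existing = { ldp: 'http://www.w3.org/ns/ldp#', name: 'http://xmlns.com/foaf/0.1/name' };
const candidate = { ldp: 'http://www.w3.org/ns/ldp#', name: 'http://schema.org/name' };

console.log(findContextConflicts(existing, candidate));
console.log(toPersistedValue(candidate));
```

A register implementation could reject the new ontology whenever the returned list is not empty.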

On ActivityPods, ontologies that are registered dynamically by external applications will use the prefix from prefix.cc and will not pass any OWL file or JSON-LD context, as we can do without them.

Use a library like ldo to avoid depending on the context passed.

I like that idea - it has the positive side effect of adding typing support which reduces bugs and improves developer experience.

Warning: some contexts, like https://www.w3.org/ns/activitystreams.jsonld, include multiple ontologies. It is up to the developer to ensure there is no conflict between them (maybe we could do a check on startup)

I'm not sure I understand that correctly. In JSON-LD, when a term is defined several times, the most recent definition overrides the earlier ones, unless the @protected keyword is set (https://www.w3.org/TR/json-ld/#protected-term-definitions). So putting the ActivityStreams or ActivityPods context last should be fine with regard to framing AS properties. Is that what you mean?

Maybe I don't quite understand the use case of the issue yet. So the idea is that if no JsonLdContext header is passed with an LDP request, the service method getJsonLdContext() is called to generate the context? Would that require an endpoint to be generated for each iteration of the JSON-LD context (e.g. https://mypod.store/ontologies/context-XYZ.jsonld)?

And could we add a default context value for a specific resource or container which is used if no JsonLdContext header is passed? E.g. this would be convenient for AP collections and objects.

I'm not sure I understand that correctly. In JSON-LD, when a term is defined several times, the most recent definition overrides the earlier ones, unless the @protected keyword is set (https://www.w3.org/TR/json-ld/#protected-term-definitions). So putting the ActivityStreams or ActivityPods context last should be fine with regard to framing AS properties. Is that what you mean?

I've been writing tests for that today, and indeed it seems validation fails only when the @protected keyword is used. But I'm pretty sure I came across other kinds of conflicts when compacting JSON-LD data. I need to dig deeper into this.
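
The override behaviour discussed here can be modelled with the simplified sketch below. It is not a JSON-LD processor: the hypothetical `mergeContexts` helper only handles inline object contexts, ignores keywords such as @vocab, and compares definitions through their JSON serialization (so "as:name" and { "@id": "as:name" } count as different, although a real processor treats them alike).

```javascript
// Removes the @protected flag so that definitions can be compared.
function stripProtected(definition) {
  if (definition === null || typeof definition !== 'object') return definition;
  const { '@protected': ignored, ...rest } = definition;
  return rest;
}

// Processes contexts in order: later definitions override earlier ones,
// except that a protected term may only be redefined identically.
function mergeContexts(contexts) {
  const active = {};
  const protectedTerms = new Set();
  for (const context of contexts) {
    const contextProtected = context['@protected'] === true;
    for (const [term, definition] of Object.entries(context)) {
      if (term.startsWith('@')) continue; // keywords are ignored in this sketch
      const identical =
        JSON.stringify(stripProtected(active[term])) ===
        JSON.stringify(stripProtected(definition));
      if (protectedTerms.has(term) && !identical) {
        throw new Error(`protected term redefinition: ${term}`);
      }
      active[term] = definition;
      // A term-level @protected flag takes precedence over the context-level one.
      const hasOwnFlag =
        definition !== null && typeof definition === 'object' && '@protected' in definition;
      const isProtected = hasOwnFlag ? definition['@protected'] === true : contextProtected;
      if (isProtected) protectedTerms.add(term);
    }
  }
  return active;
}

// Later definition wins when nothing is protected.
console.log(mergeContexts([{ name: 'http://a.example/name' }, { name: 'http://b.example/name' }]));

// Redefining a protected term with a different IRI throws.
try {
  mergeContexts([
    { '@protected': true, name: 'http://a.example/name' },
    { name: 'http://b.example/name' }
  ]);
} catch (error) {
  console.log(error.message);
}
```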

Maybe I don't quite understand the use case of the issue yet. So the idea is that if no JsonLdContext header is passed with an LDP request, the service method getJsonLdContext() is called to generate the context?

Yes, exactly! In ActivityPods, this will replace the https://activitypods.org/context.json context, since that approach is not scalable.

Would that require an endpoint to be generated for each iteration of the JSON-LD context (e.g. https://mypod.store/ontologies/context-XYZ.jsonld)?

It could be interesting to provide such an endpoint (mostly for frontend apps). Not sure what path to use though. This makes me realize that every Pod should, in theory, have its own JSON-LD context, since it depends on the applications that were installed... But that's not how I went with the implementation so far (the ontologies are saved in the general settings dataset, not on the Pod). This will require some thought 🤔

And could we add a default context value for a specific resource or container which is used if no JsonLdContext header is passed? E.g. this would be convenient for AP collections and objects.

ActivityStreams will necessarily be in the core ontologies, so its context will always be included. We will use an array of contexts instead of putting everything together like we do now in the ActivityPods context file. Something like this:

  "@context": [
    "https://www.w3.org/ns/activitystreams",
    {
      "ldp": "http://www.w3.org/ns/ldp#",
      ...
    }
  ]

Thanks for the remarks!

For a moment I was wondering whether we could get around creating custom contexts, and whether it would be a good idea for each resource to have some kind of ex:defaultJsonContext value, set from the @context field when a resource is created through a POST with content type ld+json. But this value would be unset if, for example, the resource was created with content type Turtle, which brings us back to the question of which context to use.

[...] every Pod should, in theory, have its own JSON-LD context, since it depends on the applications that were installed... But that's not how I went with the implementation so far (the ontologies are saved in the general settings dataset, not on the Pod). This will require some thought 🤔

I see several options for storing this information:

1. In the user's dataset instead of the settings dataset, keeping the urn:-type link. It will only be accessible through SPARQL with a webId system, but that seems OK as it is something internal.
2. On a custom resource linked to the Application Registration. So if the application unregisters, it will also remove the ontology. We could store only the prefix, as we don't want to bother with JSON-LD contexts for now, but it could be extended to JSON-LD contexts later. The problem here would mostly be handling potential conflicts, either with core ontologies or with other app-specific ontologies.
3. In a pim:PreferencesFile, as this is a Solid standard. But I haven't yet found a description of how information should be stored in these files.
4. In Redis. However, until now we have avoided storing critical information in Redis, so maybe it's better to keep it that way.

With the last option, we should avoid persisting core ontologies and instead use the array passed to the LdpOntologiesService. This could be a good idea for the other options as well, so that we don't need to store (and maintain/migrate) triples that would be replicated in all datasets.

I see several options for storing this information

From what you describe, options 2 and 3 seem the most convincing to me, since they appear to be more "transparent" from the outside about what's happening, and are a bit more generalizable and closer to the specs.

I think I'm mixing too many problems. Application-defined ontologies are really needed at the moment only for the LDP containers path generation, and we don't need to have something perfect because this is not standard and we don't know if we will keep this in the long run.

The choice of the prefix is really an internal implementation matter that has little impact on the functioning of the Pod. Other implementations could use LOV or custom prefix databases (the general philosophy is that the container path is not a problem, and we don't really care about it). However, what we need is consistency, so that if two applications use the same ontology, the same prefix will be used for their containers. That's why we need persistence, but it doesn't matter if this is all persisted in the same dataset (the settings dataset).

What we also want is clean contexts that explicitly include the ActivityStreams context. If we put all the properties directly in the context, it will add a big ugly header and increase the response size. So a solution could be to put all these custom context properties in a pod-provider-level context file (accessible via GET), like we do on other SemApps instances with the /context.json file, except that it will be dynamically generated.

- Save used prefixes in the settings dataset
- Add an option to disable persistence of the ontologies (in which case the register action would be disabled).
- Generate a custom context file on the fly with all non-URI contexts.
- When we call getJsonLdContext, return the URI JSON-LD contexts (AS context...) and this custom context file.
- When we need to create a new LDP container, first check whether the ontology is registered. If not, query prefix.cc and register the ontology prefix. It will thus be added to the system-wide ontologies, so that Turtle and JSON results look nicer.
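
The steps above can be sketched as follows. Everything here is hypothetical: the registry is a plain Map, the prefix.cc query is replaced by an injected `lookupPrefix` stub (no real HTTP call is made), and the context file URL is a made-up example. The sketch also assumes each jsonldContext is either a URL string or a plain object.

```javascript
// Hypothetical in-memory registry of system-wide ontologies (prefix -> ontology).
const registry = new Map([
  ['as', {
    prefix: 'as',
    url: 'https://www.w3.org/ns/activitystreams#',
    jsonldContext: 'https://www.w3.org/ns/activitystreams'
  }],
  ['ldp', { prefix: 'ldp', url: 'http://www.w3.org/ns/ldp#' }]
]);

// Called before creating a container: reuse the registered prefix if there is
// one, otherwise ask the injected lookup (standing in for prefix.cc) and register it.
async function ensureOntologyPrefix(ontologyUrl, lookupPrefix) {
  const known = [...registry.values()].find(o => o.url === ontologyUrl);
  if (known) return known.prefix;
  const prefix = await lookupPrefix(ontologyUrl);
  if (!prefix) throw new Error(`No prefix found for ${ontologyUrl}`);
  registry.set(prefix, { prefix, url: ontologyUrl });
  return prefix;
}

// Content of the dynamically generated context file: every non-URI context.
function buildCustomContextFile() {
  const context = {};
  for (const ontology of registry.values()) {
    if (typeof ontology.jsonldContext === 'string') continue; // served by its own URL
    if (ontology.jsonldContext) Object.assign(context, ontology.jsonldContext);
    else context[ontology.prefix] = ontology.url;
  }
  return { '@context': context };
}

// Default context: the URI contexts, followed by the URL of the custom file.
function getJsonLdContext(customContextUrl) {
  const uriContexts = [...registry.values()]
    .map(o => o.jsonldContext)
    .filter(c => typeof c === 'string');
  return [...uriContexts, customContextUrl];
}

// Stub lookup with sample data, used instead of a real prefix.cc request.
const stubLookup = async url =>
  ({ 'http://virtual-assembly.org/ontologies/pair#': 'pair' })[url];

ensureOntologyPrefix('http://virtual-assembly.org/ontologies/pair#', stubLookup).then(prefix => {
  console.log(prefix);
  console.log(getJsonLdContext('https://mypod.store/context.json'));
  console.log(JSON.stringify(buildCustomContextFile(), null, 2));
});
```

Because the lookup result is registered once, two applications using the same ontology end up with the same prefix, which is the consistency requirement mentioned earlier.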

I'm also in the process of splitting the ontologies service by adding a new jsonld.context service, so the result will be a bit different from the above proposal.

assemblee-virtuelle / semapps