Extending context in JSON-LD

dauwhe / epub31-bff

Straw-man spec for browser-friendly format for EPUB31

Extending context in JSON-LD #24

Open dauwhe opened 8 years ago

dauwhe commented 8 years ago

The proposal currently states:

All extensions would have to use full IRIs since additional context definition won't be allowed

Hadrien has said (in an email thread):

I strongly believe that we should disallow the use of @context for anything else than defining our own external context.

Ivan Herman replied (in an email thread):

Well… while disallowing the usage of @vocab might indeed be acceptable (if it can be enforced), centralizing the @context the way you propose may backfire. The advantage of using JSON-LD is to provide flexibility to users or, let us say, user communities in terms of what vocabularies they would prefer to use. The bibliography terms used by the scholarly community are different from, say, the ONIX terms that trade publishers may want to use (or the schema version thereof); why not let the respective communities use what they want (beyond some universally required terms that EPUB may define, of course)?

dauwhe commented 8 years ago

Hadrien Gardeur replied (in an email thread):

My suggestion is to use full IRIs with all extensions because they provide what you say for RDF/JSON-LD compliant processors and can be easily understood by basic JSON processors too.

Main issue with letting people define their own @context, is that a number of different terms could be mapped to the same JSON key which basic JSON processors would not treat as different metadata. Full IRIs is a good extension mechanism for both situations.

dauwhe commented 8 years ago

Ivan Herman replied:

… and you take away one of the major advantages of JSON-LD, i.e., the powerful way of making terms more readable and adapting to communities. I think that would be a mistake.

HadrienGardeur commented 8 years ago

I just want to point out that namespaces were a big issue for many tools and people in EPUB 2.0.1.

Fully opening the door to JSON-LD is potentially much worse in terms of impact, since the exact same JSON key could have completely different syntax and terms associated to it.

I see the ability to extract triples through JSON-LD as more of a "bonus" than a complete rethinking of how metadata are expressed in EPUB.

Through the use of full IRIs, we get the benefit of getting something that's extensible and won't be problematic for people who treat EPUB BFF as arbitrary JSON. It's not as compact a syntax for people who want to extend EPUB metadata, but it still works fine.

iherman commented 8 years ago

On 11 Mar 2016, at 18:14, Hadrien Gardeur notifications@github.com wrote:

I just want to point out that namespaces were a big issue for many tools and people in EPUB 2.0.1.

@context is not namespace. Many users in various places are pushed back by the "A:B" type notations, and this is exactly what @context may take care of by hiding it.

Fully opening the door to JSON-LD is potentially much worse in terms of impact, since the exact same JSON key could have completely different syntax and terms associated to it.

First of all: @context is not only for that usage. It may simply include, ehem, ehem, namespaces, etc, assign datatypes to objects, etc.

As for the impact: we are not talking about the lambda user adding his/her own @context. We are talking about various communities adapting the various EPUB usages to their needs (I have referred to the ONIX versus, say, BIBO vocabulary usage differences.) Whoever will produce those @context files can take care of these. Closing the door on those communities is a bigger push back.

I see the ability to extract triples through JSON-LD as more of a "bonus" than a complete rethinking of how metadata are expressed in EPUB.

I do not understand this remark. Nor do I understand the relevance.

Through the use of full IRIs, we get the benefit of getting something that's extensible and won't be problematic for people who treat EPUB BFF as arbitrary JSON. It's not as compact a syntax for people who want to extend EPUB metadata, but it still works fine.

Usage of full URI-s is incredibly error prone, knowing the complex URI-s that are sometimes in use, and, in my experience, many people hate to use them/type them all the time.

HadrienGardeur commented 8 years ago

I have serious doubts that everyone will look at all the existing context potentially used in EPUB if we open up that door (especially if you can define a local context). Which means that we'll most likely end up with common terms like "series" being defined differently by various groups.

@iherman you defend the use of other context, but I haven't yet seen a proposal on your side to mitigate some of the potential problems that I've mentioned before. Full IRIs and a single context are one way to deal with it, but there are of course other ways. We could for example have a registry of external context and vet them somehow which would mitigate this problem quite a bit. What do you have in mind?

IMO, local context are a potential nightmare for some of the reasons listed above, but I'm open to other alternatives for external context.

dauwhe commented 8 years ago

Hadrien wrote:

Main issue with letting people define their own @context, is that a number of different terms could be mapped to the same JSON key which basic JSON processors would not treat as different metadata.

Here's a very simple document:

{
  "foo": "spam",
  "bar": "eggs"
}

We can add a context:

{
  "@context": {
    "foo": "http://www.example.com/foo",
    "bar": "http://www.example.com/bar"
  },
  "foo": "spam",
  "bar": "eggs"
}

which a JSON-LD processor will render:

{
  "http://www.example.com/bar": "eggs",
  "http://www.example.com/foo": "spam"
}

If we do something bad with the context:

{
  "@context": {
    "foo": "http://www.example.com/foo",
    "bar": "http://www.example.com/foo"
  },
  "foo": "spam",
  "bar": "eggs"
}

A JSON-LD processor will of course see this very differently:

{
  "http://www.example.com/foo": [
    "eggs",
    "spam"
  ]
}

Is this what you're concerned about?

Dave
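
The collision in Dave's example can be reproduced with a few lines of Python. This is only a sketch of the key-to-IRI renaming step, assuming a flat document whose inline @context maps terms to IRI strings; it is not a JSON-LD processor, and the helper name `rename_keys` is made up for illustration.

```python
def rename_keys(doc):
    """Replace each key with the IRI its inline @context maps it to.

    Simplified on purpose: flat documents only, string-valued term
    definitions only. A real JSON-LD processor does far more.
    """
    context = doc.get("@context", {})
    merged = {}
    for key, value in doc.items():
        if key == "@context":
            continue
        # Assumption: unmapped keys are kept as-is. A real processor
        # would drop terms it cannot expand to an IRI.
        iri = context.get(key, key)
        merged.setdefault(iri, []).append(value)
    # Single values stay scalar; merged values are sorted so the
    # result matches the listing in the comment above.
    return {
        iri: values[0] if len(values) == 1 else sorted(values)
        for iri, values in merged.items()
    }


good = {
    "@context": {
        "foo": "http://www.example.com/foo",
        "bar": "http://www.example.com/bar",
    },
    "foo": "spam",
    "bar": "eggs",
}

bad = {
    "@context": {
        "foo": "http://www.example.com/foo",
        "bar": "http://www.example.com/foo",
    },
    "foo": "spam",
    "bar": "eggs",
}

# What a plain JSON consumer sees: two unrelated keys in both cases.
plain_view = {k: v for k, v in bad.items() if k != "@context"}

print(rename_keys(good))
print(rename_keys(bad))
print(plain_view)
```

The plain JSON view is identical for both documents, which is the gap between JSON-LD-aware and JSON-only consumers that the thread is debating.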

HadrienGardeur commented 8 years ago

Well that's one example but not the only one. What you describe can be a big issue if people are allowed to define local context, since they could make the situation pretty bad. For example, one could overload all the terms that we've defined in metadata and redefine them as something completely different.

But what I was thinking about is a little different: what happens when multiple groups end up using the same JSON key but with a different syntax or semantics associated with that key.

Let's say that we have three groups defining "series". Group A says that the "series" key is just a literal, then group B says that "series" is to provide a position in the series, while group C says that "series" MUST be an object.

We end up with the following three examples:

{
  "series": "Harry Potter"
}

{
  "series": 2
}

{
  "series": {
    "name": "Harry Potter",
    "position": 2
  }
}

As long as you're using a system that understands JSON-LD properly, it shouldn't be too much of an issue. But for an app that treats those as arbitrary JSON it could be very confusing.

If group C is not capable of truly supporting JSON-LD in their app, they might have a pretty big surprise when they encounter group A's or B's usage of that key.
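
To make the "surprise" concrete, here is a small Python sketch of a plain-JSON consumer meeting the three shapes above. The function names are hypothetical, and the mapping of shapes to groups A, B and C simply follows the example; nothing here is part of any EPUB BFF proposal.

```python
def naive_series_name(metadata):
    """What an app written only for group C's object form might do."""
    return metadata["series"]["name"]


def read_series(metadata):
    """Return (name, position) from a 'series' value of unknown shape."""
    value = metadata.get("series")
    if isinstance(value, dict):
        # Group C: an object carrying both name and position.
        return value.get("name"), value.get("position")
    if isinstance(value, bool):
        # bool is a subclass of int in Python; treat it as unusable.
        return None, None
    if isinstance(value, int):
        # Group B: only the position in the series.
        return None, value
    if isinstance(value, str):
        # Group A: only the series name, as a literal.
        return value, None
    return None, None


examples = [
    {"series": "Harry Potter"},
    {"series": 2},
    {"series": {"name": "Harry Potter", "position": 2}},
]

for example in examples:
    print(read_series(example))
```

The naive reader raises a `TypeError` on the first two documents, while the defensive reader has to guess at semantics from the value's type, which is exactly the burden on JSON-only apps being described.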

iherman commented 8 years ago

On 11 Mar 2016, at 18:35, Hadrien Gardeur notifications@github.com wrote:

I have serious doubts that everyone will look at all the existing @context potentially used in EPUB if we open up that door (especially if you can define a local @context). Which means that we'll most likely end up with common terms like "series" being defined differently by various groups.

@iherman you defend the use of other context, but I haven't yet seen a proposal on your side to mitigate some of the potential problems that I've mentioned before. Full IRIs and a single context are one way to deal with it, but there are of course other ways. We could for example have a registry of external context and vet them somehow which would mitigate this problem quite a bit. What do you have in mind?

It is not a matter of a specific 'proposal'; I am afraid it is a matter of different appreciation of the problem. I obviously understand the issue and I acknowledge that the risk exists of having two different communities using the same terms and thereby ending up in a term confusion. I just do not consider that a danger so big and so serious that it would justify throwing away the flexibility offered by the usage of @context-s and imposing an unfriendly approach instead.

I repeat myself, I believe: we are not talking about people randomly adding various contexts and vocabularies to their metadata. As far as I can see, the industry has the habit of 'centralizing' the vocabularies in use anyway, and smaller publishers will seek to use the same vocabularies that bigger publishers or well defined communities use (this is a bit different than many of the data publisher communities out there, which seems to be much more distributed and diffuse). In other words, I expect, realistically, 5-6 different publishing communities coming up with, or reusing, their own vocabularies, let that be education, scholarly communications, manga, children's book, magazines, legal publications, you name it. IDPF defining a 'core' set of terms (with an IDPF "@context"), and then these communities adding their own terms using their own terms with, say, education specific context, seems to me the most flexible and user friendly way of moving forward. Forcing the users of educational publishers to use long URI-s for their specific terms seems to be utterly unfriendly and actually extremely error prone to me (it is very easy to misspell a long URL…).

You referred to the registration of "@context". I am a little bit concerned of an over-administered process that often makes registration procedures heavy. But having a list of vocabularies and contexts that communities use, published at some well known place, will also mitigate the danger. (W3C maintains these days a http://www.w3.org/ns/ directory to host namespace documents and, increasingly, JSON-LD contexts; something like that can be set up.)

Mistakes will be made, and some publications will have problems. The Web has proven to be pretty resilient to similar problems so far… why would that be different?

I am afraid we have to agree that we disagree.

HadrienGardeur commented 8 years ago

I'm not sure that we entirely disagree Ivan.

You said:

In other words, I expect, realistically, 5-6 different publishing communities coming up with, or reusing, their own vocabularies, let that be education, scholarly communications, manga, children's book, magazines, legal publications, you name it. IDPF defining a 'core' set of terms (with an IDPF "@context"), and then these communities adding their own terms using their own terms with, say, education specific context, seems to me the most flexible and user friendly way of moving forward.

I previously suggested that instead of having a single external context, we could create an IDPF registry that would list multiple external contexts that would all be allowed in EPUB BFF. This doesn't have to be over-administered, as such registries are not part of the specification itself, we could make sure that it's easy enough to update them.

This is IMO a way of balancing things out between your concerns and mine:

IDPF has a main external context for EPUB BFF that covers roughly the equivalent of EPUB 3.0.1 and maybe some modules
other communities can create their own context and have it registered at an official IDPF registry
anyone can also extend the metadata by using full IRIs
local context definition or external context using a URI not listed in the IDPF registry are forbidden
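
The four rules above could be checked mechanically along these lines. This Python sketch assumes the registry is available as a plain set of context URLs; the URLs and the `check_context` name are invented for illustration, since no IDPF registry exists in this thread.

```python
# Hypothetical registry contents; these URLs are placeholders.
REGISTERED_CONTEXTS = {
    "https://example.org/epub-bff/context.jsonld",
    "https://example.org/education/context.jsonld",
}


def check_context(doc, registry=REGISTERED_CONTEXTS):
    """Return a list of problems found in the document's @context.

    Rules sketched here: external contexts must be listed in the
    registry, and local (inline) context definitions are forbidden.
    Keys written as full IRIs need no context, so nothing is checked
    for them.
    """
    problems = []
    context = doc.get("@context", [])
    # @context may be a single entry or a list of entries.
    entries = context if isinstance(context, list) else [context]
    for entry in entries:
        if isinstance(entry, str):
            if entry not in registry:
                problems.append(f"unregistered external context: {entry}")
        else:
            problems.append("local context definition is not allowed")
    return problems


ok_doc = {
    "@context": "https://example.org/epub-bff/context.jsonld",
    "title": "Moby-Dick",
    "http://example.com/vocab#shelf": "classics",  # full-IRI extension
}

mixed_doc = {
    "@context": [
        "https://example.org/epub-bff/context.jsonld",
        "https://unknown.example.net/context.jsonld",
        {"series": "http://example.com/vocab#series"},
    ],
    "title": "Moby-Dick",
}

print(check_context(ok_doc))
print(check_context(mixed_doc))
```

Note that this only inspects the @context value itself, so it does not need a JSON-LD parser; detecting term overlaps between registered contexts would be a separate step.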

iherman commented 8 years ago

On 14 Mar 2016, at 12:30, Hadrien Gardeur notifications@github.com wrote:

I'm not sure that we entirely disagree Ivan.

Not entirely...

You said:

In other words, I expect, realistically, 5-6 different publishing communities coming up with, or reusing, their own vocabularies, let that be education, scholarly communications, manga, children's book, magazines, legal publications, you name it. IDPF defining a 'core' set of terms (with an IDPF "@context"), and then these communities adding their own terms using their own terms with, say, education specific context, seems to me the most flexible and user friendly way of moving forward.

I previously suggested that instead of having a single external context, we could create an IDPF registry that would list multiple external contexts that would all be allowed in EPUB BFF. This doesn't have to be over-administered, as such registries are not part of the specification itself, we could make sure that it's easy enough to update them.

This is IMO a way of balancing things out between your concerns and mine:

IDPF has a main external context for EPUB BFF that covers roughly the equivalent of EPUB 3.0.1 and maybe some modules
other communities can create their own context and have it registered at an official IDPF registry
anyone can also extend the metadata by using full IRIs
local context definition or external context using a URI not listed in the IDPF registry are forbidden

I think this is the only point where I disagree, at least to some extent.

To 'forbid' external contexts using a URI makes sense only if there is an efficient means to check this. Unless we build a full JSON-LD parser into the EPUB checker, that is difficult to do. So, in standard parlance, I would say "SHOULD NOT" and not "MUST NOT". That is a big difference.

(B.t.w., if we really have a smart checker, which is managing JSON-LD, that checker could also check whether there are overlaps in terms and issue either an error or a warning!)

I would go even further with the local context definition (i.e., a context without a URI). That is very local to the data, and I can see very genuine usages. For example, a local @context could define prefixes, allowing the data to use CURIE-s instead of full URI-s for the local extensions. It could set language preferences for some tags, which is certainly very useful. In other words, I would allow local @context definitions, making it clear that users/publishers have to be very careful with what they do, and stating which features SHOULD NOT be used in that local context.
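
As an illustration of the prefix use case, the sketch below shows a local @context that sets a default language and defines one prefix, plus a simplified CURIE expansion. The `edu` prefix, its IRI and the `expand_curie` helper are assumptions made for this example; `@language` is the standard JSON-LD keyword for a default language.

```python
def expand_curie(key, context):
    """Expand 'prefix:suffix' using a prefix defined in the context.

    Simplified: only string-valued prefix definitions are handled.
    Keys without a known prefix (plain terms, full IRIs) are returned
    unchanged.
    """
    prefix, sep, suffix = key.partition(":")
    base = context.get(prefix)
    # A suffix starting with '//' signals an absolute IRI such as
    # 'http://...', which must not be treated as a CURIE.
    if sep and isinstance(base, str) and not suffix.startswith("//"):
        return base + suffix
    return key


doc = {
    "@context": {
        "@language": "fr",  # default language for string values
        "edu": "https://example.org/education/terms#",  # hypothetical prefix
    },
    "edu:gradeLevel": "5",
    "https://example.org/other#shelf": "classics",
}

for key in doc:
    if key != "@context":
        print(key, "->", expand_curie(key, doc["@context"]))
```

The publisher types `edu:gradeLevel` instead of the full IRI, while a JSON-LD-aware consumer can still recover it. A JSON-only consumer, however, sees the literal key `edu:gradeLevel`, which is part of what the restriction on local contexts is meant to avoid.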

HadrienGardeur commented 8 years ago

The current EPUB check will need to be updated in some pretty major ways to support EPUB BFF, full support of JSON-LD could be part of it.

For overlaps in terms, it's not that simple since you won't necessarily see them on a single publication. Having a controlled list of context in a registry makes things much easier in that regard, since you could check against them.

For local context definition, I understand some of the use cases that you're pointing to, but it still feels like a "nuclear bomb" threat to me given what could go wrong. I don't think that a list of SHOULD NOT is enough, we should either forbid it or be very restrictive in what it's allowed for.

iherman commented 8 years ago

On 14 Mar 2016, at 13:06, Hadrien Gardeur notifications@github.com wrote:

The current EPUB check will need to be updated in some pretty major ways to support EPUB BFF, full support of JSON-LD could be part of it.

Well, if so, a JSON-LD based checker can check the issues that you are referring to by comparing the content of @context files, so your problem would be solved! :-)

For overlaps in terms, it's not that simple since you won't necessarily see them on a single publication. Having a controlled list of context in a registry makes things much easier in that regard, since you could check against them.

For local context definition, I understand some of the use cases that you're pointing to, but it still feels like a "nuclear bomb" threat to me given what could go wrong. I don't think that a list of SHOULD NOT is enough, we should either forbid it or be very restrictive in what it's allowed for.

As I said: we do disagree on some points. In practice, I think this is not a nuclear bomb, maybe just a firecracker...

HadrienGardeur commented 8 years ago

It would be solved only if it has a list of well-known context to compare it with (hence the need for a registry).

iherman commented 8 years ago

On 14 Mar 2016, at 19:12, Hadrien Gardeur notifications@github.com wrote:

It would be solved only if it has a list of well-known context to compare it with (hence the need for a registry).

The system could check whether the data includes two different context files with overlapping terms. If the data is, in this sense, consistent with itself, then there is no real issue...
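
A self-consistency check of this kind might look like the following Python sketch. It assumes the context files have already been fetched and parsed into dicts, and it compares term definitions by simple equality; the function name and the example IRIs are hypothetical.

```python
def conflicting_terms(*contexts):
    """Return terms that the given contexts define in different ways.

    Each context is a dict of term -> definition, as found inside an
    @context object. Keywords such as '@language' are ignored.
    """
    definitions = {}  # term -> list of distinct definitions seen
    for context in contexts:
        for term, definition in context.items():
            if term.startswith("@"):
                continue
            seen = definitions.setdefault(term, [])
            if definition not in seen:
                seen.append(definition)
    # A term is a conflict only if more than one definition was seen.
    return {term: seen for term, seen in definitions.items() if len(seen) > 1}


core = {
    "title": "https://example.org/core#title",
    "series": "https://example.org/core#series",
}
education = {
    "series": "https://example.org/edu#series",
    "gradeLevel": "https://example.org/edu#gradeLevel",
}

print(conflicting_terms(core, education))
```

Run over the contexts used by a single publication, this covers Ivan's self-consistency case; run over every context in a registry, it covers Hadrien's cross-publication case.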

But we should stop this. We are getting to details of no real consequence.