dret / I-D

Internet Drafts I've authored or contributed to.

mandatory absolute URI for anchor #117

Closed hvdsomp closed 4 years ago

hvdsomp commented 5 years ago

For a link in the HTTP Link header, the following holds:

If the link has no anchor, then the URI of the resource that delivered the HTTP Link header (responding resource) is the anchor
If the link has an anchor and its value is a relative URI, then the URI of the resource that delivered the HTTP Link header (responding resource) is the baseURI for the anchor

Applying the above to a linkset yields:

If a link in a linkset has no anchor, then the URI of the link set resource (resource that delivered the linkset) is the anchor
If a link in a linkset has an anchor and its value is a relative URI, then the URI of the link set resource (resource that delivered the linkset) is the baseURI for the anchor
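The two rules above can be sketched in Python. This is only an illustration, not text from the I-D: the function name `effective_anchor` and the example.org URIs are hypothetical, and resolution relies on the standard library's `urljoin`.

```python
from urllib.parse import urljoin

def effective_anchor(anchor, context_uri):
    """Return the link context under the conventional Link header rules.

    anchor      -- value of the link's anchor parameter, or None if absent
    context_uri -- URI of the resource that delivered the links
    """
    if anchor is None:
        # Rule 1: no anchor, so the delivering resource is the anchor.
        return context_uri
    # Rule 2: a relative anchor is resolved against the delivering resource;
    # an absolute anchor passes through urljoin unchanged.
    return urljoin(context_uri, anchor)

origin = "https://example.org/article/1"         # hypothetical origin resource
linkset = "https://example.org/linksets/1.json"  # hypothetical link set resource

# The same link gets a different anchor depending on who delivered it.
print(effective_anchor(None, origin))
print(effective_anchor(None, linkset))
print(effective_anchor("meta", origin))
print(effective_anchor("meta", linkset))
```

The last two calls show the risk discussed here: the relative anchor `meta` resolves to a different URI once the link is served by the link set resource instead of the origin resource.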

The current linkset I-D has explicit cautionary language in this regard, anticipating that the above behavior is most likely not what implementers want.

In addition, unlike the typical use of links in an HTTP header (follow your nose during an HTTP navigation session), links in a link set may be used in a standalone manner, meaning disconnected from the link set resource that, as per the above, is supposed to provide the URI/baseURI for anchors. In such standalone uses, the information about the link set resource that provided the linkset may no longer be available. As such, it would be good if linksets were self-contained, i.e. explicit about what the anchor of each link is.

The above tries to make the point that it would be beneficial to simultaneously:

Avoid the risk of misinterpretation of link anchors, which the I-D explicitly warns about
Make linksets self-contained regarding link anchors

I see two possible approaches:

1. Require that each link in a linkset has an anchor with an absolute URI
2. In serializations, require an information element similar to BASE in HTML to express a BaseURI for link anchors

Note that (2) could be achieved in JSON, especially when following proposal #103 by @BigBlueHat, which is the plan. But, it could not be achieved for the application/linkset serialization since it is a direct mapping of the Link header syntax.

dret commented 5 years ago

it seems this one currently has been resolved to require that anchors MUST be absolute. that makes sense as explained by @hvdsomp, but introduces a processing model that should be made very explicit in the spec: for serializing into the media types, all links must be parsed, and all anchors must be resolved to absolute URIs.
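A minimal sketch of that processing model, assuming links have already been parsed from a Link header into plain dictionaries. The function name `absolutize_links`, the dictionary keys, and the example URIs are hypothetical. Resolving `href` as well as `anchor` is an assumption that goes beyond the comment above, which only mentions anchors.

```python
from urllib.parse import urljoin

def absolutize_links(links, origin_uri):
    """Resolve every anchor (and target) against the origin resource's URI.

    links      -- parsed links, each a dict with "href", "rel" and an
                  optional "anchor" (hypothetical structure)
    origin_uri -- URI of the resource whose Link header carried the links
    """
    resolved = []
    for link in links:
        out = dict(link)
        # A missing anchor defaults to the origin resource itself;
        # urljoin(base, "") returns the base unchanged.
        out["anchor"] = urljoin(origin_uri, link.get("anchor", ""))
        # Assumption: targets are made absolute in the same pass.
        out["href"] = urljoin(origin_uri, link["href"])
        resolved.append(out)
    return resolved

parsed = [
    {"href": "/license", "rel": "license"},
    {"href": "author.html", "rel": "author", "anchor": "#sec2"},
]
for link in absolutize_links(parsed, "https://example.org/docs/paper"):
    print(link)
```

Once the links have gone through such a step, they no longer depend on knowing which resource delivered them.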

dret commented 5 years ago

On 2019-01-30 14:06, Herbert Van de Sompel wrote:

I see two possible approaches:

1. Require that each link in a linkset has an anchor with a complete (non-relative) URI

2. In serializations, require an information element similar to BASE in HTML to express a BaseURI for link anchors

Note that (2) could be achieved in JSON, especially when following proposal #103 https://github.com/dret/I-D/issues/103 by @BigBlueHat https://github.com/BigBlueHat, which is the plan. But, it could not be achieved for the application/linkset serialization since it is a direct mapping of the Link header syntax.

i think there is a third possibility to say that links with non-absolute anchors MUST be ignored if there is no well-defined context for the linkset. we could remain agnostic as to how this context is established. this would have the advantage of not creating different data models for native and JSON, and to still allow full round-trip fidelity from link headers to linksets and back.
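One way to picture this third option is the sketch below. It is illustrative only: `usable_links`, the dictionary keys, and the choice to treat a missing anchor like a relative one are assumptions, and how the context is obtained is deliberately left open, as in the comment above.

```python
from urllib.parse import urljoin, urlsplit

def usable_links(links, context_uri=None):
    """Keep links whose anchor is absolute or can be resolved.

    context_uri -- the linkset's context if the consumer knows it, else None
    """
    kept = []
    for link in links:
        anchor = link.get("anchor")
        if anchor is not None and urlsplit(anchor).scheme:
            # Absolute anchor: usable with or without context.
            kept.append(dict(link))
        elif context_uri is not None:
            # Relative or missing anchor: usable only because context is known.
            kept.append({**link, "anchor": urljoin(context_uri, anchor or "")})
        # Otherwise the link is ignored.
    return kept

links = [
    {"anchor": "https://example.org/a", "rel": "author", "href": "https://example.org/bob"},
    {"anchor": "b", "rel": "license", "href": "https://example.org/license"},
]
print(len(usable_links(links)))
print(len(usable_links(links, "https://example.org/dir/")))
```

The stored document is never rewritten here, so the original relative references remain available for a round trip back to a Link header.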

hvdsomp commented 5 years ago

I was working on getting this "ignore relative URIs" approach into a new version of the I-D, and I am afraid it just does not make sense to me. I already explained the reasons above, but I'll repeat:

Standalone: Link sets will be used in a standalone manner, be merged with other linksets, etc. So, one way or another, for these link sets to be usable, the contained links will eventually need to be expressed with absolute URIs for anchors and hrefs.
Confusion: Following existing conventions, if a link in a linkset has no anchor, then the URI of the link set resource (resource that delivered the linkset) is the anchor. And, most likely, that is not what is intended. Most likely what is intended is that the URI of the origin resource is supposed to be the anchor.
Confusion: Following existing conventions, if a link in a linkset has an anchor and its value is a relative URI, then the URI of the link set resource (resource that delivered the linkset) is the baseURI for the anchor. And, most likely, that is not what is intended. Most likely what is intended is that the URI is relative to that of the origin resource.
Confusion: Following existing conventions, if a link in a linkset has an href whose value is a relative URI, then the URI of the link set resource (resource that delivered the linkset) is the baseURI for the href. And, most likely, that is not what is intended. Most likely what is intended is that the URI is relative to that of the origin resource.

The confusion introduced by allowing relative URIs is therefore significant, and the approach is definitely error-prone. In addition, by allowing relative URIs, we are basically requiring the client to do the following:

If you "somehow" know the baseURL to which the URIs are relative, write them out as absolute URIs yourself, because we know you will need to store the links in absolute URI terms.
If you don't know the baseURL, ignore all relative URIs.

Based on the above, I am not sure what the "allow relative URIs and ignore them" adds as value. Agreed that relative URIs are meaningful when they are part of an HTTP interaction when there is no doubt about what their baseURL is. But link sets will be used outside of HTTP interactions and not providing absolute URIs leads to confusion and more work for the client.

stain commented 5 years ago

I think allowing relative URIs will open up a minefield which we know developers are not very good at handling (and particularly bad at generating).

I see the point of this linkset media type as precisely to decouple the expression of the links from the resource that has the links.

If the anchors (and targets) are allowed to be relative, it opens up many points of confusion:

Relative to the original resource, or relative to the linkset? All Linked Data and Web practices for base URI (e.g. as in CSS) say it should be the second, but developers may expect the first ("hey, you sent me here!")
False assumptions based on accidentally relative co-location - e.g. both in same folder, both use /bar - but a later move breaks those clients
Possibly duplicates with different relative paths (./foo vs foo vs /foo vs //example.com/foo vs http://example.com/foo) -- how are these nodes merged or looked up? Sounds like you need a full graph store! (This applies for both text and JSON variants)
Can extension relation types be relative URIs?
Does the relative URI <> ("" in JSON) mean the linkset itself?
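The duplicate-spelling point can be checked with a short sketch. The two base URIs are hypothetical. The snippet only shows what standard reference resolution via `urljoin` does with the five spellings listed above.

```python
from urllib.parse import urljoin

# The five spellings from the list above.
SPELLINGS = ["./foo", "foo", "/foo", "//example.com/foo", "http://example.com/foo"]

def resolve_all(base):
    """Map each spelling to its absolute form relative to the given base."""
    return {ref: urljoin(base, ref) for ref in SPELLINGS}

# Linkset served from the site root: every spelling names the same node.
at_root = resolve_all("http://example.com/linkset")
# Same linkset moved one folder down: the spellings now split into two nodes.
moved = resolve_all("http://example.com/sets/linkset")

print(sorted(set(at_root.values())))
print(sorted(set(moved.values())))
```

Whether two relative references identify the same node therefore depends on the base, which is why merging or lookup needs a resolution step first and why a later move can silently break clients.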

I think it should be possible to use a linkset file detached from its origin - e.g. save it to disk or do a simple lookup - without having to further process other HTTP headers (beyond Content-Type) to interpret it - e.g. having to deal with Content-Location on the linkset resource.

So I think a core principle should be that a linkset file can be saved and reused standalone without having to process anything first. This would be the case for both application/linkset and application/linkset+json variants.

In that sense the motivation is the same as the simplicity of the N-Triples format where absolute URIs can be treated as opaque strings.
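To illustrate the opaque-string idea, here is a small sketch that merges two linksets and looks links up by anchor using nothing but string equality. The function `merge_linksets` and the dictionary layout are hypothetical and do not reflect either serialization.

```python
from collections import defaultdict

def merge_linksets(*linksets):
    """Index links from several linksets by their (absolute) anchor string."""
    index = defaultdict(list)
    for linkset in linksets:
        for link in linkset:
            # No parsing or resolution: the anchor is just a dictionary key.
            index[link["anchor"]].append((link["rel"], link["href"]))
    return index

set_a = [{"anchor": "http://example.com/foo", "rel": "author",
          "href": "http://example.com/people/alice"}]
set_b = [{"anchor": "http://example.com/foo", "rel": "license",
          "href": "http://example.com/license"}]

index = merge_linksets(set_a, set_b)
print(index["http://example.com/foo"])
```

This only works because every anchor is already absolute. The flip side is that a lookup must use exactly the same string, so `http://example.com/foo` and `https://www.example.com/foo` are different keys.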

Counter arguments

sorry, have to be devil's advocate here..

Always having absolute URIs in the linkset means the client has to be more diligent about recording the URI they want to look up, e.g. http://example.com/foo vs http://www.example.com/foo vs https://www.example.com/foo vs https://www.example.com/foo.html vs https://www.example.com/foo.html?query vs https://www.example.com/foo.html?query#baz

Always using absolute URIs means that clients always have to calculate the absolute URI of the requested document (after redirections) rather than guessing based on neighbouring folders. In practice, they check the Content-Location according to RFC 7231 section 3.1.4.1 so they can find it again in the linkset. However, they have to do that anyway if relative URIs are allowed, since they don't know whether absolute URIs are used or not.

It might be that an HTTP server wants to read a neighbouring linkset file (e.g. ./.linkset.json) to produce the Link headers. In this case it would be much better if the paths were relative so they work however the file is served. The same argument applies to someone storing a linkset file outside the web, e.g. in a GitHub repo.

Q: Is http://example.com/foo/../bar/ a valid absolute URI?
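A hedged sketch of how one might look at this question in code. The URI carries its own scheme, so it is not a relative reference, but it is not in normalized form either. The helper `remove_dot_segments` below is a simplified take on the algorithm in RFC 3986 section 5.2.4, not a full implementation, and all function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def has_scheme(uri):
    """True if the reference carries its own scheme, i.e. is not relative."""
    return bool(urlsplit(uri).scheme)

def remove_dot_segments(path):
    """Simplified dot-segment removal for slash-separated paths."""
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == "..":
            if len(out) > 1:   # never pop the leading empty segment
                out.pop()
        elif seg != ".":
            out.append(seg)
    if segments[-1] in (".", ".."):
        out.append("")         # a trailing dot segment names a directory
    return "/".join(out)

def normalize(uri):
    """Return the URI with dot segments removed from its path."""
    parts = urlsplit(uri)
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))

uri = "http://example.com/foo/../bar/"
print(has_scheme(uri))
print(normalize(uri))
print(uri == normalize(uri))
```

If anchors are compared as opaque strings, the unnormalized and normalized forms would count as different anchors, so a spec requiring absolute URIs may also want to say something about normalization.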

Parsing as JSON-LD

It depends on whether we say the URIs SHOULD be absolute or MUST be absolute. In the first case, normal JSON-LD parsing with the Content-Location etc. of the linkset will work well. In the second case, a conforming parser should instead parse it with @context: { .., "@base": "invalid:///"} to avoid accidental relative URIs. These can then be filtered out from the triples.

dret commented 5 years ago

On 2019-08-01 07:40, Stian Soiland-Reyes wrote:

I see the point of this linkset media type as precisely to decouple the expression of the links from the resource that has the links.

the point of the media type is to be as faithful as possible to the link header model and format, and to allow round-tripping without introducing complex processing rules. we should simply say that relative URIs are meaningless without a context and should be ignored unless the context can be preserved/determined in some shape or form.

dret commented 5 years ago

On 2019-08-01 05:45, Herbert Van de Sompel wrote:

Standalone: Link sets will be used in a standalone manner, be merged with other linksets, etc. So, one way or another, for these link sets to be usable, the contained links will eventually need to be expressed with absolute URIs for anchors and hrefs.

not if the context is preserved. how that's done is not for us to define or determine, but linksets with relative URIs are perfectly fine standalone when the context URI is preserved.

Confusion: Following existing conventions, if a link in a linkset has no anchor, then the URI of the link set resource (resource that delivered the linkset) is the anchor. And, most likely, that is not what is intended. Most likely what is intended is that the URI of the origin resource is supposed to be the anchor.

true. i am not sure how to best deal with this. no matter how strong we state that this interpretation is wrong, people will still do it.

Confusion: Following existing conventions, if a link in a linkset has an anchor and its value is a relative URI, then the URI of the link set resource (resource that delivered the linkset) is the baseURI for the anchor. And, most likely, that is not what is intended. Most likely what is intended is that the URI is relative to that of the origin resource.

that's the same as the above, right?

Confusion: Following existing conventions, if a link in a linkset has an href whose value is a relative URI, then the URI of the link set resource (resource that delivered the linkset) is the baseURI for the href. And, most likely, that is not what is intended. Most likely what is intended is that the URI is relative to that of the origin resource.

that's the same as the above, right?

The confusion introduced by allowing relative URIs is therefore significant, and the approach is definitely error-prone. In addition, by allowing relative URIs, we are basically requiring the client to do the following:

If you "somehow" know the baseURL to which the URIs are relative, write them out as absolute URIs yourself, because we know you will need to store the links in absolute URI terms.

no need to write them out. just standard URI resolution. but yes, you need to know the URI to resolve against.

If you don't know the baseURL, ignore all relative URIs.

yes.

Based on the above, I am not sure what the "allow relative URIs and ignore them" adds as value.

allowing the format to be roundtrip-able and not adding a processing model that no doubt will be ignored by some implementations as well.

Agreed that relative URIs are meaningful when they are part of an HTTP interaction when there is no doubt about what their baseURL is. But link sets will be used outside of HTTP interactions and not providing absolute URIs leads to confusion and more work for the client.

you just shift the work differently, right? either require work to resolve/filter upfront, or do it later. and in any case, even if we allow relative, implementations would still be free to resolve if they want to.

in fact, maybe that could be a good way to address this conundrum: have a section on relative URIs and their issues. then add one sub-section on what that means for direct reuse (context needs to be preserved or relative URIs need to be updated), and one sub-section on resolving all URIs and what that means (no more roundtrip-ability, allowing linksets to be standalone).

hvdsomp commented 5 years ago

The point made several times above is that the relative links will not be ignored, because clients will process them the way they are used to processing links. Hence the repeated use of "Confusion" above.

Regarding "faithful": When mandating absolute URIs the media type remains faithful to the link header model. Link headers can represent absolute URIs.
Regarding round tripping: The round tripping that may be required is between documents in application/linkset and documents in application/linkset+json. There is no need to roundtrip between links in a Link header and links in link set documents, because those links are provided by different parties: the origin resource and the link set resource, respectively. The former in the context of an HTTP interaction (hence relative URIs are OK and interpreted relative to the origin's URL), the latter not (hence relative URIs are not OK because they must be interpreted relative to the link set resource, which is most likely not what is intended).
Regarding "shifting the work": Indeed. But the link set resource is all about providing links. It's its sole reason for being. It was invented for that purpose; it's a new entity in the ecology. So, let it do the work rather than telling existing entities to behave differently and ignore relative URIs.

dret commented 5 years ago

On 2019-08-01 08:44, Herbert Van de Sompel wrote:

Regarding "shifting the work": Indeed. But the link set resource is all about providing links. It's its sole reason for being. It was invented for that purpose; it's a new entity in the ecology. So, let it do the work rather than telling existing entities to behave differently and ignore relative URIs.

it's not about being lazy. it's about documenting semantics. if you process application/linkset, you know the media type. if the media type tells you to be mindful of relative URIs, that's what you need to do: only resolve those if you know the original context, and ignore them otherwise. that's perfectly well-defined and in fact allows more lazy behavior (which is typically what happens in real life regardless of what specs are trying to say).

asked the other way around: if you categorically disallow relative URIs (which makes linkset incongruent with link headers), what rule do you define if there is one in a linkset? and keep in mind that regardless of what we are discussing here, you will find plenty of those in the wild, because people are lazy.

hvdsomp commented 5 years ago

If one finds a relative URI in a format that looks a lot like other document formats with links, lazy as one is, one will do what is typically done in those cases: use the URL of the responding resource as base for the link. And hence end up with a whole bunch of unintended links. This will happen because parties that consume these links will feel they don’t need to read the spec. The link set documents, in both formats, are rather self explanatory from a consumption perspective, after all.

In order to avoid the misinterpretation of links with relative URIs, define the format so as not to allow relative URIs. Parties that publish link sets will need to read the spec whichever way (what’s intuitive to consume is not intuitive to create, definitely not for the JSON) and hence will learn they need to write absolute URIs. This prevents the above problem from happening. If it happens anyhow, the document is in violation of the format spec.

I explained in a previous comment why I think the “lack of congruence” argument is IMO a red herring.

dret commented 5 years ago

On 2019-08-01 09:22, Herbert Van de Sompel wrote:

If one finds a relative URI in a format that looks a lot like other document formats with links, lazy as one is, one will do what is typically done in those cases: use the URL of the responding resource as base for the link. And hence end up with a whole bunch of unintended links. This will happen because parties that consume these links will feel they don’t need to read the spec. The link set documents, in both formats, are rather self explanatory from a consumption perspective, after all.

all understood, but my question was what you'd define in the spec about processing relative URIs if you disallow them. semantics are undefined? they must be ignored? the whole linkset must be ignored?

the point is: undoubtedly these linksets will exist en masse, if we say "linkset is like a link header, but not quite". assuming that everybody will resolve/normalize is unrealistic. defining what to do would be useful. what would you define?
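The options raised in this question (ignore the offending links, or ignore the whole linkset) can be made concrete with a sketch. Nothing here is taken from the I-D: the policy names `ignore_link` and `reject_linkset`, the function `apply_policy`, and the dictionary layout are all hypothetical.

```python
from urllib.parse import urlsplit

def is_absolute(uri):
    """Treat any reference that carries a scheme as absolute."""
    return bool(urlsplit(uri).scheme)

def apply_policy(links, policy):
    """Handle relative URIs found in a linkset that is supposed to have none."""
    def conforms(link):
        # A missing anchor counts as non-conforming, like a relative one.
        return is_absolute(link.get("anchor", "")) and is_absolute(link["href"])

    if policy == "ignore_link":
        # Drop only the links that violate the rule.
        return [link for link in links if conforms(link)]
    if policy == "reject_linkset":
        # One violation invalidates the whole document.
        if all(conforms(link) for link in links):
            return list(links)
        raise ValueError("linkset contains a relative URI")
    raise ValueError(f"unknown policy: {policy}")

links = [
    {"anchor": "https://example.org/a", "rel": "author", "href": "https://example.org/bob"},
    {"anchor": "b", "rel": "license", "href": "/license"},
]
print(apply_policy(links, "ignore_link"))
```

Leaving the semantics undefined would correspond to having no such function at all, in which case each consumer picks its own behavior.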

In order to avoid the misinterpretation of links with relative URIs, define the format so as not to allow relative URIs.

i do understand this. but we're not on a green field here. we started out with the mission to define a media type for link header field payloads, and not something new.

Parties that publish link sets will need to read the spec whichever way (what’s intuitive to consume is not intuitive to create, definitely not for the JSON) and hence will learn they need to write absolute URIs. Which prevents the above problem from happening. If it happens anyhow, the document is in violation of the format spec.

but now what? these things will exist en masse, even if you tell people that when they do what they will do that they do the wrong thing.

I explained in a previous comment why I think the “lack of congruence” argument is IMO a red herring.

what's red herring about it? i'd like to be able to simply store the contents of a link header field. you're telling me i can't do that. i think that's a discussion that can be had.

hvdsomp commented 5 years ago

You state: we started out with the mission to define a media type for link header field payloads, and not something new.

That is really incorrect. We started on a mission to provide links by reference (instead of by value in the Link header of the origin resource) and provide motivations in the I-D why this is valuable. And we decided to use the format of the Link header as the basis, because that’s what exists. And then we started understanding that there is something different about providing links by reference:

relative URIs can be misinterpreted
sets of links will be used in standalone manners, disconnected from HTTP interactions

dret commented 4 years ago

done.