boostorg / url

Boost.URL is a library for manipulating Uniform Resource Identifiers (URIs) and Locators (URLs).
https://www.boost.org/doc/libs/master/libs/url/index.html
Boost Software License 1.0

Review: explore IRIs #441

Open alandefreitas opened 2 years ago

alandefreitas commented 2 years ago

The lack of IRI support is unfortunate. It's 2022; we should all be writing software with Unicode support by default. However, this can be built on top of Boost.URL, and isn't needed in all cases.
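As a rough sketch of what building on top would look like: RFC 3987 maps an IRI to a URI by percent-encoding the UTF-8 bytes of every non-ASCII character, after which the result parses as an ordinary RFC 3986 URI. The `iri_to_uri` helper below is hypothetical and ignores host IDNA and bidi rules:

```cpp
// Sketch of the RFC 3987 IRI-to-URI mapping on top of Boost.URL:
// percent-encode each non-ASCII UTF-8 byte, then parse as a URI.
#include <boost/url.hpp>
#include <iostream>
#include <string>

// Hypothetical helper; not part of Boost.URL.
std::string iri_to_uri(std::string const& iri)
{
    static char const hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : iri)
    {
        if (c < 0x80)
        {
            out += static_cast<char>(c);
        }
        else
        {
            // percent-encode the raw UTF-8 byte
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0xF];
        }
    }
    return out;
}

int main()
{
    std::string uri = iri_to_uri("http://example.com/files/교회");
    auto r = boost::urls::parse_uri(uri);
    if (r)
        std::cout << *r << "\n"; // http://example.com/files/%EA%B5%90%ED%9A%8C
}
```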

What are the use cases for IRIs?

From my point of view, all I can do is study IRIs more.

alandefreitas commented 2 years ago

> all I can do is study IRIs more

More and more, I've been noticing how true this is. I can't even evaluate whether some of the comments on IRIs make sense; I can only answer them by studying IRIs.

alandefreitas commented 2 years ago

I read a little more about IRIs and I don't think we should touch this until all else is stable.

Some changes we would need:

Design:

petrepircalabu commented 1 year ago

I have a question related to the use cases of IRIs. Is there any requirement in the HTTP-related RFCs (RFC 9112 and its companions) that a server MUST support non-ASCII request targets without percent-encoding? For example, when using curl to send a request for a non-ASCII resource, the target is sent not percent-encoded but as a raw UTF-8 string:

`curl -x 10.22.65.19:7074 -vvv "http://10.17.13.30:8084/files/교회-요양원-모임고리로감염확산"`
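For reference, a strict RFC 3986 parser rejects that raw target, since non-ASCII bytes are outside the URI grammar; a quick check with Boost.URL:

```cpp
// Sketch: feeding curl's raw UTF-8 target to a strict RFC 3986 parser.
// Non-ASCII bytes are not allowed by the grammar, so parse_uri fails.
#include <boost/url.hpp>
#include <iostream>

int main()
{
    auto r = boost::urls::parse_uri(
        "http://10.17.13.30:8084/files/교회-요양원-모임고리로감염확산");
    if (!r)
        std::cout << "parse error: " << r.error().message() << "\n";
}
```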

alandefreitas commented 1 year ago

I don't think so. Requiring a server to accept some kind of URL is equivalent to saying the server must contain a given resource. If you have no resource to associate with 교회-요양원-모임고리로감염확산, there's nothing to support here. curl tends to accept loose inputs because it's a producer that can talk to any server, but I don't know whether it converts these inputs to something the server should understand or just passes the URL through. Everything usually tends to be fine once the input is converted to a properly percent-escaped URL.
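To illustrate that conversion, one option is to build the URL with Boost.URL's modifying members, which percent-encode as needed; the host and path here are just the values from the curl example above:

```cpp
// Sketch: converting the loose input to a properly percent-escaped URL.
// set_path() takes the decoded path and percent-encodes it as needed.
#include <boost/url.hpp>
#include <iostream>

int main()
{
    boost::urls::url u("http://10.17.13.30:8084");
    u.set_path("/files/교회-요양원-모임고리로감염확산");

    // prints http://10.17.13.30:8084/files/%EA%B5%90%ED%9A%8C-...
    std::cout << u << "\n";
}
```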

I think there are two opinions on this: either producers should accept loose input and sanitize it into valid URLs before it goes on the wire, or consumers should reject any input that isn't already valid.

One supposed advantage of the first point of view is that we would get stricter parsers over time, because servers wouldn't need to handle the loose input that bad producers generate; servers could eliminate workarounds over time.

The second point of view has the advantage that bad producers and consumers are iteratively pushed out of the ecosystem until everyone complies with proper input and no ambiguous variations remain.

For Boost.URL, in my opinion, the second point of view seems more reasonable considering the use cases it serves, which are usually machine-to-machine communication.

For instance, if we considered a browser address bar the primary use case, where the communication is user-to-machine, then Boost.URL would need to accept and sanitize a lot of invalid input. There's no exact protocol to follow there, because humans can fail in lots of different ways. But that leniency is bad for machine-to-machine communication even when it works: if Boost.URL handled that use case by default, servers could end up routing invalid URLs to resources without meaning to, and clients could be making requests to invalid URLs.

That doesn't mean Boost.URL can't help users produce valid input. For instance, that's what urls::format is meant to do.
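For example (the host and file name are just the ones from the curl command above), format percent-escapes each substituted argument for the component it lands in:

```cpp
// Sketch: urls::format builds a valid URL from a format string,
// percent-escaping the substituted arguments as required.
#include <boost/url.hpp>
#include <boost/url/format.hpp>
#include <iostream>

int main()
{
    boost::urls::url u = boost::urls::format(
        "http://{}:{}/files/{}",
        "10.17.13.30", 8084, "교회-요양원-모임고리로감염확산");

    // prints the URL with the path segment percent-escaped
    std::cout << u << "\n";
}
```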