adoptium / adoptium-support

For end-user problems reported with our binary distributions

Euro currency sign in URI path is not encoded as per JavaDoc but as per IRI #637

Closed pekuz closed 1 year ago

pekuz commented 1 year ago

Please provide a brief summary of the bug

Somewhat related to #611

new URI("http", "host", "/11€", null)

Javadoc reads:

 *   <li><p><a id="encode"></a> A character is <i>encoded</i> by replacing it
 *   with the sequence of escaped octets that represent that character in the
 *   UTF-8 character set.  The Euro currency symbol ({@code '\u005Cu20AC'}),
 *   for example, is encoded as {@code "%E2%82%AC"}.  <i>(<b>Deviation from
 *   RFC&nbsp;2396</b>, which does not specify any particular character
 *   set.)</i> </p></li>

but toString() returns:

http://host/11€

which resembles IRI.

Please provide steps to reproduce where possible

No response

Expected Results

JavaDoc is updated to match the observed behaviour, likely the IRI RFC.

Actual Results

There is a discrepancy between the JavaDoc and the observed behaviour.

What Java Version are you using?

jdk-17.0.5+8

What is your operating system and platform?

No response

How did you install Java?

No response

Did it work before?

No response

Did you test with other Java versions?

Java 11

Relevant log output

No response

karianna commented 1 year ago

@pekuz - I'm not 100% sure but I think toString may just be converting the character encoding back to the Euro symbol for display. I'm not sure this is incorrect behaviour?

pekuz commented 1 year ago

According to the URI JavaDoc, the euro currency sign should be encoded as "%E2%82%AC", a US-ASCII string that is safe for almost any terminal. In reality, Java's URI does not encode the euro currency sign.

For me, the observed behaviour is pretty usable too; it is closer to RFC 3987, so I propose treating this issue as a URI JavaDoc bug rather than a URI implementation bug.

FYI, a hex dump of the UTF-8 encoded Java source reads:

22 2F 31 31 E2 82 AC 22  "/11€"

so the input did contain the euro currency sign (E2 82 AC), even though I did not use the Java \u escape sequence.
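For anyone who wants to double-check the byte values without a hex editor, here is a minimal sketch that prints the UTF-8 bytes of the path string. The class name `EuroBytes` and the helper `utf8Hex` are made up for illustration; only `String.getBytes` and `StandardCharsets.UTF_8` come from the JDK.

```java
import java.nio.charset.StandardCharsets;

public class EuroBytes {
    // Hypothetical helper: renders the UTF-8 bytes of a string
    // as space-separated, uppercase, two-digit hex values.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes do not print as FFFFFFxx.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The escaped form and the literal euro sign denote the same character.
        System.out.println(utf8Hex("/11\u20AC")); // 2F 31 31 E2 82 AC
    }
}
```

The last three bytes are the same octets that the JavaDoc example shows in percent-escaped form.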

pekuz commented 1 year ago

Actual behaviour is:

    assertEquals("http://host/11\u20AC", new URI("http", "host", "/11\u20AC", null).toString());
    assertEquals("http://host/11%E2%82%AC", new URI("http", "host", "/11\u20AC", null).toASCIIString());

So the class-level JavaDoc paragraph on encoding applies to toASCIIString(). That is, as documented, a deviation from RFC 2396, so it will interoperate only with platforms that opted for exactly the same deviation.

Now, if toString() follows any RFC or other widely accepted standard, could it be documented, please?
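The two assertions above can be turned into a small standalone program. This is only a sketch: the class name `UriEuroDemo` and its helper methods are hypothetical, and the round trip through `URI.create` is an extra step added here to show that the escaped form decodes back to the same path. The euro sign may print differently depending on the console encoding.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriEuroDemo {
    // Builds the URI from the report via the (scheme, host, path, fragment) constructor.
    static URI euroUri() {
        try {
            return new URI("http", "host", "/11\u20AC", null);
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    // Form returned by toString(): the euro sign is kept as-is.
    static String display() {
        return euroUri().toString();
    }

    // Form returned by toASCIIString(): the euro sign becomes escaped UTF-8 octets.
    static String ascii() {
        return euroUri().toASCIIString();
    }

    // Parses the ASCII form again and returns the decoded path.
    static String roundTripPath() {
        return URI.create(ascii()).getPath();
    }

    public static void main(String[] args) {
        System.out.println(display());        // http://host/11€
        System.out.println(ascii());          // http://host/11%E2%82%AC
        System.out.println(roundTripPath());  // /11€
        System.out.println(roundTripPath().equals(euroUri().getPath())); // true
    }
}
```

So within Java the two forms carry the same path; the open question is only which external standard, if any, each form is meant to follow.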

karianna commented 1 year ago

https://bugs.openjdk.org/browse/JDK-8298064

karianna commented 1 year ago

@pekuz are you able to take a look at Oracle's comments on the ticket? It sounds like it is by design, but they are willing to alter the Javadoc (and are hoping for some guidance).

https://bugs.openjdk.org/browse/JDK-8298064

pekuz commented 1 year ago

The idea is to determine whether the documented behaviour implements some later and/or wider standard, such as an RFC, or whether we are left with "as per URI JavaDoc", which is interoperable only in a Java-to-Java scope.

That is a bit hard to determine, so I am OK with leaving it on the back burner.