`^` char breaks URL parsing

NicolaIsotta commented 1 year ago

Apache NetBeans version

Apache NetBeans 18

What happened

Remote CSS is not shown if its url contains a ^

How to reproduce

Add this to an html page:

<link rel="stylesheet" href="https://unpkg.com/primeflex@^3/primeflex.css"/>

Did this work correctly in an earlier version?

No / Don't know

Operating System

Windows 10 version 10.0 running on amd64; Cp1252; it_IT (nb)

JDK

11.0.17; OpenJDK 64-Bit Server VM 11.0.17+8

Apache NetBeans packaging

Apache NetBeans binary zip

Anything else

stack trace

SEVERE [org.openide.util.RequestProcessor]: Error in RequestProcessor org.netbeans.modules.navigator.NavigatorController$1
java.net.URISyntaxException: Illegal character in path at index 28: https://unpkg.com/primeflex@^3/primeflex.css
    at java.base/java.net.URI$Parser.fail(URI.java:2913)
    at java.base/java.net.URI$Parser.checkChars(URI.java:3084)
    at java.base/java.net.URI$Parser.parseHierarchical(URI.java:3166)
    at java.base/java.net.URI$Parser.parse(URI.java:3114)
    at java.base/java.net.URI.<init>(URI.java:600)
    at java.base/java.net.URL.toURI(URL.java:1061)
    at org.openide.filesystems.FileObject.toURI(FileObject.java:1270)
Caused: java.lang.IllegalStateException
    at org.openide.filesystems.FileObject.toURI(FileObject.java:1275)
    at org.netbeans.modules.navigator.ProviderRegistry.getProviders(ProviderRegistry.java:98)
    at org.netbeans.modules.navigator.NavigatorController.obtainProviders(NavigatorController.java:593)
    at org.netbeans.modules.navigator.NavigatorController.access$200(NavigatorController.java:76)
    at org.netbeans.modules.navigator.NavigatorController$1.run(NavigatorController.java:391)
    at org.openide.util.RequestProcessor$Task.run(RequestProcessor.java:1419)
    at org.netbeans.modules.openide.util.GlobalLookup.execute(GlobalLookup.java:45)
    at org.openide.util.lookup.Lookups.executeWith(Lookups.java:287)
[catch] at org.openide.util.RequestProcessor$Processor.run(RequestProcessor.java:2034)

Are you willing to submit a pull request?

No

matthiasblaesing commented 1 year ago

The problem is that you did not specify a URI, at least to my understanding. I had a look at RFC 3986 (Uniform Resource Identifier (URI): Generic Syntax). My understanding is that the problematic part is parsed as:

-> URI (https://unpkg.com/primeflex@^3/primeflex.css) -> hier-part (//unpkg.com/primeflex@^3/primeflex.css) -> path-abempty (/primeflex@^3/primeflex.css) -> segment (primeflex@^3)

Below segment there is no matching production. A segment is defined as a sequence of pchar elements, each of which can be:

unreserved (ALPHA / DIGIT / "-" / "." / "_" / "~")
pct-encoded ("%" HEXDIG HEXDIG)
sub-delims ("!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=")
:
@

none of these match "^".
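To make the grammar check above concrete, here is a minimal sketch in Java that tests a path segment against the pchar alternatives listed. The class and method names (PcharCheck, isLiteralPchar, firstInvalidIndex) are made up for illustration and are not part of NetBeans or the JDK.

```java
final class PcharCheck {
    // unreserved punctuation and sub-delims as listed in RFC 3986
    private static final String UNRESERVED_PUNCT = "-._~";
    private static final String SUB_DELIMS = "!$&'()*+,;=";

    /** True if c may appear literally in a segment (pct-encoded is handled separately). */
    static boolean isLiteralPchar(char c) {
        boolean alpha = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
        boolean digit = c >= '0' && c <= '9';
        return alpha || digit
                || UNRESERVED_PUNCT.indexOf(c) >= 0
                || SUB_DELIMS.indexOf(c) >= 0
                || c == ':' || c == '@';
    }

    static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    /** Index of the first character that is not a pchar, or -1 if the segment is valid. */
    static int firstInvalidIndex(String segment) {
        for (int i = 0; i < segment.length(); i++) {
            char c = segment.charAt(i);
            if (c == '%') {
                // pct-encoded = "%" HEXDIG HEXDIG
                if (i + 2 < segment.length()
                        && isHex(segment.charAt(i + 1))
                        && isHex(segment.charAt(i + 2))) {
                    i += 2;
                    continue;
                }
                return i;
            }
            if (!isLiteralPchar(c)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstInvalidIndex("primeflex@^3"));   // prints 10 (the '^')
        System.out.println(firstInvalidIndex("primeflex@%5E3")); // prints -1
    }
}
```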

TL;DR: The browsers are wrong to accept this as a URL and not reject it.

The good news: If you use a valid URL it still works: https://unpkg.com/primeflex@%5E3/primeflex.css
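For reference, the difference between the two spellings can be reproduced with java.net.URI alone, outside the IDE. This is only a sketch; the class name CaretUriDemo and the helper parseErrorIndex are invented for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class CaretUriDemo {
    static final String RAW = "https://unpkg.com/primeflex@^3/primeflex.css";
    static final String ENCODED = "https://unpkg.com/primeflex@%5E3/primeflex.css";

    /** Returns -1 if the string parses as a URI, otherwise the index reported by the parser. */
    static int parseErrorIndex(String s) {
        try {
            new URI(s);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseErrorIndex(RAW));          // prints 28, as in the stack trace
        System.out.println(parseErrorIndex(ENCODED));      // prints -1
        // getPath() decodes escapes, so the original character is still recoverable
        System.out.println(URI.create(ENCODED).getPath()); // prints /primeflex@^3/primeflex.css
    }
}
```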

matthiasblaesing commented 1 year ago

Please indicate whether this helps.

NicolaIsotta commented 1 year ago

I've tested other browsers/IDEs and they seem to automatically encode the URI. Here's a VS Code screenshot for example:

Maybe encoding the string according to RFC 3986 before creating the URI would avoid this kind of exception?

matthiasblaesing commented 1 year ago

Sorry, I don't see a sane way to do this. When would you encode a character and when not? You can argue that a disallowed character gets encoded and the others don't, but then what do you make of this:

https://test.invalid/path%20with%20spaces

When trying to guess whether this has to be decoded, the path component might mean /path%20with%20spaces or /path with spaces. On the other hand, if it is encoded again, it might also become https://test.invalid/path%2520with%2520spaces.
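The ambiguity is visible in the JDK itself. The multi-argument java.net.URI constructors quote illegal characters, which handles the ^ case, but they also always quote the percent character, so an already-encoded path is encoded a second time. A small sketch (the class name QuotingAmbiguityDemo and the helper quotePath are only for illustration):

```java
import java.net.URI;
import java.net.URISyntaxException;

final class QuotingAmbiguityDemo {
    /** Builds an https URI from host and path, letting the constructor quote the path. */
    static String quotePath(String host, String path) {
        try {
            // URI(scheme, host, path, fragment) quotes illegal characters, including '%'
            return new URI("https", host, path, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The caret is quoted as hoped:
        System.out.println(quotePath("unpkg.com", "/primeflex@^3/primeflex.css"));
        // prints https://unpkg.com/primeflex@%5E3/primeflex.css

        // A literal space is quoted too:
        System.out.println(quotePath("test.invalid", "/path with spaces"));
        // prints https://test.invalid/path%20with%20spaces

        // But input that was already encoded is encoded again:
        System.out.println(quotePath("test.invalid", "/path%20with%20spaces"));
        // prints https://test.invalid/path%2520with%2520spaces
    }
}
```

Which of the last two results is "right" depends on whether the caller meant the input as raw text or as an already-encoded path, and the string alone does not say.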

As you might have guessed from my reply, I don't like this "let's interpret the most broken code until it works somehow" attitude in web development. If someone can solve this problem without introducing security problems, I'll review a fix. Until that happens, from my POV this works as designed.

Chris2011 commented 1 year ago

As far as I know, browsers also encode URL parts such as a space to %20, and if you put the URL with the encoded space back into the address bar, it will not be encoded again. Yes, this is not what the spec was designed for, but we all know there are helpful behaviours that were never part of the original design. This is simply a better developer experience, and of course I see your point. Should this take precedence over an RFC that can't handle the case? I would say it depends. Also, maybe they forgot to add ^. I would just make it better than the RFC. My 2 cents.
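A rough sketch of that "encode only what is illegal, leave existing escapes alone" behaviour could look like the following. It is a heuristic written for this discussion, not what any browser or NetBeans actually implements; LenientUrlEncoder and encodeIllegalOnly are hypothetical names, and the sketch assumes that any % followed by two hex digits is an intentional escape, which is exactly the guess questioned earlier in the thread.

```java
import java.nio.charset.StandardCharsets;

final class LenientUrlEncoder {
    // RFC 3986 unreserved punctuation, gen-delims and sub-delims
    private static final String ALLOWED_PUNCT = "-._~" + ":/?#[]@" + "!$&'()*+,;=";

    static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    static boolean isAllowed(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                || (c >= '0' && c <= '9') || ALLOWED_PUNCT.indexOf(c) >= 0;
    }

    /** Percent-encodes characters outside the RFC 3986 set; existing %XX escapes are kept. */
    static String encodeIllegalOnly(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            boolean existingEscape = c == '%' && i + 2 < input.length()
                    && isHex(input.charAt(i + 1)) && isHex(input.charAt(i + 2));
            if (existingEscape) {
                out.append(input, i, i + 3); // assumed to be intentional, copied as-is
                i += 3;
            } else if (isAllowed(c)) {
                out.append(c);
                i++;
            } else {
                // encode the full code point as UTF-8 bytes
                int cp = input.codePointAt(i);
                byte[] bytes = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    out.append(String.format("%%%02X", b & 0xFF));
                }
                i += Character.charCount(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeIllegalOnly("https://unpkg.com/primeflex@^3/primeflex.css"));
        // prints https://unpkg.com/primeflex@%5E3/primeflex.css
        System.out.println(encodeIllegalOnly("https://test.invalid/path%20with%20spaces"));
        // prints https://test.invalid/path%20with%20spaces (unchanged)
    }
}
```

Running the output through the method a second time leaves it unchanged, which matches the address-bar behaviour described above, but a path that really contains the literal text %20 would still be misread.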

Chris2011 commented 1 year ago

I also had a quick look into the RFC, but I only saw the regex in Appendix B and I'm not very familiar with RFCs. When I check the regex against the given URL, there is no problem parsing it: https://regex101.com/r/S3E5BM/1 and it matches the part after the TLD correctly as one part. Yes, the regex seems more generic and less error-prone.

neilcsmith-net commented 1 year ago

I think it's a valid URL according to the HTML spec - bit old but see this https://www.w3.org/TR/2011/WD-html5-20110525/urls.html#parsing-urls which specifically mentions that character.

matthiasblaesing commented 1 year ago

Ok, fine. So HTML even put it in writing that it deliberately breaks existing specifications, invalidating existing tools. Great, they just went down another notch on my respect scale. Let's reopen and see if anyone is willing to fix this mess and write an "HTML URL" to "real URL" translator to handle these cases.

neilcsmith-net commented 1 year ago

Looks like there might be a few library options that could handle this (e.g. OkHttp HttpUrl)?

@matthiasblaesing yes, this bit is great! :grimacing:

The term "URL" in this specification is used in a manner distinct from the precise technical meaning it is given in RFC 3986. Readers familiar with that RFC will find it easier to read this specification if they pretend the term "URL" as used herein is really called something else altogether. This is a willful violation of RFC 3986.

matthiasblaesing commented 1 year ago

https://github.com/smola/galimatias might be an alternative. We already use it in the context of the httpparser/validator.
