Infinite or very long loop in URL validation URL

sshort commented 8 years ago

Using ESAPI to validate URL with the default regex in the properties file causes some URLs to loop for a very long time, while hitting high, e.g. 30% CPU usage.

Call validator.isValidInput with the following URL:

"http://mvcrzaufyh.youtube.com.hwq19b2dycja1red.youtube.com.25sq0l9.1cslt00rcg49.youtube.com.84116406.controversialtopic.net.//6js73/Q151E3//...//Q151E3//96093547/9863552/684833.bvmi?cCsRPsbcs2zDcvjJhczcvScwcdN6jcdMt"

Example stack trace:

stack_trace.txt

xeno6696 commented 8 years ago

This is why you don't want to use regex to "validate" URLs. I think you've discovered a worst-case input.

xeno6696 commented 8 years ago

Also, what version of ESAPI, and could you please give us the full method call as well as the exact regex rule backed by validation.properties?

sshort commented 8 years ago

Agreed - I wouldn't choose to use a regex for this. Using ESAPI 2.1.0.

Regex: ^(?:(?:https?|ftp):\\/\\/)(?:\\S+(?::\\S*)?@)?(?:(?!(?:)(?:\\.\\d{1,3}){3})(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,})))(?::\\d{2,5})?(?:\\/\\S*)?$

xeno6696 commented 8 years ago

I cannot reproduce this, @kwwall I ran the following unit-test in ValidatorTest:

assertTrue(instance.isValidInput("test", "http://mvcrzaufyh.youtube.com.hwq19b2dycja1red.youtube.com.25sq0l9.1cslt00rcg49.youtube.com.84116406.controversialtopic.net.//6js73/Q151E3//...//Q151E3//96093547/9863552/684833.bvmi?cCsRPsbcs2zDcvjJhczcvScwcdN6jcdMt" , "URL", 2030, false));

All tests run in less than a second.

xeno6696 commented 8 years ago

The regex backing in the default validation.properties file is this:

^(ht|f)tp(s?)\\:\\/\\/[0-9a-zA-Z]([-.\\w]*[0-9a-zA-Z])*(:(0-9)*)*(\\/?)([a-zA-Z0-9\\-\\.\\?\\,\\:\\'\\/\\\\\\+=&%\\$#_]*)?$

xeno6696 commented 8 years ago

Yeah @sshort the problem is the regex. Where did you get it from?

xeno6696 commented 8 years ago

I have verified that the validation.properties file in ESAPI tag 2.1.0.1 is identical to the one used in esapi tag 2.1.0. The problem is related to the custom regex. Will close as this problem isn't due to ESAPI, but in the regex used.

sshort commented 8 years ago

Not sure - I inherited this project. I thought I tracked it back to ESAPI itself but I shall try it out with the default regex you mention above.

Thanks for the fast responses.

sdey1977 commented 8 years ago

Below regular expression fails to validate the URL http://core-jenkins.scansafe.cisco.com/佐贺诺伦-^ńörén.jpg appropriately. Returns FALSE after around 30911 ms (OS -X El Capitan 10.11.5,core i7 2.2 GHz ) ^(ht|f)tp(s?)\\:\\/\\/[0-9a-zA-Z]([-.\\w]*[0-9a-zA-Z])*(:(0-9)*)*(\\/?)([a-zA-Z0-9\\-\\.\\?\\,\\:\\'\\/\\\\\\+=&%\\$#_]*)?$

This is the default one from validation.properties.

Please advice.

xeno6696 commented 8 years ago

Per the URI specification:

In local or regional contexts and with improving technology, users might benefit from being able to use a wider range of characters; such use is not defined by this specification. Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI.

The URI you've provided uses non-ascii characters and is therefore outside the bounds of RFC-3986. This isn't something we will support.

If you percent-encoded the URL, like this: http%3A%2F%2Fcore-jenkins.scansafe.cisco.com%2F%E4%BD%90%E8%B4%BA%E8%AF%BA%E4%BC%A6-%5E%C5%84%C3%B6r%C3%A9n.jpg

This would meet RFC-3986, and would be supported.

xeno6696 commented 8 years ago

To underline that the use case is unsupported, run the following program:

package org.owasp;

import java.net.URI;

public class URITest {

    public static void main(String[] args) throws Exception {
        // TODO Auto-generated method stub
        URI uri = new URI("http://core-jenkins.scansafe.cisco.com/佐贺诺伦-^ńörén.jpg");
    }

}

You should get the following exception:

Exception in thread "main" java.net.URISyntaxException: Illegal character in path at index 44: http://core-jenkins.scansafe.cisco.com/佐贺诺伦-^ńörén.jpg
    at java.net.URI$Parser.fail(URI.java:2848)
    at java.net.URI$Parser.checkChars(URI.java:3021)
    at java.net.URI$Parser.parseHierarchical(URI.java:3105)
    at java.net.URI$Parser.parse(URI.java:3053)
    at java.net.URI.<init>(URI.java:588)
    at org.owasp.URITest.main(URITest.java:9)

xeno6696 commented 8 years ago

I should learn to read before posting. I missed this part @sdey1977 and I'm sorry.

appropriately. Returns FALSE after around 30911 ms

The problem isn't that we're not supporting an expanded character set, it's that the URI provided is a worst-case input for the regex engine.

xeno6696 commented 8 years ago

Added the following test case to ValidatorTest

assertFalse(instance.isValidInput("test", "http://core-jenkins.scansafe.cisco.com/佐贺诺伦-^ńörén.jpg", "URL", 100, false));

Current execution time:

xeno6696 commented 8 years ago

Switching all the quantifiers in the regex nets us a new result:

diff:

-Validator.URL=^(ht|f)tp(s?)\\:\\/\\/[0-9a-zA-Z]([-.\\w]*[0-9a-zA-Z])*(:(0-9)*)*(\\/?)([a-zA-Z0-9\\-\\.\\?\\,\\:\\'\\/\\\\\\+=&amp;%\\$#_]*)?$
+Validator.URL=^(ht|f)tp(s?+)\\:\\/\\/[0-9a-zA-Z]([-.\\w]*[0-9a-zA-Z])*(:(0-9)*)*(\\/?+)([a-zA-Z0-9\\-\\.\\?\\,\\:\\'\\/\\\\\\+=&amp;%\\$#_]*)?+$

xeno6696 commented 8 years ago

@kwwall we should take this one seriously, even after doing the standard speed up, this bad regex could cause a DoS on any application that relies upon ESAPI for URI validation. I think its worth a point-release by itself.

Is there a reason why we never used the regex directly from the original URI specification? When I play with that one in a regex debugger, I get extremely fast results.

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

xeno6696 commented 8 years ago

Ah, well the first problem with that regex is that it accepts non-ascii characters.

kwwall commented 8 years ago

IMHO, validation of URLs should not really be solely rely on regex in the first place. Expecting to do this accurately and efficiently with regex alone is nuts.

Secondly, where we use regex matching in general, we really probably ought to do something like spin off a separate thread to do it and have the parent thread kill it (and treat it as a failed match) if the regex match doesn't complete with in some small amount of time (which could be flexible). That's about the only sure way I know of to prevent ReDoS vulnerabilities.

-kevin Sent from my Droid; please excuse typos. On Jun 5, 2016 10:48 AM, "Matt Seil" notifications@github.com wrote:

Ah, well the first problem with that regex is that it accepts non-ascii characters.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ESAPI/esapi-java-legacy/issues/376#issuecomment-223817162, or mute the thread https://github.com/notifications/unsubscribe/AB3nmzF1uSF4hpMDfKXVJQq_rU7RoaoSks5qIuHAgaJpZM4IhN8s .

kwwall commented 8 years ago

And yes, I know you really aren't supposed to create threads from within a JavaEE managed container, bit what else can you do? I can't think of any other approach that might work.

-kevin Sent from my Droid; please excuse typos. On Jun 5, 2016 11:07 AM, "Kevin W. Wall" kevin.w.wall@gmail.com wrote:

IMHO, validation of URLs should not really be solely rely on regex in the first place. Expecting to do this accurately and efficiently with regex alone is nuts.

Secondly, where we use regex matching in general, we really probably ought to do something like spin off a separate thread to do it and have the parent thread kill it (and treat it as a failed match) if the regex match doesn't complete with in some small amount of time (which could be flexible). That's about the only sure way I know of to prevent ReDoS vulnerabilities.

-kevin Sent from my Droid; please excuse typos. On Jun 5, 2016 10:48 AM, "Matt Seil" notifications@github.com wrote:

Ah, well the first problem with that regex is that it accepts non-ascii characters.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ESAPI/esapi-java-legacy/issues/376#issuecomment-223817162, or mute the thread https://github.com/notifications/unsubscribe/AB3nmzF1uSF4hpMDfKXVJQq_rU7RoaoSks5qIuHAgaJpZM4IhN8s .

xeno6696 commented 8 years ago

That's exactly what I wanted to hear. I started writing code to add a URI validation method to the default validator class, my plan is to delegate validation ultimately to the java.net.URI class, but then also get some intrusion detection by canonicalizing all the sub-parts of the URI. This will actually close several older issues in the ESAPI backlog.

Once you get the canonicalized URL string, then you can use far simpler regexs against it to further punt out things we don't like. (Like a whitelist of schemes.)

This will more or less remove regex as any kind of a bottleneck and avoid spinning up threads in a J2EE environment.

xeno6696 commented 8 years ago

@kwwall

I wanted to let you know, I wasn't aware that the class UriBuilder wasn't a stock java class... it's part of javax.ws.rs.core.UriBuilder which is only from J2EE 6+ and not part of any [default] javax package, which would make sense from well... a language that dominates enterprise web applications. Kinda disappointed....

I would like to use that class to construct a canonicalized URI, however I'm going to have to handroll the construction of the canonicalized URL string in order to prevent any extra dependencies. No big deal... just more work than I'm used to. (It could be worse... it could be C...)

sshort commented 8 years ago

Hi guys,

what's the forecast for a fix on this? Should we look at other solutions as this is a production issue for us?

Thx Steve

xeno6696 commented 8 years ago

If it's a prod issue, then alter the regex to something that makes sense for your application(s) in question that works faster. The problem here ultimately isn't ESAPI, it's java's regex engine and the fact that you found a worst-case input for the regex we chose. Really, great job on that!

That said, the path I'm going down will NOT use the same API, so even if I pushed a fix today, you would have to rewrite all of your applications to use the new API. URLs have always been a special case and they really should never have been treated with regexs as the primary solution.

@kwwall what would you say about switching the default regex to ^(ht|f)tp(s?)\\p{Print}*$ which will limit to only http/s, ftp/s, and printable ASCII characters? As long as mixed and multi encoding is turned on, the input should still get scanned by canonicalize after the validations are completed, which is really the only other threat.

kwwall commented 8 years ago

I'd have to think about it, and get feedback from the user community. This regex exists in other places too, like the antisamy-esapi.xml file. I'm reluctant to change the default, especially to something more liberal as that has potential security implications. (The potential ReDoS issue does too, but it's far from clear that every use of Validator.URL is exploitable in that manner.) Personally, at this point, I'm more inclined to not do that, but just add a comment referring to this issue in validation.properties file until we have that better, non-regex API that you are working on. Also, keep in mind, it is expected that developers tweak validation.properties. That's probably the route I'd currently recommend for @sshort.

@planetlevel & @manicode - Care to weigh in? I think this is mostly code you guys wrote.

-kevin Sent from my Droid; please excuse typos. On Jun 8, 2016 10:52 AM, "Matt Seil" notifications@github.com wrote:

If it's a prod issue, then alter the regex to something that makes sense for your application(s) in question that works faster. The problem here ultimately isn't ESAPI, it's java's regex engine and the fact that you found a worst-case input for the regex we chose. Really, great job on that!

That said, the path I'm going down will NOT use the same API, so even if I pushed a fix today, you would have to rewrite all of your applications to use the new API. URLs have always been a special case and they really should never have been treated with regexs as the primary solution.

@kwwall https://github.com/kwwall what would you say about switching the default regex to ^(ht|f)tp(s?)\p{Print}*$ which will limit to only http/s, ftp/s, and printable ASCII characters? As long as mixed and multi encoding is turned on, the input should still get scanned by canonicalize after the validations are completed, which is really the only other threat.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ESAPI/esapi-java-legacy/issues/376#issuecomment-224614421, or mute the thread https://github.com/notifications/unsubscribe/AB3nm6tFP2HefErzg0oYd8B1ZF2zFG12ks5qJtcSgaJpZM4IhN8s .

kwwall commented 8 years ago

Oh, forgot to mention concern if people have it mixed / multi-encoding disabled. The new proposed regex itself has no way of "knowing" that they should be on. Just not sure how common it is to disable them.

-kevin Sent from my Droid; please excuse typos. On Jun 8, 2016 10:52 AM, "Matt Seil" notifications@github.com wrote:

If it's a prod issue, then alter the regex to something that makes sense for your application(s) in question that works faster. The problem here ultimately isn't ESAPI, it's java's regex engine and the fact that you found a worst-case input for the regex we chose. Really, great job on that!

That said, the path I'm going down will NOT use the same API, so even if I pushed a fix today, you would have to rewrite all of your applications to use the new API. URLs have always been a special case and they really should never have been treated with regexs as the primary solution.

@kwwall https://github.com/kwwall what would you say about switching the default regex to ^(ht|f)tp(s?)\p{Print}*$ which will limit to only http/s, ftp/s, and printable ASCII characters? As long as mixed and multi encoding is turned on, the input should still get scanned by canonicalize after the validations are completed, which is really the only other threat.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ESAPI/esapi-java-legacy/issues/376#issuecomment-224614421, or mute the thread https://github.com/notifications/unsubscribe/AB3nm6tFP2HefErzg0oYd8B1ZF2zFG12ks5qJtcSgaJpZM4IhN8s .

planetlevel commented 8 years ago

Agree that people who are concerned about this can tweak validation.properties. I thought Matt’s suggestion to make the default regex even more restrictive was fine. Particularly with a comment that says be careful of ReDoS if you loosen this.

--Jeff

From: "Kevin W. Wall" notifications@github.com Reply-To: ESAPI/esapi-java-legacy reply@reply.github.com Date: Wednesday, June 8, 2016 at 11:34 AM To: ESAPI/esapi-java-legacy esapi-java-legacy@noreply.github.com Cc: Jeff Williams jeff.williams@contrastsecurity.com, Mention mention@noreply.github.com Subject: Re: [ESAPI/esapi-java-legacy] Infinite or very long loop in URL validation URL (#376)

I'd have to think about it, and get feedback from the user community. This regex exists in other places too, like the antisamy-esapi.xml file. I'm reluctant to change the default, especially to something more liberal as that has potential security implications. (The potential ReDoS issue does too, but it's far from clear that every use of Validator.URL is exploitable in that manner.) Personally, at this point, I'm more inclined to not do that, but just add a comment referring to this issue in validation.properties file until we have that better, non-regex API that you are working on. Also, keep in mind, it is expected that developers tweak validation.properties. That's probably the route I'd currently recommend for @sshort.

@planetlevel & @manicode - Care to weigh in? I think this is mostly code you guys wrote.

-kevin Sent from my Droid; please excuse typos. On Jun 8, 2016 10:52 AM, "Matt Seil" notifications@github.com wrote:

If it's a prod issue, then alter the regex to something that makes sense for your application(s) in question that works faster. The problem here ultimately isn't ESAPI, it's java's regex engine and the fact that you found a worst-case input for the regex we chose. Really, great job on that!

That said, the path I'm going down will NOT use the same API, so even if I pushed a fix today, you would have to rewrite all of your applications to use the new API. URLs have always been a special case and they really should never have been treated with regexs as the primary solution.

@kwwall https://github.com/kwwall what would you say about switching the default regex to ^(ht|f)tp(s?)\p{Print}*$ which will limit to only http/s, ftp/s, and printable ASCII characters? As long as mixed and multi encoding is turned on, the input should still get scanned by canonicalize after the validations are completed, which is really the only other threat.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ESAPI/esapi-java-legacy/issues/376#issuecomment-224614421, or mute the thread https://github.com/notifications/unsubscribe/AB3nm6tFP2HefErzg0oYd8B1ZF2zFG12ks5qJtcSgaJpZM4IhN8s .

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/ESAPI/esapi-java-legacy/issues/376#issuecomment-224628057, or mute the threadhttps://github.com/notifications/unsubscribe/AClCzEuTGUiqNg7VgKYO481KEgLT0sP8ks5qJuD2gaJpZM4IhN8s.

xeno6696 commented 8 years ago

Well @planetlevel , technically speaking my suggestion would be a little more permissive in terms of a regex validation. The current regex we use attempts some level of RFC-3986 compliance, but the level of backtracking it invokes is obviously the culprit.

With Java, there's only 2 tricks you can use without rewriting the regex from scratch, The first is to make all greedy quantifiers into possessive. That cut the processing time on the regex as described above. But it still took nearly a minute, which is unacceptable. The second trick was to make all capturing groups into non-capturing groups. That didn't budge the needle at all.

I'm halfway tempted to open a bug with Oracle... but I'm kind of sure that they're not going to do anything to fix their regex engine.

As I see it, the problem is that a regex DoS is now discovered against the default ESAPI config, and that is worse than a more permissive regex, especially when we should intend to discourage the use of regex for URI validation in the first place. I've already placed the URI in question into my automated fuzzing arsenal for use against other applications. So my suggestion of ^(ht|f)tp(s?)\\p{Print}*$ is actually far more permissive than the original, because the original at least attempts to force some basic URI structure.

Maybe better, we comment out the line and replace it with Validator.URL=deprecated: read above comment where we supply the more permissive regex with warnings about the correct way to validate URIs.

kwwall commented 8 years ago

@xeno6696 is right about the more permissive nature. That's what I alluded to and mentioned you (although it appears that I got Manico's GitHub user id wrong).

I like Matt's idea about commenting this out the default (i.e., what we have now) regex for Validator.URL and add something like what he mentioned, but I really think we should do something even more to make developers stand up and take notice. After all, if they are only using Validator.URL rarely, they may not sit up and take notice when all their comparisons to valid URLs start failing. I propose we make it into a string that is not even a valid regex so that it won't even compile the regex and Pattern.compile() ends up throwing a PatternSyntaxException. E.g., I was thinking of something like this:

Validator.URL=)-:Deprecated: see comments for Validator.URL in validation.properties file.:-(

I think that should fail to compile because of unbalanced ()s. Plus we get the sad faces as an added bonus! Of course another question is what do do about that same regex that is being used by AntiSamy in antisamy-esapi.xml, where it is used for "offsiteURL". I'm not even sure that is being used by ESAPI, but if that's a standard regex that AntiSamy provides and ESAPI just hijacked it, sounds as though AntiSamy is going to have the same potential ReDoS issue.

xeno6696 commented 8 years ago

Out of 2000 unit tests, I'm down to getting the last 20 or so to pass. Just an update.

kwwall commented 8 years ago

I know we had no where near 2000 unit tests to validate URLs, so are you saying that you created ~2k new URL-like patterns to test against? If so, I'm realy impressed.

-kevin

xeno6696 commented 8 years ago

@kwwall, well, technically no... mockaroo.com created the 2k URL patterns. But I did write the basic parameterized unit test so that all someone has to do in the future is shove in a new URL with a true/false expected validation. I'll be pushing a commit shortly.

xeno6696 commented 8 years ago

So the final default URL I settled on was this: Validator.URL=^(?:ht|f)tp(?:s?)(?:[:A-Za-z0-9%/#?&.=-]*)$

It's quite a bit more restrictive than ^(ht|f)tp(s?)\\p{Print}*$ but it's a heck of a lot simpler now that we don't have to rely solely on a regex to enforce URL syntax validation. It's human-readable now.

Obviously we'll have to decide what to do finally for the regex in release.

xeno6696 commented 8 years ago

Not quite sure how the commits got separated into two separate changesets, but everything builds so I'm not going to touch it.

sdey1977 commented 8 years ago

Can I request your view on the below approach please -

      //URI construction will fail due to non US-ASCII chars,  but URL will pass
      URL testUrl = new URL("http://core-jenkins.scansafe.cisco.com/佐贺诺伦-^ńörén.jpg");
      //URLEncoded form of the problematic URI below
      URI testUri = new URI("http%3A%2F%2Fcore-  jenkins.scansafe.cisco.com%2F%E4%BD%90%E8%B4%BA%E8%AF%BA%E4%BC%A6-%5E%C5%84%C3%B6r%C3%A9n.jpg");

       UriBuilder uriData = null;

        if(null != testUri.getScheme()){ // evaluates to null for the above case
            uriData = UriBuilder.fromUri(enc.canonicalize(testUri.getScheme()));
            uriData.path(enc.canonicalize(enc.canonicalize(testUri.getAuthority() + testUri.getPath())));
            System.out.println(uriData.build().toString());
        }else{
            uriData = UriBuilder.fromUri(enc.canonicalize(testUrl.getProtocol()));
            uriData.path(enc.canonicalize(enc.canonicalize(testUrl.getAuthority() + testUrl.getPath())));
        }
        List<org.apache.http.NameValuePair> params = URLEncodedUtils.parse(testUri, "UTF-8");
        Iterator<org.apache.http.NameValuePair> it = params.iterator();
        while(it.hasNext()) {
          org.apache.http.NameValuePair nValuePair = it.next();
          uriData.queryParam(enc.canonicalize(nValuePair.getName()),        enc.canonicalize(nValuePair.getValue()));
        }
        String printableUrl = uriData.build().toString();

My question is, if I get everything scanned through enc.canonicalize(), what all the possibilities(threats) do I eliminate ? I presume, I'd still need to run the individual canonicalized parts through some REGEX validation .. your thoughts on this please ?

Regards, Dey

xeno6696 commented 8 years ago

My question is, if I get everything scanned through enc.canonicalize(), what all the possibilities(threats) do I eliminate ? I presume, I'd still need to run the individual canonicalized parts through some REGEX validation .. your thoughts on this please ?

enc.canonicalize(String s); buys you a scanner to protect against mixed encoding and multiple encoding attacks. However, canonicalizing any input twice like you do here:

uriData.path(enc.canonicalize(enc.canonicalize(testUri.getAuthority() + testUri.getPath())));

Doesn't buy you anything. If you're doing that to decode the string, you lose the chance to detect multiple encodings.

Also, your input:

URI testUri = new URI("http%3A%2F%2Fcore- jenkins.scansafe.cisco.com%2F%E4%BD%90%E8%B4%BA%E8%AF%BA%E4%BC%A6-%5E%C5%84%C3%B6r%C3%A9n.jpg");

Is invalid in the first place. When I run that in a test program I get:

Exception in thread "main" java.net.URISyntaxException: Illegal character in path at index 18: http%3A%2F%2Fcore-  jenkins.scansafe.cisco.com%2F%E4%BD%90%E8%B4%BA%E8%AF%BA%E4%BC%A6-%5E%C5%84%C3%B6r%C3%A9n.jpg
    at java.net.URI$Parser.fail(URI.java:2848)
    at java.net.URI$Parser.checkChars(URI.java:3021)
    at java.net.URI$Parser.parseHierarchical(URI.java:3105)
    at java.net.URI$Parser.parse(URI.java:3063)
    at java.net.URI.<init>(URI.java:588)
    at org.owasp.URITest.main(URITest.java:9)

But even if it were to pass URI parsing, the spaces make it a doubly-encoded input, because you've got percent-encoded parts in it.

The main danger with what you're doing is that if your URI has a query, canonicalize will for example transform ?&param1=foo&param2=bar into ¶m1=foo¶m2=bar which contains invalid URI characters. That's why in my implementation I treated the query separately.

xeno6696 commented 8 years ago

I missed part of your question:

I presume, I'd still need to run the individual canonicalized parts through some REGEX validation .. your thoughts on this please ?

Yes, there should be some kind of further restriction on the URL. What that is however, is going to be highly context-dependent on your application.

sdey1977 commented 8 years ago

As for, "But even if it were to pass URI parsing, the spaces make it a doubly-encoded input, because you've got percent-encoded parts in it." I think the spaces got introduced while I pasted the code from eclipse into the textarea here (hence the failure at char 18).

"The main danger with what you're doing is that if your URI has a query, canonicalize will for example transform ?&param1=foo&param2=bar into ¶m1=foo¶m2=bar which contains invalid URI characters. That's why in my implementation I treated the query separately." Yes I faced this as well.. is it possible to share your implementation or some reference code please ?

xeno6696 commented 8 years ago

Check the commits to this baseline I did yesterday. The changes haven't been reviewed by @kwwall yet, and I suspect I might have some cleanup to do, but the code is available.

sdey1977 commented 8 years ago

@xeno6696 , if you could kindly have a quick look at this and share your views - The below link suggests that if my interpreter is Browser(HTML content.. just printing the URL on the page) then escaping/encoding non US-ASCII is a safe way to render the text.
https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet#RULE_.232_-_Attribute_Escape_Before_Inserting_Untrusted_Data_into_HTML_Common_Attributes

_What if I escape the URL parts as below - {PROTOCOL=http, HOST=core-jenkins.scansafe.cisco.com, PORT=-1, AUTHORITY=core-jenkins.scansafe.cisco.com, QUERY=, PATH=/佐贺诺伦-^ńörén.jpg}

boolean isAscii = CharMatcher.ASCII.matchesAllOf(urlPart); if(!isAscii){ if(seg == URLSegment.AUTHORITY && null != parseMap.get(seg)) value = encoder.canonicalize(IDN.toASCII(urlPart)); // for URLs like http://правительство.рф/.. if(seg == URLSegment.PATH && null != parseMap.get(seg)) urlPart = encoder.encodeForHTML(urlPart); //PATH=/佐贺诺伦-^ńörén.jpg }else{ //regular US-ASCII urlPart = encoder.canonicalize(parseMap.get(seg), allowMultiple, allowMixed); }

Is this a safe way to render non US-ASCII chars ? Please note that my intention here is not to validate a URL but to render a URL containing non US-ASCII safely on a browser (view only).

xeno6696 commented 8 years ago

If your goal is to safely render a URL for a browser, you should be escaping the output, it should (sort of ) look like this:

//If the target is an attribute:
<a href="${uri}"/>

then escape it like this:

encoder.escapeForHtmlAttribute(uri);

if the target output is link text:

<a href="foo">${uri}</a>
//escape with
encoder.escapeForHTML(uri);

That's all you actually need to do. Ideally, perform the escaping as close to the page as possible, meaning if you're in a JSP use the esapi taglibs or the taglibs provided by the Encoder project.

xeno6696 commented 8 years ago

@kwwall I'd like to close this out, but a change like this really does require a review. :-)

sdey1977 commented 8 years ago

@xeno6696 Thanks for your response. However, it's not even a HREF .. let's say I just want to put it on the page within a < p > < /p > To be precise, what I wanted to understand is, if "urlPart" in the below snippet can be considered safe for rendering like plain text .. urlPart = encoder.encodeForHTML("/佐贺诺伦-^ńörén.jpg");

xeno6696 commented 8 years ago

If you're just displaying a naked text URL (not providing an anchor tag) then yes, you would use escapeForHTML

kwwall commented 8 years ago

For the record, it were an 'href' attribute for , a 'src' attribute for , etc. ... any attribute that behaves as a URL, then you want to use URL encoding, not just simply HTML attribute output encoding. Otherwise XSS possibilities still exist. Bit since you plan on just displaying it in normal HTML context (e.g., between HTML tags, not in Javascript or CSS, etc.), then you can just use ESAPI.encoder().encodeForHtml(yourUriHere) to do the output encoding.

@xeno6696 will try to review your code this wkend.

-kevin Sent from my Droid; please excuse typos. On Jun 24, 2016 11:36 AM, "sdey1977" notifications@github.com wrote:

@xeno6696 https://github.com/xeno6696 Thanks for your response. However, it's not even a HREF .. let's say I just want to put it on the page within a

...

To be precise, what I wanted to understand is, if "urlPart" in the below snippet can be considered safe for rendering like plain text .. urlPart = encoder.encodeForHTML("/佐贺诺伦-^ńörén.jpg");

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ESAPI/esapi-java-legacy/issues/376#issuecomment-228378892, or mute the thread https://github.com/notifications/unsubscribe/AB3nm7wI-3H-bogqZTHDjIPdxJaBziZoks5qO_l8gaJpZM4IhN8s .

xeno6696 commented 8 years ago

Received your comments, will implement ASAP.

tjain04 commented 5 years ago

Do we have any release planned soon with this fix. ESAPI 2.1.0.1 does not have this fix.

kwwall commented 5 years ago

Release 2.2.0.0 is planned RSN. Just having some maven struggles and doing some final code cleanup. Unfortunately, Dependency Check now seems to be broken. AFAICT, it's nothing that I did; I plan up updating ESAPI to use the latest version of Dep Ck. We're using 4.0.1. I'm going to try 5.0.0-M2 and if that doesn't fix it, I will contact the Dep Ck team. Join the new OWASP ESAPI Project Users Google group that was recently advertised as we will probably announce it first there. -kevin P.S.- In the meantime, if you are desperate, download it from GitHub, build it, and use the latest from the 'develop' branch where it is fixed.

ESAPI / esapi-java-legacy

Infinite or very long loop in URL validation URL #376