Default HTTPHeaderValue validation regex incompatible with new Chrome headers

amtunlimited commented 3 years ago

A new request header rolling out to Chrome stable, the 'sec-ch-ua header that is part of the User Agent Client Hints fail the current default HTTPHeaderValue validation regex. Specifically, the current regex blocks headers with escaped backslashes and double quotes (escaped and unescaped).

The following regex:

^([a-zA-Z0-9()\-=\*\.\?;,+\/:&_" ]|\\\\|\\")*$

Seems to fit the following string:

"\\Not;A\"Brand";v="99", "Google Chrome";v="85", "Chromium";v="85"

I'm not sure how to test, manually or automatically, otherwise I'd post a PR.

I know there's another issue about the new headers but this one seems like a different enough problem to warrant a separate bug.

miketaylr commented 3 years ago

cf. https://github.com/ESAPI/esapi-java-legacy/issues/574 for the multiple encoding issue.

xeno6696 commented 3 years ago

@amtunlimited this regex doesn't break any regression unit tests, but it has two bugs:

Not really a bug I guess, but you're using a capturing group for no reason. Eliminate the parentheses.
Because of your use of OR statements your regex isn't doing what you think it's doing.

It's passing for you because since a match isn't found in the first segment of the regex it then attempts to ONLY match a double backslash. This would allow as valid input \\<script>alert(1);</script>

xeno6696 commented 3 years ago

Google's choice to include backslashes is making this trip up before it even reaches the regex matching section. I tried testing with the regex ^[\a-zA-Z0-9()\-=\*\.\?;,+\/:&_" ]*$ but we fail a mixed encoding check by including the backslashes.

@amtunlimited can you confirm what error you were originally receiving before you played with the regex?

amtunlimited commented 3 years ago

can you confirm what error you were originally receiving before you played with the regex?

Unfortunately I can't. I'm actually a developer from Chrome, relaying this bug here

We've seen a couple of bugs come in recently regarding user agent client hints and ESAPI (which is somewhat surprising, given the roll out has be going for a little while now), and we'd like to make sure this is fixed for as many folks as possible. Especially because we plan on using more headers in this format (see Stuctured Headers)

I might try get a test server running. If you have any documentation regarding that, let me know.

xeno6696 commented 3 years ago

I see precisely why you thought it was a regex bug.

If you clone esapi-java-legacy and can succesfully run "mvn clean test" we should be able to team- it through the unit test "ValidatorTest."

Please don't yell at me for how some of these test classes are organized... they need to be brought up to at least 2015.

I can't explain why you're suddenly seeing this now, this behavior should have popped up super early in testing. Actually it seems our unit testing here is pretty light. (Not the first time I've run into it here.)

In ValidatorTest.testGetHeader() I added:

request.addHeader("f3", "\"\\\\Not;A\\\"Brand\";v=\"99\", \"Google Chrome\";v=\"85\", \"Chromium\";v=\"85\"");
       assertFalse(safeRequest.getHeader("f3").equals("\"\\\\Not;A\\\"Brand\";v=\"99\", \"Google Chrome\";v=\"85\", \"Chromium\";v=\"85\""));

At the moment I'm leaving the regex alone, hence why you're seeing the assertFalse. The exception getting thrown is because the canonicalize method is getting called here and my money is that because of the multiple backslashes we're triggering the multiple encoding from the JavaScript codec.

Scratch that. We're getting 2x encoding no matter which codec I isolate against...

xeno6696 commented 3 years ago

Okay, figured out that the DefaultEncoder class hardcodes the codecs used and does not obey the configuration in ESAPI.properties. Special thanks to @jeremiahjstacey for keeping me sane.

I've definitely narrowed down the double encoding check being the root cause for both bugs. In short, legacy logic runs each codec twice, if there is any change in the output on the second pass we assume someone's monkeying and throw the IntrusionException.

The part where I'm at a loss for though is how to handle this. @kwwall I wouldn't mind your input.

This isn't a bug on our part exactly, but it's a case where it appears the new Google specification for this header is incorporating backslashes in a matter the original designers didn't anticipate.

My gut says that we (as in chrome) should consider percent-encoding the backslashes, as they are used as a character escape in every C-inspired language that I can think of. @amtunlimited what do you think here?

Assuming this is a no-go... Outside of adding a new method to get this special header from the SafeRequest class, I can't think of any elegant solution to this problem. The only other options are adding backslashes to the immune char list to JavaScriptCodec (thus compromising its entire purpose) or in advising configurations to turn off multiple encoding which is also a very bad idea.

miketaylr commented 3 years ago

This isn't a bug on our part exactly, but it's a case where it appears the new Google specification for this header is incorporating backslashes in a matter the original designers didn't anticipate.

My gut says that we (as in chrome) should consider percent-encoding the backslashes, as they are used as a character escape in every C-inspired language that I can think of. @amtunlimited what do you think here?

This is invaluable feedback, thank you.

xeno6696 commented 3 years ago

I did some initial testing for fun, I percent-encoded the entire header value:

%5C%5CNot%3BA%5C%22Brand%22%3Bv%3D%2299%22%2C%20%22Google%20Chrome%22%3Bv%3D%2285%22%2C%20%22Chromium%22%3Bv%3D%2285%22

We still get both multi and mixed encoding exceptions here. Not a quick win.

kwwall commented 3 years ago

@xeno6696 wrote:

The part where I'm at a loss for though is how to handle this. @kwwall I wouldn't mind your input.

First off, I agree with you that this really isn't an ESAPI issue. The DefaultEncoder implementation is intentionally encoded in such a way that it tries to protect against multiple encoding attacks and I have no intention of changing that. (If someone wants to open a new GitHub issue to request a new feature as to whether or not ESAPI's DefaultEncoder should honor the "Encoder.DefaultCodecList" property, I am certainly willing to discuss that. If so, refer to this issue so we can link the two. But I do NOT want to consider that here. As @xeno6696 said; this is not an ESAPI bug. (If anything, I think the 'bug' is that "Encoder.DefaultCodecList" should really have been called "Validator.DefaultCodecList".) Doing output encoding of Request headers is not something that one would do anyway and doing them blindly for all headers -- I think that's completely inappropriate as XSS defense when used as a JavaEE servlet filter or Spring interceptor, etc. Normally, the only time you would blindly doing output encoding of all HTTP request headers would be if you had something like a save 'snoop' servlet used for troubleshooting that you wanted to behave as a safe replacement for something like the HTTP TRACE request method. Surely if you want to echo back all the HTTP request headers, then you want to output encode them. But even then, probably not blindly. (And if you are doing that, you would also need to prevent HTTP header injection type attacks like HTTP Response Splitting, which ESAPI's Encoder does not do.)

Secondly, IMO, I think altering the HTTPHeaderValue regular expression is also the wrong path to take, especially regarding the one that was suggested in this GitHub issue. We cannot be special-casing all the possible oddball cases.

Thirdly, to the Google Chrome team, I would suggest replacing the '\' characters with something that don't need to be encoded at all, such as '-', '_', or '.'. Unless I am missing something, I don't think there is any reason to use a backslash here.

Lastly, to @xeno6696, I think we should mark this as 'wontfix' (unless we intend on changing the property name, which I don't recommend because that could break backward compatibility or we only intend on clarifying our documentation both in Javadoc and for the "Encoder.DefaultCodeList" property in the ESAPI.properties file.

xeno6696 commented 3 years ago

@kwwall I found a bug in the default regex:

^[a-zA-Z0-9()\\-=\\*\\.\\?;,+\\/:&_ ]*$ contains an error in that we never escaped the - symbol, so the regex engine thinks it's supposed to be a range like a-z and since none is given it silently fails. I'm... a little irritated that Oracle's Pattern compiler doesn't at least provide a warning about this behavior.

xeno6696 commented 3 years ago

Created issue #579 to deal with the regex bug.

mnot commented 3 years ago

sec-ch-ua isn't the last header that will be like this, because Structured Fields is regarded by many as the default for new HTTP fields.

However, note that SF only allows and uses backslash escaping for two characters: " and \. A backslash followed by any other character is invalid. Furthermore, this only occurs in the sf-string rule.

Could ESAPI make a limited exception for these escape sequences? If allowing \ in unknown payload data is judged to be a security risk, it might be reasonable to limit the exception to " (which is the primary use case, and I believe what you're hitting with sec-ch-ua).

xeno6696 commented 3 years ago

I think it's pretty clear then that this would be a new feature request. I wasn't aware of the structured-field RTF. I guess that makes sense as it was just published yesterday. [Obviously I wasn't consulted... c'mon guys!] ;-)

Could ESAPI make a limited exception for these escape sequences? If allowing \ in unknown payload data is judged to be a security risk, it might be reasonable to limit the exception to " (which is the primary use case, and I believe what you're hitting with sec-ch-ua).

Not as written. The logic we use is extremely simple and lacks real intelligence. Basically we walk through an input char by char codec by codec and count transformations. If two codecs change an input, mixed exception. If Two passes with the same codec change an input, multi exception. By default we walk through 3 codecs, so six passes and then throw exceptions at the end. What makes it hard is that the Codec that will be tripping up in our use case is the JavaScript codec. The backslash will trigger javascript escape warnings as neither it nor the calling methods are aware that they're being used in the context of an HTTP header.

I need to think on it but I don't think there's a way out of this for us but with a rewrite using grammar-based parsers.

kwwall commented 3 years ago

@xeno6696 - That would likely be a major (albeit, fun) rewrite and while I wouldn't be against that for ESAPI 3, my fear is that what every we use do do this would drag in a bunch of additional transitive dependencies. IIRC, were are already close to 30 which I think is a lot for such a small library.

amtunlimited commented 3 years ago

I'm guessing the encoder's problem is the \\ being turned into \ and the encoder reducing that again for its multiple encoding check.

But yes, this probably won't be the only Structured Header. It would also put you more in line w/ RFC-7320 section 3.2.6, which allowed for escaped (here called "quoted pairs") backslashes and double quotes.

If I'm right about the multiple encoding, here's a half-baked idea: make a structured header encoder (only checking for escaped that "supersede's" the JS encoder. The workflow might go like this:

Run htmlCodec
  If change, run again for multi
Run percentCodec, 
  If change, run again for multi
Run structuredHeaderCodec
if that worked:
  run again for multi
else:
  Run javascriptCodec, run again for multi

You could add a "run special canonicalize" property that would run unrolled loop above if there was no config.

xeno6696 commented 3 years ago

I thank you for taking the time to take a crack at it @amtunlimited but the way the library was designed makes this more difficult. I don't want to TLDR you, so in short, by the time we're analyzing mixed and multi encoding it's far too late to be thinking about the kind of data we're being presented. That detection has to be abstract enough to handle codecs in any order and therefore creating a "chain" of codecs that rely on execution order breaks the contract. But because there's a collision between character escapes between structured fields and javascript, I don't think this can be handled without significant rewriting of this part of the code.

So the problems are this:

Both Javascript and structured fields use backslash escapes. Chaining these together in any direction will trigger mixed encoding exceptions. This was never a problem before because the library defaulted to only chaining HTML, Percent, and Javascript which don't share escape syntax. Further, the actual backslash character \ is legal in structured headers, so it isn't encoded, so unlike say &gt and > there will be no string transformation to record in the existence of a "structured header codec." Because of that, the backslash will always exist and \\ will trigger an encoding exception. This is where the standards writers should have done something safe like enforcing a percent-encoding for characters that can be interpreted downstream. Or a python/Javascript-like byte representation like \x5C.
We can't remove Javascript from the config, both for that dreaded backwards compatibility as well as the fact that the three encodings already discussed are the primary encodings hackers play with to bypass XSS filtering.

This rules out your suggestion pretty strongly I think.

I'm not throwing in the towel yet, I might think of something over the weekend.

amtunlimited commented 3 years ago

Further, the actual backslash character \ is legal in structured headers, so it isn't encoded, so unlike say &gt and > there will be no string transformation to record in the existence of a "structured header codec."

Small correction here: According to section 3.3.3-6 of the spec, A backslash without a corresponding backslash or dquote is not legal and should cause parsing to fail. Also, backslashes should only show up inside of a quoted strings, although I don't know if that helps you.

So the big issue here is that the Structured Header grammar as described here is a subset of the javascript codec. hmm...

Maybe we can shift the problem down a layer? Could a javascript/structured header codec be made? It tries the Structured header codec, but if it sees a backslash with a character other the ones allowed then it falls back and restarts with the javascript codec? I will admit that also seems washy...

I'll agree, it's a hard problem.

amtunlimited commented 3 years ago

Sidestepping a little bit, you said the real issue is people using DefaultEncoder for HTTP header values when they shouldn't be, yeah? Maybe the better approach is just to make a HTTPHeaderEncoder? I don't know the policy around reference implementations so it may sound like a silly idea.

Sidestepping the issue even more, people are going to have to upgrade the library then there's going to be friction anyways. Maybe the best interest is to:

go the "honor the "Encoder.DefaultCodecList" property" route mentioned by @kwwall (I'd be happy to open a new feature bug if this seems reasonable)
Implement a Structued header codec (a copy-paste of the JS one with some trimming and hopefully not much more work)
Make a comment in the config file like "If your using this encoder for headers, use this list of codecs instead, or better yet don't use this encoder" (and update the relevant documentation to match)
Get the word out to update and shift.

miketaylr commented 3 years ago

@xeno6696 it would be useful for us to test variations of structured headers against a server running ESAPI. Does such a thing exist? One possible path forward is for us to remove the escaped \\ and \" from the header, but it's not 100% clear to me if the unescaped quotes are going to be problematic even after doing that.

from: "\\Not;A\"Brand";v="99", "Google Chrome";v="85", "Chromium";v="85" to: "Not;A Brand";v="99", "Google Chrome";v="85", "Chromium";v="85"

(Maybe that's not a realistic ask, since it seems some folks are sanitizing headers incorrectly - maybe like so? https://github.com/zhanjindong/security-demo/blob/e71c178fe173fd28723466b596d99ead4f2fb3da/src/main/java/xss/EsapiRequestWrapper.java#L48-L58)

amtunlimited commented 3 years ago

FYI for the thread: we've temporarily rolled back the compatibility-causing-characters from the Sec-CH-UA header in order to unblock the wider User Agent Client Hints roll out. You should see the following value for the Sec-CH-UA in Chrome 87.0.4280.88 and above:

"Google Chrome";v="87", " Not;A Brand";v="99", "Chromium";v="87"

I believe there's still an issue with the regex not supporting quotes and possibly some other characters, but hopefully this will resolve the encoding issue. If someone could confirm this for me, that would be lovely. If this is the case, the other customers should be one regex-adjustment away!

We would like to roll the characters back in at some point (there are plans to use structured headers in more places), so if there's anything I can do to help on that front please drop me an email.

gav28uk commented 3 years ago

We're still seeing an issue with an application that we use even with the rolled backed user agent strings, this is running latest Chrome 87.0.4280.88:

"Google Chrome";v="87", " Not;A Brand";v="99", "Chromium";v="87"

The application is rejecting the login with the following errors:-

2020-12-03 11:13:08,531 [[ACTIVE] ExecuteThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)'] ERROR com.common.utils.Security:241 - sec-ch-ua has unacceptable character(s) in value ["Google Chrome";v="87", " Not;A Brand";v="99", "Chromium";v="87"]

So I would suggest the Regex is still failing :-( We're having to advise the user community to use Edge or Firefox. They don't have local PC admin access so can't turn off this feature when they are part of the field trial :-(

kwwall commented 3 years ago

Do any of you know in which specific context this is popping up in? If it is cropping up in a ESAPI provided JavaEE servlet filter, we do have some liberties there. E.g., we could either pass in a header ignore-list or allow a custom class to be specified that we could invoke via a callback that would be responsible for processing a specified header list to handle these structured headers. Of course, it it is in a bunch of custom code that it is popping up that is just using ESAPI DefaultEncoder or similar classes, we probably are more limited in terms of what we can do in terms of simple extensions / bug fixes.

kwwall commented 3 years ago

I just took another look at DefaultEncoder. While the usual use of it is via ESAPI.encoder() which eventually calls DefaultEncoder.getInstance() which ultimately is intended to enforce a singleton (fail!), there is in fact 2 public CTORs. The default CTOR which the static getInstance() invokes is the one that sets up the ArrayList of the 3 ESAPI Codecs: HTMLEntityCodec, PercentCodec, and JavaScriptCodec. However, there is also a 2nd public DefaultEncoder CTOR that looks like this:

    public DefaultEncoder( List<String> codecNames ) {
        for ( String clazz : codecNames ) {
            try {
                if ( clazz.indexOf( '.' ) == -1 ) clazz = "org.owasp.esapi.codecs." + clazz;
                codecs.add( Class.forName( clazz ).newInstance() );
            } catch ( Exception e ) {
                logger.warning( Logger.EVENT_FAILURE, "Codec " + clazz + " listed in ESAPI.properties not on classpath" );
            }
        }
    }

So it looks like there is a workaround possible via that CTOR by directly using the DefautEncoder class rather than working soley through the Encoder interfaces. Do any of you who are encountering this issue think that would be a feasible workaround? I still am not clear in the exact context that this is popping up in so I can't really comment further.

amtunlimited commented 3 years ago

At least some should be able to change the Validator.HTTPHeaderValue regex (the original purpose of the bug)

kwwall commented 3 years ago

@amtunlimited wrote:

At least some should be able to change the Validator.HTTPHeaderValue regex (the original purpose of the bug)

We will be correcting the regex as per GitHub issue #579 but we are not planning to extend the regex to accommodate structured headers. ESAPI's defaults are ultra conservative in order to protect applications against overly liberal browsers. As @planetlevel mentioned in some other GitHub issue that I don't recall, this is intended to provide a "safe harbor" of sorts. This is especially true for defaults that can be easily changed by application developers. Change it to whatever you would like it to be, but we can't ask everyone to do that. These are in a properties file for a reason.

However, as @xeno6696 mentioned, depending on how this is coming up as an issue (that's why I asked again; I see no stack trace, log messages, sample code, etc. that would explain the context when this is causing a problem), it is very likely that simply tweaking the regex is not going to address the problem. Go back and read carefully what @xeno6696 wrote here. Depending on the context where this is arising, simply changing the regex may likely not be sufficient.

kwwall commented 3 years ago

In this comment, where @miketaylr referred to this example of incorrect sanitization: https://github.com/zhanjindong/security-demo/blob/e71c178fe173fd28723466b596d99ead4f2fb3da/src/main/java/xss/EsapiRequestWrapper.java#L48-L58

@amtunlimited - this is exactly the sort of thing that I'm referring to in my previous comment. This particular use of ESAPI's encoder is simply misguided and it will have the encoding problem with structured headers that I mentioned in my previous comment that @xeno6696 previously called out. Changing the regex is not going to fix situations like this. That's why the ESAPI developers would like to better understand the context of where ESAPI is causing problems. Some things we may be able to address, but others we are much more limited.

planetlevel commented 3 years ago

I've skimmed the history here. Is it fair to say we're trying to canonicalize this header without knowing what we are going to use it for? Philosophically, the ESAPI approach was intended to be that you canonicalize before you validate and finally escape before using. Probably you should do ALL of this somewhere near where you are going to use the input.

The big mistake here is to think that you can canonicalize for everything up front and you're done forever. The ESAPI Filter was intended to demonstrate how ESAPI could be used, not to be a universal super-canonicalizer for any kind of input anyone could possibly send.

The right approach (as Kevin mentions above) is to instantiate an Encoder with an array of the codecs you need for whatever you're going to do with the data. No sense decoding Javascript if you're never going to stuff the input into Javascript.

kwwall commented 3 years ago

@planetlevel - Thanks for adding your input. You asked:

Is it fair to say we're trying to canonicalize this header without knowing what we are going to use it for?

That's part of what we are trying to understand. It is still not 100% clear (at least to me) the context or contexts where this is arising. That's what I'm trying to understand.

ESAPI / esapi-java-legacy

Default HTTPHeaderValue validation regex incompatible with new Chrome headers #578