Ignasimg commented 9 years ago

So I was porting my web from CI 2.2.0 to 3.0rc2 when I noticed a weird issue with the urls.

The thing is my website requires to pass some base64 encoded value in the url, that means from time to time I'll pass a = character, since it is a disallowed on codeigniter instead of allowing it through conf, back on 2.2.0 I decided to encode / decode the base64 with urlencode / urldecode (basically, transforming = to %3D). And that worked fine.

But on CI 3.0rc2 it doesn't work anymore (complaining about disallowed characters on URL, even thought I didn't change the allowed characters configuration and so it has remain the same in both versions)

After exploring for a little bit I've discovered in 2.2.0 the _fetch_uri_string method, defaultly fetched the uri from a _detect_uri method which also defaultly got it from $_SERVER['REQUEST_URI'] and $_SERVER['SCRIPT_NAME'] substracting the script route from the full route.

Now on 3.0rc2 the _fetch_uri_string method doesn't exist anymore and the task is apparently handled in the constructor. But not only that, the _detect_uri method also doesn't exist, and the default behaviour now is to get the uri from $_SERVER['PATH_INFO'] directly.

And that's the issue, the value on $_SERVER['PATH_INFO'] is already decoded, and hence any encoded character such as %3D is already decoded into it's maybe non-allowed version.

Some things I've tried or thought: -Acording to $_SERVER manpage on php.net there should be a ORIG_PATH_INFO which is the original version of PATH_INFO before processed by php. Aparently that should be it, however I tried it and the only thing I got is an undefined index error (running on apache 2.4.7 with php 5.4.24, I'm aware this is not a codeigniter issue, just to prof I'm not too much outdated) -Why was _detect_uri removed? it really looks like too much work to do for something that could be more easily obtained from a simple server var, but after all, correctness should be the priority. -I don't know which processing is taken to transform ORIG_PATH_INFO into PATH_INFO (beside probably an urldecode), but if that was the only thing that's done, it should be easy to fake ORIG_PATH_INFO encoding PATH_INFO (taking care not to encode slashes /) so... explode -> encode -> implode

Other thought are welcome.

Ignasimg commented 9 years ago

I discarded the last "solution" since someone could be accepting extra unencoded characters and that would make them receive encoded ones.

aanbar commented 9 years ago

I think the new way is the right way of doing it, I don't want anyone bypassing the protection by encoding something that isn't allowed.

You're better off using sessions to store that base64 characters, As you're no longer limited to the 4K size which was there in the previous version of codeigniter.

Ignasimg commented 9 years ago

You are not bypassing anything.

If an user types the url /foo/bar/%3D the bar method in foo controller should receive the parameter "%3D" (which was what correctly happened on 2.2.0)

The bypass would be if you check the url using the %3D and then you send the controller the urldecoded version (in this case being the = character) which would also break the rule of receiving what the user actually wrote on the url.

And the session or anywhere else than the url don't make any sense in my case, I'm identifiying resources by their id which is a binary string converted to base64 for convinience. (But again, that's not the real issue)

narfbg commented 9 years ago

This is not an issue.

Also, the server variable to use for URI fetching is configurable, it's just the default that has changed.

Ignasimg commented 9 years ago

Sorry? So it's broken but it's not an issue... you are funny.

~~You might argue it's a php or apache bug for doing weird things on the PATH_INFO~~, but not an issue?!?!

Take a look at RFC 3986 (section 2.4 when to encode or decode) For the record: "When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded"

In other words and bringing it to CI you shouldn't decode the url at least until you've determined the parts of the URL (controller / method / arguments...) and hence also done the safety check for unallowed characters.

Or are you making your own standard on top of that?

And no ofense but saying the URI fetching is configurable is like saying I should keep it in session... of course is configurable, I can also configure my website to don't use CI at all... but we are not talking about that, we are talking about something that's broken. (And actually was also broken on 2.2.0, but because of the default behaviour aparently no one noticed)

Edit: Actually just checked RFC 3875 section 4.1.5 and PATH_INFO must be decoded, so the actual problem is in CI using PATH_INFO (unless as said you want to have your own RFC CI30)

narfbg commented 9 years ago

I'm not saying it's broken but it's not an issue. I'm saying it's not broken, which is why it's not an issue. If you want me to be more specific, I'll say it this way: It's not a CodeIgniter issue.

We don't decide how PHP works and unless completely necessary, we won't try to work around it. If not for anything else, if we change how it works, more people will be complaining that their URIs are no longer decoded ... we can't satisfy every single use case. We try to satisfy the majority of use cases and there are numerous ways for you to achieve what you want (personally, I'd just pick an URL-safe version of Base64). It's also unclear why you claim that it is an issue at all (excluding the fact that you've used URL-encoding once to work-around the URI filter) - if the URI filter doesn't get decoded data, it simply won't work.

I might also be misunderstanding what you're saying, but your comments so far only say that something is being decoded by default and you just don't want to happen ... that doesn't look like a bug description.

And no offense meant either, but we don't have that many configuration options just so you can ignore them and rely on auto-detection. Auto-detection is present because CI tries to be friendly even to the most inexperienced users and zero configuration always means easier usage, but it should never be relied on for anything beyond a fast setup. I'd expect somebody who uses IETF RFCs as their argument to have that in mind by default ...

Ignasimg commented 9 years ago

You say: "if we change how it works, more people will be complaining that their URIs are no longer decoded" The funny thing of your sentence is that up to 2.2.0 CI didn't decode the URI, and I'm sure you know better than me, but I quickly looked up on the issues and couldn't find a signle one complaining about their URIs not being decoded.

You say you can't satisfy every use case, and that's simply wrong, you CAN satisfy every use case, if you don't interfere with the use case itself. If you do interfere in the use case, some use cases will break.

"If the URI filter doesn't get decoded data it simply won't work" - wrong! the URI filter should get and hence use the very same data that the user browser requested. (And the very same data that the controller is going to receive!)

And you completely understand what I'm saying, the URI is being decoded by default, and it's not I don't want it to happen, it's it shouldn't happen, unless as said you want to have your own CI standard over those of RFC. (In which case I would stop discussing cause the palm of my hand wouldn't let me read your comments)

And as I see it, it's nothing about making CI more convinient for inexperienced users, whoever has a link on their pages with percent-encoded values, is because they explicitly made it (up to now I don't know anyone who encodes links with special characters "by hand") so I wonder, what's more convinient for both, experienced and inexperienced users, let the user handle it's values as she wishes (no interfering)? or make a funny guess that she wants it decoded and so decode it (interfering)?

In fact it's quite similar with the input class, in the input class you can request the raw value or the xss / mysql injection filtered one, right? I don't know you, but I can conceive a framework without the xss / sqlinjection filters, but I can't conceive one without the raw version.

And finally, (if you say you won't change it, it's ok, but plz answer those 2 questions): 1- Why was _detect_uri method removed in 3.0? 2- Which CGI have you seen that doesn't actually fill SCRIPT_NAME and REQUEST_URI?

drivingmenuts commented 9 years ago

Can you just add the disallowed characters to $config['permitted_uri_chars']?

narfbg commented 9 years ago

@drivingmenuts

It's configuration, you can add anything you wish for yourself. If we added to the default every character that somebody might need, it would be pointless.

@Ignasimg

Look, I may just turn out to be wrong about some of this, but first you'll have to tone it down a bit and then be more clear about what it is exactly that you see as a problem. I'm usually not that polite when somebody acts as cocky as you and I'm only still replying because I feel you might have a valid point. But currently, you're not helping ...

First of all, your initial description only showed that url-encoding doesn't work for you and for some reason you don't want to extend your permitted_uri_chars list for no obvious reason. You also claimed that _detect_uri() was removed and asked why ... Well, I don't consider that a bug report and hence why I closed the issue. And _detect_uri() was not removed - the URI class as a whole went through refactoring and the method is now called _parse_request_uri().

I'll ignore your attitude afterwards, but if you read the actual CGI spec, you'll see that CI is using PATH_INFO for exactly what it was designed for.

And again, as I said - you shouldn't be relying on autodetection. Configuration is made for people like you and you shouldn't be ignoring it on the grounds that auto-detection should always work like you want it. PATH_INFO was simply given precedence over REQUEST_URI in the auto-detection process and there's nothing wrong about that.

3b72eb58e61581b7e92012a322be48e216491d7c 9dd2dbb8b9a3edecddcb3907b65a402fd1ae71b4 (yes, people have complained)

Now, I want to answer to this in particular:

up to now I don't know anyone who encodes links with special characters "by hand"

Attackers do, directly on the URI. And that's why the permitted_uri_chars filter exists ... it would be completely pointless otherwise, it wouldn't exist and we wouldn't be having this conversation. If the filter is applied before URL-decoding, anybody would be able to bypass it because they'd always be able to just URL-encode the characters they want to get through the filter.

Also:

And you completely understand what I'm saying

No, I didn't understand what you meant before your last comment and I'm still not quite sure if you're complaining about not having _detect_uri(), about URL-decoding by default, about using PATH_INFO or all of the three and why ... Even if you're correct in what you're saying, at least one of those complaints would be void.

I've got no problem admitting that I was wrong in something, but so far I'm just not convinced. Please drop the snarky remarks and try to explain yourself better.

drivingmenuts commented 9 years ago

Sorry for the confusion, I was responding to Ignasimg.

narfbg commented 9 years ago

Oh ... my bad, I wasn't sure myself, but helpful suggestions are rare around here. :/

Ignasimg commented 9 years ago

Ok sorry about my attitude, let's start again and hopefully I can explain myself better

1- I'm not complaining about MY problem (it was solved even before I posted the first post, maybe I shouldn't had explained all the bullshit, but I tought it was useful to know why I happened to need it)

2- I'm complaining about CI using PATH_INFO, which takes some arbitrary action on the URI, particularly urldecoding it.

3- There is no bypassing as long as what you check is what later CI uses to call the controller. If as you say an attacker wants to insert an equal symbol into the controller using %3D in the URI, two things might happen: CI uses the decoded version and simply refuses the request because it contains invalid characters, or CI uses the encoded version to check, the check is successful and the controller receives %3D (hence the attack also failed, cause attacker wanted an "=" not "%3D"). (Whoever argues this might still be a security risk cause some third party tool (such as a database) might decode it, yes I agree, but take care of such security risk is not a URI business but a third-party driver business)

4- You say according to RFC3875 PATH_INFO is being used exactly for what it was designed for. And I guess you are refering to the "identifies the resource or sub-resource to be returned by the CGI script" part. But it also states: "Unlike a URI path, the PATH_INFO is not URL-encoded, and cannot contain path-segment parameters.". I'm not sure if a controller and method should be called a resource or a parameter, but what I'm sure is that CI (and any decent framework nowadays) shall support path-segment parameters.

5- About the people you mention complaining about not automatically decoding the url:

388 is simply wrong. What happens is the browser automatically encodes those and so you should decode them by yourself when you want to use it.

705 is the real issue. But after thinking a little bit I think the solution to this should be that the router internally urlencodes the keys of the array, taking care not to encode slashes and wildcards, and also keep the not-encoded ones (for CLI and weird browsers that don't auto encode URLs) in other words, I think the router should emulate the browser behaviour to encode the funny characters.

6- Yet another use case I just happened to notice, and I'll explain all the story like before, but IT'S ALREADY SOLVED! I have a website in chinese and I have some links like /foo/bar/[a few chinese characters] luckily or not the browser automagically percent-encodes the characters, older version of CI checks against the encoded one, no problem, (with the default allowed characters). If using PATH_INFO or the urldecoded version of REQUEST_URI to check for it, you gotta add "weird" rules on the allowed characters to keep your functionality. And not only chinese but spanish, french, german, russian, greek, ... I guess every language but english. And the more special characters a language has the more weird rules you need to add... do you guys seriously think this is normal?

7- I really took some time searching for people on google who couldn't reported their server let REQUEST_URI empty, and the every post complaining is either > 10 years old, or referencing IIS (which are somewhat less old but between 3 and 6 years old) and regarding IIS I didn't try it but there seems to be an official fix for that kb2277918 so I don't see any reason why we shouldn't be using raw REQUEST_URI and only this (beside the CLI)

narfbg commented 9 years ago

1- I'm not complaining about MY problem (it was solved even before I posted the first post, maybe I shouldn't had explained all the bullshit, but I tought it was useful to know why I happened to need it)

2- I'm complaining about CI using PATH_INFO, which takes some arbitrary action on the URI, particularly urldecoding it.

OK, at least that is clear now.

3- There is no bypassing as long as what you check is what later CI uses to call the controller. If as you say an attacker wants to insert an equal symbol into the controller using %3D in the URI, two things might happen: CI uses the decoded version and simply refuses the request because it contains invalid characters, or CI uses the encoded version to check, the check is successful and the controller receives %3D (hence the attack also failed, cause attacker wanted an "=" not "%3D"). (Whoever argues this might still be a security risk cause some third party tool (such as a database) might decode it, yes I agree, but take care of such security risk is not a URI business but a third-party driver business)

It seems to me that you don't understand that filter's purpose and that in turn leads you to conclusions that make the filter feature seem irrelevant. You're not saying that directly, but encoding didn't matter to the filter - that's the logical conclusion.

It's not a magic filter to prevent SQL injections (although it may help with that), nor intended to reject invalid URI characters. It's supposed to validate input data by a user-configured pattern. Yes, there's a big scary warning about it in the config.php file, but that's only because our users tend to be quite careless about input validation and that's a way to encourage them to be more restrictive.

At the end of the day, it is still your configuration and your rules - if you need to allow '=' because it's part of a valid URL input for you, you should absolutely add it to the pattern.

4- You say according to RFC3875 PATH_INFO is being used exactly for what it was designed for. And I guess you are refering to the "identifies the resource or sub-resource to be returned by the CGI script" part. But it also states: "Unlike a URI path, the PATH_INFO is not URL-encoded, and cannot contain path-segment parameters.". I'm not sure if a controller and method should be called a resource or a parameter, but what I'm sure is that CI (and any decent framework nowadays) shall support path-segment parameters.

Yes, a controller/method pair is essentially the resource that you access. You are correct about the rest however, so I guess you will get what you want. :)

5- About the people you mention complaining about not automatically decoding the url:

388 is simply wrong. What happens is the browser automatically encodes those and so you should decode them by yourself when you want to use it.

705 is the real issue. But after thinking a little bit I think the solution to this should be that the router internally urlencodes the keys of the array, taking care not to encode slashes and wildcards, and also keep the not-encoded ones (for CLI and weird browsers that don't auto encode URLs) in other words, I think the router should emulate the browser behaviour to encode the funny characters.

No ... that's too much requirements to satisfy for something that's not the router's job anyway.

To be honest, I never liked those suggestions, but my point stands - people simply want this behavior because it's makes life easier for them, and they do and will complain when they don't have it.

So at this point I'm thinking a configuration value that enables automatic URL-decoding for those who want it.

6- Yet another use case I just happened to notice, and I'll explain all the story like before, but IT'S ALREADY SOLVED! I have a website in chinese and I have some links like /foo/bar/[a few chinese characters] luckily or not the browser automagically percent-encodes the characters, older version of CI checks against the encoded one, no problem, (with the default allowed characters). If using PATH_INFO or the urldecoded version of REQUEST_URI to check for it, you gotta add "weird" rules on the allowed characters to keep your functionality. And not only chinese but spanish, french, german, russian, greek, ... I guess every language but english. And the more special characters a language has the more weird rules you need to add... do you guys seriously think this is normal?

No, I don't think using non-ASCII characters in an URI is normal and I'd rather discourage it.

I did think of a better example on my own however - you may want to use slashes inside of an URI path parameter, which automatic decoding prevents you from doing.

7- I really took some time searching for people on google who couldn't reported their server let REQUEST_URI empty, and the every post complaining is either > 10 years old, or referencing IIS (which are somewhat less old but between 3 and 6 years old) and regarding IIS I didn't try it but there seems to be an official fix for that kb2277918 so I don't see any reason why we shouldn't be using raw REQUEST_URI and only this (beside the CLI)

Well, it wasn't a case for not using REQUEST_URI. It was a case for using PATH_INFO because it's simply easier to work with - it requires less manipulation and that results in improved performance. Obviously, we didn't have your arguments presented at the time, so ... that's that.

Will commit the necessary changes shortly, thanks for your patience.

bcit-ci / CodeIgniter

system/core/uri.php problem related to encoded characters #3593

388 is simply wrong. What happens is the browser automatically encodes those and so you should decode them by yourself when you want to use it.

388 is simply wrong. What happens is the browser automatically encodes those and so you should decode them by yourself when you want to use it.