RequestPolicyContinued / requestpolicy

A web browser extension that gives you control over cross-site requests. Available for XUL/XPCOM-based browsers.
https://github.com/RequestPolicyContinued/requestpolicy/wiki

support rules specifying origin and/or destination URL path(s) #299

Open msxfm opened 10 years ago

msxfm commented 10 years ago

Issue by alexlehm, Monday Apr 09, 2012 at 20:40 GMT. Originally opened as https://github.com/RequestPolicy/requestpolicy/issues/299


It would be helpful if whitelist entries could contain URLs with subdirectories in addition to the domain. This would be useful e.g. for the URL redirects on search engines or on Facebook, where I wouldn't want to whitelist the complete domain, only the URLs that redirect away from the domain. For example:

On google.com, when I click a URL in the search results, it first goes to www.google.com/url?...url=..., which in turn redirects to the target URL.

The whitelist entry for this would have to be allow origin google.com/url\?.*
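
Just to illustrate the kind of matching I have in mind (the rule shape below is entirely made up; only the regex test itself is real JavaScript):

// Hypothetical rule shape, for illustration only.
const rule = { originHost: "www.google.com", pathRegex: /^\/url\?.*[?&]url=/ };
const pathAndQuery = "/url?sa=t&url=https%3A%2F%2Fexample.org%2F";
console.log(rule.pathRegex.test(pathAndQuery));    // true: the redirector path would be allowed
console.log(rule.pathRegex.test("/search?q=foo")); // false: other paths stay blocked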

msxfm commented 10 years ago

Comment by jsamuel Saturday Apr 14, 2012 at 16:53 GMT


Paths in rules will be supported in 1.x. The machinery is in place to do this, but for now I really want to focus on getting 1.0 released before adding more features.

myrdd commented 9 years ago

Currently paths are not supported. IMHO we should first get v1.0 stable. What do you think, @alexlehm?

alexlehm commented 9 years ago

This is more of a convenience feature and not a bug, so it can certainly wait for the next version.

myrdd commented 9 years ago

A note in advance: Don't confuse support for URL paths (this issue) with support for filetype detection. Those two issues are related but not the same.

Summary of the discussion up till now

This summary also considers the comments in the duplicate #317.

The facts

First of all, it is possible to implement support for path-based rules. Good! :)


Secondly, when thinking about URL paths, it should be borne in mind that URL paths can be easily rewritten – see Rewrite Engines. This means that a path ending with .png could in fact easily be a JavaScript file or any other arbitrary file.


Finally, the user should be informed about URL rewriting and should be warned about its risks and implications. As @nodiscc has said, there should be a clear warning about path-based rules in the "Create rule" section of chrome://requestpolicy/content/settings/yourpolicy.html. IMO adding rules with a path through the menu shouldn't be possible.


What's yet to be decided…

It's still to be decided what the path part should look like – whether it should support wildcards, regex, or both. I think it would be easier to start with wildcards only; regex support could be added later.


Some thoughts on „trust“ …

…, e.g. the trust of URL paths and guessed filetypes.

@eibwen stated correctly that when a request is allowed, it's possible to compare its guessed filetype with the actual response. This is quite an interesting observation. For example, you could first remember all allowed requests that originated from <img> HTML elements. As soon as you get the responses from the server, you could check whether each really is an image. In other words, you could compare the request destination's file extension (.jpg, .png, …) with the actual media type of the HTTP response.
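
For illustration, the check could look roughly like this (a sketch only; the extension-to-type mapping is a tiny made-up subset, not anything RP actually has):

// Sketch only: compare a destination's file extension with the response's media type.
const expectedTypes = { ".png": "image/png", ".jpg": "image/jpeg", ".js": "application/javascript" };
function extensionMatchesResponse(path, contentType) {
  const dot = path.lastIndexOf(".");
  if (dot === -1) return true;                         // no extension, nothing to guess from
  const expected = expectedTypes[path.slice(dot).toLowerCase()];
  if (!expected) return true;                          // unknown extension, nothing to compare
  return contentType.split(";")[0].trim() === expected;
}
// A ".png" destination that actually returns JavaScript would be flagged:
console.log(extensionMatchesResponse("/images/pic.png", "application/javascript")); // false
console.log(extensionMatchesResponse("/images/pic.png", "image/png"));              // true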

The procedure I described above IMO has the potential to be integrated as an enhancement into RequestPolicy or into another addon – in order to determine or measure a site's „trust“. If anyone likes that, I suggest creating a new issue for it.

mk-pmb commented 8 years ago

When adding regexps, let's find a clever way to deal with dots and slashes, as lots of backslashes impede reading. Third party tools can help with rule creation but less with debugging.

myrdd commented 8 years ago

@mk-pmb Justin aimed at having both "path prefix" and "path regex". If a "path prefix" is specified, the path has to start with the given string. For more complex cases you need to use regex. For performance reasons I'd go with JavaScript's built-in regex syntax, which indeed requires escaping dots. However, when you use the RegExp constructor, which is what I'll do, you don't need to escape slashes. See:

r = new RegExp("/");
// …is equivalent to…
r = /\//;

I think that's good enough.


Implementation note: care must be taken with backslashes in combination with JSON. See:

let s = "\\."; // string will be "\."
let r = new RegExp(s); // regexp will be /\./
let j = JSON.stringify({pathRegex: s}); // JSON string will be "{"pathRegex":"\\."}"
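
Reading such a rule back is just the reverse; continuing the snippet above as a quick sanity check:

let restored = new RegExp(JSON.parse(j).pathRegex); // back to /\./
console.log(restored.test("a.b")); // true
console.log(restored.test("ab"));  // false
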
mk-pmb commented 8 years ago

:+1:

alexlehm commented 8 years ago

I like the way this is done in Foxyproxy: you can use either wildcard patterns or regexps. For most use cases a wildcard like *://*.yahoo.com/* is sufficient, and for more complex ones you can use a regexp.

mk-pmb commented 8 years ago

However, there are also quite a few cases where users mistakenly think that wildcards are safe enough. Re-using the example *://*.yahoo.com/*, some people might be surprised when their Yahoo proxy setting is used for http://mallory.example.net/tracker.php?from=http://mail.yahoo.com/whatever&spam=XOBiLiZLlnxr.

myrdd commented 8 years ago

Foxyproxy's URL wildcards are nice, but the way RP works, it would be path wildcards. I'm not sure that's very helpful; Foxyproxy simply translates the wildcard rules to regular expressions. I think we're fine supporting only path-regex (and path-prefix). Instead of writing /some/*/path/* you need to write ^/some/.*/path/ (or ^/some/[^/]*/path/), which is ok IMHO. If desired, path wildcards can be added at a later time.
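
For the record, a wildcard-to-regex translation of the Foxyproxy kind would only be a few lines anyway (just a sketch, nothing RP-specific):

// Sketch: translate a path wildcard into an anchored RegExp
// ("*" becomes ".*", everything else is matched literally).
function wildcardToRegExp(pattern) {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}
const re = wildcardToRegExp("/some/*/path/*");   // equivalent to ^/some/.*/path/.*$
console.log(re.test("/some/x/path/y"));  // true
console.log(re.test("/other/x/path/y")); // false

Whether * should be allowed to cross slashes (.* as above) or not ([^/]*) is exactly the kind of decision that regex-only rules sidestep.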

Btw @alexlehm I've edited your post due to a parsing error in the wildcard string. Asterisk (*) means "bold" in Markdown. Surrounding with backticks (```) helps.

alexlehm commented 8 years ago

@mk-pmb Yes, you're right. I had that happen with some tool that tried to match play.google.com and instead matched a search parameter where the search text was play.google.com.

alexlehm commented 8 years ago

Actually the wildcard problem applies to regexp patterns as well, since a "naive" user would use .*://*.yahoo.com/.*, which would match the domain somewhere in the path or query. Correct would be something like [^:]+://[^/]+.yahoo.com/.*

myrdd commented 8 years ago

@alexlehm yes, so the usage of [^/] in the path-regex should be encouraged.

mk-pmb commented 8 years ago

I also suggest checking whether the matched portion of the string is exactly equal to the original string, so we don't have to worry about what exactly can be matched by ^ and $, and about what happens when URLs grow really huge. Just a few days ago my sed failed due to a maximum line length, and I suspect that limit sits deep down in the C layers, since sed is more of a thin wrapper than a lot of regexp magic itself. If so, we shouldn't be surprised if a browser is affected at some future time. It wouldn't be the first instance of "the library docs clearly stated that your input has to conform to […]".
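
A plain-JS sketch of what I mean (the function name is just for illustration):

// Accept only if the regexp's match covers the whole path,
// regardless of whether the rule author remembered ^ and $.
function matchesWholePath(pathRegex, path) {
  const m = path.match(pathRegex);
  return m !== null && m[0] === path;
}
console.log(matchesWholePath(/\/some\/.*/, "/some/thing"));      // true
console.log(matchesWholePath(/\/some\/.*/, "/evil/some/thing")); // false: match doesn't cover the whole string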

[^/]

That would also match \x00, which isn't a problem for JS, but it's far from the probable original intent, so there's lots of room for malicious interpretation. And we haven't even brought up unpaired Unicode surrogates yet. If common use cases like "limit this portion to the host name" were aided by RPC, we could motivate users to use more high-level expressions that RPC could fix up for them, rather than having them try to solve all RegExp caveats by themselves and cement that solution into their configs.

Correct would be something like [^:]+://[^/]+.yahoo.com/.*

The problem is with the "something".

$ grep -qxPe '[^:]+://[^/]+.yahoo.com/.*' <<<'http://mallory-yahoo.com/tracker.php?from=…' && echo oh my grep
oh my grep

The more general problem with RegExps is that the above mistake isn't obvious to readers less familiar with RegExp abuse.

myrdd commented 8 years ago

Please stay on-topic. This issue is only about path RegExps. RPC handles the URI/URL parts "scheme", "host", and "path" completely separately. Having a rule with a "Path-RegExp" won't affect host-matching at all.

By the way, the wildcards allowed in RPC's domain-name specification do not translate into regexps. The "host" string is split at the literal dots (.).
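
To make the separation concrete, here is a rough sketch (not RPC's actual matching code) of how host and path are handled independently:

// Rough sketch only: host labels are compared dot by dot, and a path regex is
// only ever applied to the path+query, never to the host.
function hostMatches(pattern, host) {
  const p = pattern.split(".").reverse();
  const h = host.split(".").reverse();
  if (h.length < p.length) return false;   // (whether the bare domain should also match is left aside)
  return p.every((label, i) => label === "*" || label === h[i]);
}
const url = new URL("http://mallory.example.net/tracker.php?from=http://mail.yahoo.com/whatever");
console.log(hostMatches("*.yahoo.com", url.hostname));     // false: the rule's host part doesn't match
console.log(/yahoo\.com/.test(url.pathname + url.search)); // true, but irrelevant: a path rule never sees the host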

I also suggest checking whether the matched portion of the string is exactly equal to the original string, so we don't have to worry what exactly can be matched by ^ and $ and what happens when URLs grow really huge.

Could you please explain further what you mean? Which two strings do you want to compare? What do you want to prevent by doing so?

Regarding your comparison with sed, JavaScript has no such maximum string length. Since there are multiple add-ons operating on URLs using RegExp, we should be safe here.

What exactly is the problem with non-alphanumeric characters? What kind of malicious interpretation?

mk-pmb commented 8 years ago

This issue is only about path RegExps.

Right, I'm sorry, I forgot that; for some reason, at the time of my posting, your reminder and some other messages weren't visible to me yet. All the attack vectors I had in mind would require a full URL with a hostname, so they don't apply here.