Reject unkosher URLs - Githubissues

dreeves commented 6 years ago

Give the user an error if any of the following are true:

A. slug doesn't match /^[\w\-]+$/
B. there's a querystring
C. there are any slashes except for the /by/ part

Test cases:

✔ bob.promises.to/do_a_thing/by/fri_5pm → GOOD
✔ bob.promises.to/do-a-thing/by/tomorrow → GOOD
✖ bob.promises.to/foo%bar → BAD
✖ bob.promises.to/foo^bar → BAD
✖ bob.promises.to/do_item_#2/by/9am → BAD

Note that the last one, handling a # character in the URL, is going to be hard: https://stackoverflow.com/questions/18796421/capture-anchor-link

chrisbutler commented 6 years ago

archiving some older notes from the app here so i can pull them out of the code:

Little picture: Why does this error handling code not run when I surf to 
a bad URL like iwill.glitch.me/foo%bar

Bigger picture: We're making an app that needs to process any URL the user may 
throw at it. We need the server to get the exact URL the way the user sees it in
the browser. That's going to be especially tricky for things like '#' characters
but right now I'm just trying to figure out the case of arbitrary '%' characters
that may not correspond to proper %-encodings.

I'm thinking like a catchall error-handling route that no matter what weird
encoding or whatever error is thrown I can still have the code here do something
with the URL as the user typed it.

For now we're just dropping all this and having the app give the user an error
if there are any weird characters in the URL.

app.use(function(req, resp, next) {
  var err = null
  try {
    console.log("TRYING", req.path, req.url)
    decodeURIComponent(req.path)
  } catch(e) {
    err = e
  }
  if (err) {
    console.log("CAUGHT ERR:", err, req.url)
    return resp.redirect('/')
  }
  next();

  //console.log("DEBUG USE1:", req.originalUrl)
  //resp.redirect('/')
})

chrisbutler commented 6 years ago

i'm going to close this because we're doing a decent job of validating promise requests at this point... and i don't think we want to be too clever about 'valid characters' yet

also, #85 will cover requests that are generated programmatically, so the spirit of this issue is resolved

dreeves commented 6 years ago

To clarify, the blacklisting (#85) is about otherwise valid URLs like "/robots.txt" that get GET'd but shouldn't be created as promises.

This issue is about making sure, ideally, that whatever URL the user sees in the browser address bar, we're capturing as the urtext. One that was giving me fits on the Node side was "/foo%bar", ie, invalid percent-encodings. Other weird characters like "^" we seem to be handling fine. Handling "#" is going to be a bitch because the browser treats it as valid but it doesn't get passed along to the server. URLs with a "?" are fine already except it breaks the "/edit" link.

I was thinking if we rejected special characters even when we can actually handle them then we could train users to avoid them and not get bitten by the broken cases like "#" and "%".

chrisbutler commented 6 years ago

@dreeves sounds like something that belongs in the spec, not in github issues

dreeves commented 6 years ago

http://dreev.commits.to/move_kosher_urls_to_spec

dreeves commented 6 years ago

More notes for myself for later:

// The following route rejects URLs with bad characters but it would be better to
// specify a big regex defining exactly what *does* count as a valid promise URL
// and reject everything else.
// NB: Rejecting '#' is moot because we don't see them; the browser eats them.
// Also this isn't matching on query string so rejecting '?' here doesn't help.
// That might be pretty important to fix.
// Things we might want to reject but that at least one existing promise 
// in the database currently uses include:
//   at (@) -- just 1 so far!
//   colon (:) -- just 8 so far but pretty useful for times of day!
//   slash (/) -- hundreds :(
app.get(/^\/_s\/(\w+)\/.*[\!\%\$\^\*\(\)\[\]\=\+\{\}\\\|\;\'\"\`\~\.\&].*$/, 
  nein)

/*******************************************************************************
The deal with routing:
1. The subdomain handler turns URLs like 
   "alice.commits.to/foo_bar/baz" into 
   "/_s/alice/foo_bar/baz" 
   (the "_s" is a magic string. https://en.wikipedia.org/wiki/Magic_string )
   It's the "/_s/..." version that the router matches on.
2. There are a million gotchas and bugs with trying to use Express's partial
   regex syntax -- eg, https://github.com/expressjs/express/issues/2495 -- so we
   want to use actual regexes as the thing to match on here instead of strings.
3. If we wanted to use the string version, here's a handy route tester to see
   how things get matched: https://wesleytodd.github.io/express-route-tester
4. Dots aren't allowed anymore so we don't have to individually route various
   things that rogue bots and things will try to GET, but in case we change our
   mind about dots then we'll need those so they're here commented out.
5. Any file in /public will automatically get served first before any of this
   routing.
*******************************************************************************/

/*******************************************************************************
// Things we might actually want to serve but don't currently serve:
z.get(/^\/_s\/(\w+)\/test\d*\.txt$/, nein) // eg test123.txt
z.get(/^\/_s\/(\w+)\/favicon\.ico$/, nein)
z.get(/^\/_s\/(\w+)\/apple-touch-icon.*\.png$/, nein)
// Things rogue bots or pentesters have tried to GET:
z.get(/^\/_s\/(\w+)\/xmlrpc\.php$/, nein)
z.get(/^\/_s\/(\w+)\/cms\/wp-includes\/wlwmanifest\.xml$/, nein)
z.get(/^\/_s\/(\w+)\/hirevmcyvpgypnk\.html$/, nein)
z.get(/^\/_s\/(\w+)\/blog\/wp-includes\/wlwmanifest\.xml$/, nein)
z.get(/^\/_s\/(\w+)\/site\/wp-includes\/wlwmanifest\.xml$/, nein)
z.get(/^\/_s\/(\w+)\/wordpress\/wp-includes\/wlwmanifest\.xml$/, nein)
z.get(/^\/_s\/(\w+)\/wp-includes\/wlwmanifest\.xml$/, nein)
z.get(/^\/_s\/(\w+)\/wp\/wp-includes\/wlwmanifest\.xml$/, nein)
*******************************************************************************/

commitsto / commits.to

Reject unkosher URLs #28