Sanitize input - Githubissues

DallingerBot commented 8 years ago

From @suchow on October 24, 2015 19:7

We have noticed an issue where incoming HTTP requests will have a / appended to one of the parameters. Our app does not sanitize the input in any way, and so this lead to an error where one of our internal methods complained that the id was not convertible to an integer (from a string) and failed in a most ungraceful manner. Since that time, we added a simple check using .isdigit() that sends back an error response if the input isn't a sequence of digits.

Our questions are:

How would a / get appended to a request parameter?

Is there a general-purpose framework or tool to sanitize input?

Copied from original issue: berkeley-cocosci/Wallace#243

DallingerBot commented 8 years ago

From @cewing on October 24, 2015 19:34

Usually this is an issue with Flask URL configuration and URL matching. Flask has a "feature" that adds slashes to the ends of URLs if you configure the URL in a particular way. I'll check and see if you are falling foul of this.

Can you tell me which endpoints suffer from the problem?

DallingerBot commented 8 years ago

From @suchow on October 24, 2015 20:0

We've seen it the same issue with multiple endpoints.

Most recently, we saw it with /info in custom.py, where the info_type parameter had a slash appended to it:

Oct 16 15:45:40 w4a1dd857-f550-4d26-aef6-4e7a app/web.5: >>>> ????? Error: /node request, participant_id not specified 
Oct 16 15:45:41 w4a1dd857-f550-4d26-aef6-4e7a app/web.7: >>>> A2ZGN Error: /info request, unknown info_type LearningGene/ 
Oct 16 15:45:42 w4a1dd857-f550-4d26-aef6-4e7a app/web.4: ValueError: You cannot get transmission of status pending/.Status can only be pending, received or all

It has also happened for at least one other route with a ..._id parameter. (@thomasmorgan, do you remember which route this happened on the first time?)

DallingerBot commented 8 years ago

From @cewing on October 30, 2015 20:27

@suchow, @thomasmorgan, I'm looking over the info() view in custom.py and the first thing I can say is that given the code I have in front of me, there is no way that the info_type parameter ends up at that view at all, let alone with a trailing slash. I imagine that the experiment that raised this error is not in the examples directory, because in the examples I have there are only two experiments that call the '/info' endpoint: bartlett1932 and function-learning. Both of those experiments have calls to the /info endpoint in the javascript front-end code. Neither uses the info_type parameter. I imagine that whatever experiment was running when these errors came up would offer a bit more in the way of clues as to how the slash ended up appended. Can I get access to that code?

In the meantime, here's a suggestion for handling the problem, assuming that a slash is never a valid char in an info_type:

Your current code looks like this:

    try:
        info_type = request.values["info_type"]
    except:
        info_type = None
    if info_type is not None:
        try:
            info_type = exp.known_classes[info_type]
        except:
            exp.log("Error: /info request, unknown info_type {}".format(info_type), key)
            page = error_page(error_type="/info, unknown type")
            js = dumps({"status": "error", "html": page})
            return Response(js, status=403, mimetype='application/json')

The purpose here is of course to verify that if the parameter has been passed, it is valid, and to resolve it to a class object. You have a number of blocks in your views that match this usage pattern. The general shape is this (in psuedocode):

try:
    get param from request
    validate parameter
    maybe transform parameter
except some error:
    return an error response to the client

Think of a pattern like this as a substitute:

# an exception type you control:
class WallaceValidationError(Exception):
    pass

# a validator that returns a good value or None, or raises a known exception type with a message:
def validate_info_type(request, valid_types):
    """this validator checks that a valid info_type has come in via request

    if info_type is valid, then a valid type class object will be returned
    if it is missing, then None will be returned
    if the type is not valid (not in the provided list of valid types) a WallaceValidationError will be raised
    """
    val = request.values.get(param, None)
    valid_type = valid_types.get(val, None)
    if val is not None and valid_type is None:
        msg = "unknown info_type {}".format(val)
        raise WallaceValidationError(msg)
    return valid_type

# a chain of validation within a single try except, returning a given error response if any fail:

try:
    info_type = validate_info_type(request, exp.known_classes)
    # other validation for optional params could go here
except WallaceValidationError as e:
    exp.log("Error: /info request, {}".format(e.message), key)
    page = error_page(error_type="/info, unknown type")
    js = dumps({"status": "error", "html": page})
    return Response(js, status=400, mimetype='application/json')

With a reasonably thought-out and smallish library of validators, you could use a pattern like this to dramatically reduce code repetition across your views.

DallingerBot commented 8 years ago

From @cewing on October 30, 2015 20:39

One other note on this info view (and most other views in custom.py):

Your current pattern for views is to accept both GET and POST requests and then do this basic pseudocode pattern:

def view(args):
    some shared setup
    if request.method == 'GET':
        do one set of actions
        return one kind of response
    if request.method == 'POST':
        do another set of actions
        return another type of response

This leads to huge, unwieldy view functions with lots of nested if/else blocks that are hard to read and harder to reason about. Given that flask allows you to handle the same apparent end point with different view code depending on request method you could separate this a bit:

def shared_handler_code(request, args):
    shared actions
    return expected values

@route('/info', methods=['GET']):
def get_view(args):
    shared_vals = shared_handler_code(request, args)
    do one set of actions
    return one kind of response

@route('/info', methods=['POST']):
def post_view(args):
    shared_vals = shared_handler_code(request, args)
    do another set of actions
    return another kind of response

This type of approach will dramatically reduce the size of your view functions, allowing you to reason about the actions they take more accurately. It will also allow you to begin to see patterns in the actions taken that may help you to factor other shared bits like the validation stuff above into smaller, abstracted sets of tools that you can re-use across different views.

DallingerBot commented 8 years ago

From @thomasmorgan on November 2, 2015 18:40

@cewing Hi Cris, I've added you and Alec as collaborators to the repo "Rogers2b". This one definitely produced the / errors and also passes the info_type parameter. I'll work on the rest of your suggestions today.

Thanks.

DallingerBot commented 8 years ago

From @cewing on November 2, 2015 18:48

Outstanding, @thomasmorgan. I'll look it over asap and hopefully find the source of the problem.

c

DallingerBot commented 8 years ago

From @cewing on November 3, 2015 23:46

@thomasmorgan, @suchow, I've looked over the code for the Rogers2b repository and I do not see any clear culprit for what would cause the trailing slash to be appended to the request coming from your javascript front-end.

I agree that it's suggestive that the errors from these trailing slashes appear immediately after requests to the /node endpoint where the participant_id was not passed along. However, in looking at the state of the view function served by the /node endpoint as I believe it may have existed at the time these errors happened it looks like the the following flow should have occured:

The /node endpoint is matched to an incoming POST request, but is missing a participant_id query parameter, and so the view function returns a 403 response.
The createAgent javascript function handles the error response by rendering the error page returned by the view

But this does not appear to be what actually happened.

As a question, does the ensureSameWorker function in the tasks.js ensure that the requests from the front-end will be handled by the same web dyno in Heroku? Does the fact that the three log lines you've pasted above have different Heroku dynos handling them mean that they are not the same client requesting them? This is unclear to me.

At any rate, one problem I'm facing in finding a convincing answer to this issue is that the code for Rogers2b that I am looking at is definitely not the code that was run and resulted in the errors above. The new call paths are quite different from the old ones and so I'm having a hard time with being able to be certain what code was actually in production when those errors occurred.

DallingerBot commented 8 years ago

From @cewing on November 3, 2015 23:56

Another quick point. It is interesting to me that the errors raised happened because a trailing slash was appended to a value passed by query parameters. It's even more interesting that it would appear likely that the info_type parameter and the direction parameter might have been the last elements in the query string for the GET requests that were sent by the clients.

I wonder if there's some browser out there that is changing urls that look like this:

/info?participant_id=abc123&node_id=555&info_type=LearningGene

into ones that look like this:

/info?participant_id=abc123&node_id=555&info_type=LearningGene/

Given that the value of "LearningGene" is hard-coded into tasks.js in the Rogers2b experiment, it's pretty much impossible that the trailing slash was appended in any other way. I can't find any reference to issues in reqwest.js that would suggest it might be the culprit, adding a trailing slash after the querystring part of a url. Doesn't seem a likely culprit.

Is there any time in which values incoming via these types of Query Parameters might have a trailing slash in them? If not, one could try to sanitize the incoming values by stripping the "/" character from the end of the string:


val = request.values['info_type', None]
if val is not None:
    val = val.rstrip('/')

If the trailing slash is not ever a viable possibility, then this would at least clean the input and might help you get past such errors.

DallingerBot commented 8 years ago

From @thomasmorgan on November 4, 2015 0:1

@cewing

What you have identified as the flow is correct.

The ensureSameWorker function makes sure that the mturk worker doesnt try to change their worker id during the experiment, it has nothing to do with worker dynos. This has happened before (participant "ayyy lmao" took part) and unfortunately psiturk is perfectly credulous and just assumes this is a new worker sent by AWS, this means we end up with half finished assignments that never actually complete which is very bad news. The ensureSameWorker function looks at the workerid over time and makes sure it doesnt change, if it does it sends the participant to the page tampering.html.

Sorry about updating the Rogers2b code - I wasn't thinking about replicating this past issue. It is only the most recent commit that changes anything from when we ran it though. So if you go back 1 commit and run it from master branch on wallace that will be how it was when we ran it (I think).

We wondered if people's browsers were doing it too, we actually do have data on what browser they were using, but it seemed random

Your suggestion for cleaning is good, the values are unlikely to ever genuinely have a / on the end of them, so cleaning like that would be good. The only downside is that a couple of times a whole load of slashes were appended, e.g. LearningGene/////////// - would your suggestion clean only the final slash?

DallingerBot commented 8 years ago

From @cewing on November 4, 2015 0:8

actually, no. That's the lovely thing about the strip method in Python:

>>> foo = "LearningGene/"
>>> foo.rstrip("/")
'LearningGene'
>>> foo += "/////////////"
>>> foo
'LearningGene//////////////'
>>> foo.rstrip("/")
'LearningGene'

It strips any occurrences of the characters in the string you pass as an argument from the ends of the string on which it is called and returns that new string as it's return value. If you use the lstrip or rstrip variants, then it only strips from the left or right ends of the string.

DallingerBot commented 8 years ago

From @thomasmorgan on November 4, 2015 0:11

ah great, then yes, this will probably be a great solution.

Thanks!

DallingerBot commented 8 years ago

From @thomasmorgan on December 10, 2015 16:49

As an update, this error is still occuring occasionally. Here's an example from the most recent experiment (the log file is 2015-12-01):

Local7  Info    app/web.3   >>>> ????? Error: http://wba8f484f-9864-4378-8af4-4ae8.herokuapp.com/node/11276/received_infos?info_type=AmoebaList%2F GET request, unknown_class: AmoebaList/ for parameter info_type 
608378892952338452  2015-12-01T16:07:33 2015-12-01T16:07:33Z    149417543   wba8f484f-9864-4378-8af4-4ae8   54.196.141.86   Local3  Info    heroku/router   at=info method=GET path="/node/11276/received_infos?info_type=AmoebaList/" host=wba8f484f-9864-4378-8af4-4ae8.herokuapp.com request_id=3fa65f06-a783-4d6d-9f00-cc03ef603b1c fwd="199.116.169.254" dyno=web.3 connect=1ms service=16ms status=400 bytes=1586 
608378893514375169  2015-12-01T16:07:33 2015-12-01T16:07:33Z    149417543   wba8f484f-9864-4378-8af4-4ae8   54.196.211.232  Local7  Info    app/web.3   2015-12-01 16:07:33,556 - wallace.db - DEBUG - Closing Wallace DB session at flask request end

It is very rare (only 2 participants out of 3000) and not game-breaking which is good. However, I have no confidence in what is causing it. My only suggestion is that participants could be copy/pasting the web address from one browser to another and a / is getting appended as a result.

DallingerBot commented 8 years ago

From @alecpm on December 10, 2015 17:33

If it's some issue with manually entered/pasted urls, perhaps you should just do an val.rstrip('/') on these values after extracting them from the request. On the other hand, this seems like another instance where switching to route parameters rather than request variables may make sense in terms of validating and sanitizing input in a more reliable manner.

DallingerBot commented 8 years ago

From @cewing on December 10, 2015 18:2

is received_infos a URL endpoint in a specific experiment you've created? I don't find that path in the main Wallace repository? If so, can you share that repo with us?

DallingerBot commented 8 years ago

From @thomasmorgan on December 10, 2015 18:48

Cris, yes, this is a new route, it's currently at line 594 of custom.py in Wallace so you should be able to see it

DallingerBot commented 8 years ago

From @cewing on December 10, 2015 18:57

yes. I see it now. So this is definitely a place where you could just squash these errors without needing to care how they happen in the first place. Adding a step that strips trailing slashes from value to the request_parameters function, perhaps just after the value is extracted from the request but before it is processed and returned seems safe and sane.

vlall commented 6 years ago

I believe we now sanitize input and do type checking.

Dallinger / Dallinger

Sanitize input #178