distributed-system-analysis / pbench

A benchmarking and performance analysis framework
http://distributed-system-analysis.github.io/pbench/
GNU General Public License v3.0
188 stars 108 forks source link

PR #3122 broke support for APIs using the `HEAD` method #3135

Open portante opened 1 year ago

portante commented 1 year ago

PR #3122 broke support for APIs using the HEAD method. Below is the discussion context from that PR held post merge.


Does this work? This worries me, and seems like we no longer have HEAD implementations for our GET APIs. Normally, Flask implements a HEAD as a GET and discards the response payload (returning only headers). I don't think it does that if the head method is implemented, which you're now doing anywhere, so I think a HEAD ought to now go through to the fallback (not implemented) "abstract" _head.

While I'm not against the idea of being able to implement a specialized _head I don't think this way will work.

Or at least, I know that a HEAD /api/v1/datasets/list worked before, and I'm not confident that it'll work with these changes. (In fact, I don't see how it could.)

_Originally posted by @dbutenhof in https://github.com/distributed-system-analysis/pbench/pull/3122#discussion_r1054934150_

Yes, you are correct that a HEAD operation is not working now. I knew the client allowed it, but didn't think twice about this because we have no tests that exercise the HEAD operation. I have added one to the functional tests in PR #3123.

I am not a fan of this Flask behavior. For certain APIs we really want a separate method for the HEAD operation so that it does not consume server side resources unnecessarily. Pulling a large tar ball or a large query set can be costly.

I'll noodle about this and post a PR with a suggested way to address this.

_Originally posted by @portante in https://github.com/distributed-system-analysis/pbench/pull/3122#discussion_r1055638939_

dbutenhof commented 1 year ago

I noodled a bit on this quite a while back, and decided that for the APIs we had, duplicating code in order to get a slightly more efficient HEAD just wasn't justifiable.

I think that the one exception might be /datasets/inventory where we could pull the size, type (+ cache headers, E-tag, etc) out of the cache manager and generate the headers without manifesting the response stream. (Though I'm not sure how efficiently Flask's defaults will handle that.)

I really don't want to duplicate the code to generate "just headers" unless there's a clear advantage over the default "truncated get" mechanism; the duplication adds its own set of problems.

I decided to defer even worrying about this until we had a clearly delineated example. However the easiest way to enable it would probably be to drop the "abstract" _head and have _dispatch check on HEAD. If getattr(self, "_head") then call it, otherwise call _get and drop the response data more or less as Flask does by default. That allows us to override the behavior where we really want but avoid duplicating code otherwise.

portante commented 1 year ago

I noodled a bit on this quite a while back, and decided that for the APIs we had, duplicating code in order to get a slightly more efficient HEAD just wasn't justifiable.

Certainly duplicating code is a bad thing in general, so I understand the sentiment, but I think it is a bit misplaced.

This assumes that we really do want to allow a HEAD operation for every GET. It also means that without deliberately choosing to allow it, current or future APIs which are expensive, or increasingly become expensive, become an attack vector (deliberate or inadvertantly).

Further, looking through the existing APIs (enumerated below), I don't see enough APIs that return header values to warrant keeping this default of automatic HEAD for every GET.

In fact, I think we can survive just fine without a HEAD operation, except for the inventory (I'd like to use a HEAD for the inventory API to verify a dataset is removed on delete in the functional tests, and it would also be useful to use it for a CLI command to pull a tar ball down for detailed review up close).

Let's look at the APIs available answering to the GET method today:

  1. /api/v1/endpoints
    • No header values returned
    • A HEAD will not provide any useful behavior other than checking that the GET endpoint exists
    • Implementing HEAD could be avoided entirely, or if one wants to provide it as a liveness test, a separate HEAD implementation could just return, without creating duplicate code
  2. /api/v1/user
    • No header values returned
    • A HEAD will not provide any useful behavior ...
    • Implementing HEAD could be avoided entirely, or ...
  3. /api/v1/datasets/daterange
    • No header values returned
    • ...
  4. Inventory APIs
    • /api/v1/datasets/inventory/<string:dataset>
    • /api/v1/datasets/inventory/<string:dataset>
    • /api/v1/datasets/inventory/<string:dataset>/<path:target>
    • This is a candidate for special handling, but the same method could be called from both the HEAD and GET API interfaces where the sendfile call is not invoked
  5. /api/v1/datasets/list
    • No header values returned
    • ...
  6. /api/v1/datasets/metadata/<string:dataset>
    • No header values returned
    • ...
  7. /api/v1/server/audit
    • No header values returned
    • ...
  8. Server Configuration APIs
    • /api/v1/server/configuration
    • /api/v1/server/configuration/
    • /api/v1/server/configuration/<string:key>
    • No header values returned
    • ...

Given this list, all of these APIs that would support HEAD without doing anything by just dropping the response do not DO anything to make that even worth while for the server to respond to in normal operation.

There has to be a good reason to provide a behavior for methods of APIs.

Aside from the inventory APIs, if we can come up with a good reason for those to exist, great, I'd be all for it. But we need to default to not using ANY server resources unless it is truly warranted.

portante commented 1 year ago

For reference, HEAD support in data sets inventory, see PR #3138, https://github.com/distributed-system-analysis/pbench/pull/3138/commits/ae3c766c56c9c25a615af36887513fddad89681b.

webbnh commented 1 year ago

There has to be a good reason to provide a behavior for methods of APIs.

@portante, which problem are you trying to solve?

The previous arrangement (as I understand it...) provided implicit support for HEAD requests to any endpoint which provided a GET request, without requiring any additional implementation. This seems like a Good Thing™ to me, especially since it didn't preclude our overriding the implicit method with an explicit one.

However, you indicated that the implicit HEAD method could become an attack vector. Given that a request to the implicit HEAD method would invoke the same code as one to the GET method, I don't see how suppressing the HEAD method prevents an attack, since the attacker could presumably just as easily exploit the cost of the GET method (possibly to greater effect, since that method also has to return the full payload).

I understand that the implicit HEAD methods don't provide much value in the bulk of the cases, but they also don't cost much, as far as I can see, so I'm not understanding why you want to add code to prevent their use (or am I misunderstanding things?).