apollographql / apollo-server

🌍  Spec-compliant and production ready JavaScript GraphQL server that lets you develop in a schema-first way. Built for Express, Connect, Hapi, Koa, and more.
https://www.apollographql.com/docs/apollo-server/
MIT License

Out-of-the-box protection from deeply nested and complex query attacks #1310

Closed FluorescentHallucinogen closed 5 years ago

FluorescentHallucinogen commented 6 years ago

IMO, Apollo Server should have protection from deeply nested and complex query attacks by default (out of the box), especially since many developers might not be aware of these concerns.

The algorithm described in the “Semantics and Complexity of GraphQL” paper by @hartig and @jorgeperezrojas (see the preprint at http://olafhartig.de/files/HartigPerez_WWW2018_Preprint.pdf) looks very promising. It computes the size of a GraphQL query result (response) without generating that result. The good thing: the algorithm is polynomial in the size of the data and the query/request.

See also:

FluorescentHallucinogen commented 6 years ago

@mxstbr PTAL.

FluorescentHallucinogen commented 6 years ago

The full text of the final version of the “Semantics and Complexity of GraphQL” paper can be found here: https://dl.acm.org/citation.cfm?id=3186014

FluorescentHallucinogen commented 6 years ago

@acarl005 @taion @ivome @pa-bru PTAL.

hartig commented 6 years ago

I am one of the two authors of the paper mentioned above.

I always wanted to write a less math-heavy blog post that describes the contributions of the paper (including the algorithm) informally. Unfortunately, I have not found the time yet, and at the moment I am on vacation and I will have a considerable backlog to deal with when I come back.

Anyway, in the meantime you may want to watch the video of my recent talk at GraphQL Europe: https://youtu.be/oNDmVLRKo_M

Towards the end of the talk I go through an example that illustrates how the algorithm works. If you want to check the slides of the talk, you can find them at: http://olafhartig.de/slides/GraphQLEurope2018OlafHartig.pdf

Regarding an actual implementation of the algorithm, I recently had a student who has built a prototype for the purpose of conducting some experiments. I am still waiting for him to document the code and then we will make the implementation available as open source. However, it really is just a prototype rather than a well-engineered, production-ready library.

By the way, the preprint of the paper and the final version published in the ACM DL are equivalent in terms of content.

taion commented 6 years ago

Unless I'm missing something, both my library, graphql-validation-complexity, and graphql-cost-analysis are linear in the size of the query as with the other validation rules. They just traverse through the query and keep track of depth or a set of multipliers.

That said, I don't know that there's a good one-size-fits-all solution; it's pretty natural to want to tweak the costs, and the two libraries mentioned above have APIs that focus on sufficiently different things – mine on working well with the programmatic API, @pa-bru's on working well with the SDL.
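
For illustration, the general shape of such a validation rule — a minimal depth-limit sketch in the spirit of those libraries, not their actual code; MAX_DEPTH is an arbitrary value and fragment spreads are ignored for brevity:

const { GraphQLError } = require('graphql');

const MAX_DEPTH = 10; // arbitrary limit for this sketch

// A graphql-js validation rule: walk the query AST and track field nesting.
function depthLimitRule(context) {
  let depth = 0;
  return {
    Field: {
      enter() {
        depth += 1;
        if (depth > MAX_DEPTH) {
          context.reportError(
            new GraphQLError(`Query is nested more than ${MAX_DEPTH} levels deep`)
          );
        }
      },
      leave() {
        depth -= 1;
      },
    },
  };
}

// Usage: validate(schema, document, [...specifiedRules, depthLimitRule])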

FluorescentHallucinogen commented 6 years ago

PTAL at:

FluorescentHallucinogen commented 6 years ago

That's what I've also found: https://github.com/LiUGraphQL/apollo-server-core.

hartig commented 6 years ago

Let me add a bit more context to the recent comments by @FluorescentHallucinogen.

First, I now have finished the blog post I promised. This post provides an informal summary of our research paper, including an example that demonstrates how our proposed algorithm works. Find the post at: http://blog.liu.se/olafhartig/2018/08/08/lightweight-summary-of-our-paper-semantics-and-complexity-of-graphql/

Moreover, a student of mine has written a prototypical implementation of the algorithm. That's what you find at:

In addition to creating this implementation, the student has also done an initial experimental evaluation. The code for the GraphQL server used in the experiments is at:

This GraphQL server uses a couple of packages that had to be extended for the experiments. These extensions can be found at:

hartig commented 6 years ago

@taion The main difference between the libraries that you mention in your comment above and our algorithm is that those libraries can only return an estimate of the query result size, whereas our algorithm returns the exact size.

FluorescentHallucinogen commented 6 years ago

@hartig What about plugin for https://github.com/prisma/graphql-middleware? See https://www.prisma.io/blog/graphql-middleware-zie3iphithxy/ for more info.

@maticzav Is middleware (resolvers) the right place to calculate query result size?

hartig commented 6 years ago

Conceptually, the result-size check comes in between the validation step and the execution step.

I will read the post about the middleware.
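
To illustrate where that check would sit, here is a minimal sketch around plain graphql-js; calculateResultSize is a hypothetical stand-in for an implementation of the algorithm (and, as discussed below, it needs access to the backing data), and MAX_RESULT_SIZE is an arbitrary threshold:

const { parse, validate, execute } = require('graphql');

const MAX_RESULT_SIZE = 10000; // arbitrary threshold for this sketch

// Hypothetical stand-in for the result-size calculation.
async function calculateResultSize(schema, document, dataSources) { /* ... */ }

async function runQuery(schema, source, contextValue, dataSources) {
  const document = parse(source);

  // 1. Standard validation against the schema.
  const validationErrors = validate(schema, document);
  if (validationErrors.length > 0) {
    return { errors: validationErrors };
  }

  // 2. Result-size check, conceptually between validation and execution.
  const size = await calculateResultSize(schema, document, dataSources);
  if (size > MAX_RESULT_SIZE) {
    return { errors: [new Error('Query result would be too large')] };
  }

  // 3. Only now execute the query.
  return execute({ schema, document, contextValue });
}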

taion commented 6 years ago

Okay, yes, but the point of the heuristics is that it's possibly the calls to the backend that are expensive, i.e. in practice we don't want our GraphQL server to make requests to all the backend services in the first place.

In your implementation, you need to actually query the database to get the result set estimation (e.g. in the getEdges function).

If you replace that with a heuristic, you recover the same thing as the already-implemented complexity cost estimation approaches. With the current implementation, you end up essentially doubling the load to the database, and still don't protect back ends from malicious, expensive queries.

FluorescentHallucinogen commented 6 years ago

@taion

In your implementation, you need to actually query the database to get the result set estimation (e.g. in the getEdges function).

It computes the size of a GraphQL query result (response) without generating this result. IIUC, it doesn't need to actually query the database.

@hartig Please correct me if I'm wrong.

@hartig If it really doesn't need to query the database, can it calculate the size on the client rather than the server? (This is a theoretical question, because checking on the client does not protect the server.)

@hartig How is it possible to calculate the exact size (in bytes?) without access to the actual data from the database?

taion commented 6 years ago

You trivially can't calculate the exact size without knowing e.g. how long a list actually is, or without knowing whether specific nodes are null or not. Those things aren't possible to determine statically because they depend on the actual data.

See e.g. https://github.com/LiUGraphQL/graphql-result-size/blob/b880c659a00f301c9036766e2fb44f0f183cd765/functions.js#L46-L50.

The point of the static complexity calculation validation is to do this fully statically, from just the query and possibly the variables, without using any actual data.

hartig commented 6 years ago

@FluorescentHallucinogen, @taion is right, our algorithm needs to access the database back-end.

@taion Some comments on your sentence "With the current implementation, you end up essentially doubling the load to the database, and still don't protect back ends from malicious, expensive queries."

FluorescentHallucinogen commented 6 years ago

@hartig What about the time (speed) of calculation? Does the check add a latency to every request? Do we really need high accuracy (exact size)?

hartig commented 6 years ago

What about the time (speed) of calculation? Does the check add a latency to every request?

Yes, the check adds latency. In the experiments with our prototype (and I should emphasize again that this prototype is a very straightforward implementation and can certainly be improved), the latency roughly doubles for "safe" queries, that is, queries whose results do not blow up exponentially. For non-safe queries (whose results do blow up exponentially), the added latency is negligible because producing these large query results takes significantly more time. So, the goal of our new ongoing work that I mentioned is to find a way to identify whether a given query is safe or non-safe by looking only at the schema (and, of course, the query itself); then our size-calculation algorithm would have to be run only for the safe queries, which, again, are the cases in which running our algorithm doubles the latency rather than being negligible.

Do we really need high accuracy (exact size)?

It depends, I guess. Estimates can always be way off, and I believe there is no study of the accuracy of the estimation-based approaches.

ivome commented 6 years ago

@hartig I have looked at the implementation of the algorithm. As already pointed out by @taion, this adds significant load to the database backend(s).

I also think that it is theoretically impossible to calculate the exact size of the query without resolving the actual data. If you don't know which values are NULL or not (which is impossible without access to the actually returned data), your calculation might be inaccurate. Also, when you take Union or Interface types with fragments into account, you need the data to determine the size.

Take the following schema for example:

type Query {
  listNodes(limit: Int): [MyUnion]
}
type ExpensiveType {
  # Complexity 1000
  complexField: String
}
type CheapType {
  # Complexity 1
  cheapField: String
}
union MyUnion = ExpensiveType | CheapType

Then try to get the size for this query without loading the data:

{
  listNodes(limit: 10) {
    ...on ExpensiveType {
      complexField
    }
    ...on CheapType {
      cheapField
    }
  }
}

This query could have a complexity of 0 if the array has no results, or 10000 if all results are of type ExpensiveType. Furthermore, the fields complexField and cheapField could be loaded from different backends and depend on the node resolved in the listNodes field. A backend for complexField might return NULL or not, etc.

You would have to essentially duplicate / execute the data resolving logic for the validation rule, which defeats the purpose of protecting the server.

Even if you duplicate the data-resolving logic or replace it with other, more performant data-loading operations (like count(*) etc.), you still cannot guarantee an exact result because of the time gap between the execution of the validation rule and the actual execution of the query. The data might have been changed in the backend by another process in the meantime, changing the result size.

Correct me if I am wrong, but I am 99.9 % sure that it is not possible to get the exact result size without resolving the data.

taion commented 6 years ago

For complexity calculation heuristics you care about worst cases anyway, or realistic worst cases.

Queries get exponentially more expensive if and only if there are nested lists, and as long as you make some assumption that the list returns more than 1 node, you're okay.

For a well-behaved paginating client, taking the requested page size as the assumed return length is a reasonable, conservative assumption. Otherwise, you can likely derive a good-enough static estimate for the length of the list.

The contribution of your approach is that, instead of using a heuristic to estimate the length of lists (and whatever it does for nullable values), it queries the backing data store (which in many cases will not just be a database!) – but that's exactly what we do not want.

hartig commented 6 years ago

@ivome

Correct me if I am wrong, but I am 99.9 % sure that it is not possible to get the exact result size without resolving the data.

You are right. As I have mentioned above, our algorithm does look into the data.

hartig commented 6 years ago

@taion

... but that's exactly what we do not want.

I see.

taion commented 6 years ago

To make this more concrete, imagine you're building a product like Facebook, and you're faced with a query like:

viewer {
  friends(first: 10) {
    friends(first: 10) {
      friends(first: 10) {
        friends(first: 10) {
          friends(first: 10) {
            friends(first: 10) {
              friends(first: 10) {
                name
              }
            }
          }
        }
      }
    }
  }
}

"Database size" here could be enormous, so actually running the queries is likely quite unattractive.

On the other hand, the simple heuristic of assuming that a list query for friends(first: 10) returns 10 elements is going to give a nice, conservative estimate of the cost of the query, in a way that works well in practice.
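
To make the arithmetic explicit: with that heuristic (and an assumed cost of 1 per leaf field), each friends(first: 10) level multiplies the estimate by 10, so the seven nested levels above come out to 10^7 = 10,000,000 name fields — computed from the query text alone:

// Cost estimate for the query above, using the page-size multiplier heuristic.
const levels = 7;     // nested friends(first: 10) selections
const pageSize = 10;  // taken from the `first` argument
const leafCost = 1;   // assumed cost per `name` field

const estimatedCost = leafCost * pageSize ** levels;
console.log(estimatedCost); // 10000000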

P4sca1 commented 5 years ago

+1 for integrating this functionality directly into apollo-server. IMO protecting against malicious queries is something every GraphQL server implementation needs to consider.

Sangria also has this built in.

While there are already solutions like graphql-cost-analysis, they do not work with Apollo Server 2 without some hacks.

FluorescentHallucinogen commented 5 years ago

FYI, query cost/complexity validation/analysis is not enough. See the following thread for more info: https://github.com/ravangen/graphql-rate-limit/issues/48.

TL;DR: Cost analysis doesn't care whether you receive one request per second or a thousand. A malicious actor could craft an operation that stays under the cost limit but send a large number of requests.

You need fixed-window rate limiting in addition to query cost/complexity validation/analysis.

PTAL at https://github.com/ravangen/graphql-rate-limit and https://github.com/teamplanes/graphql-rate-limit.

P4sca1 commented 5 years ago

👍🏻 One should also validate that, when fetching a list of nodes, a limiting argument is always provided. The GitHub API implements all of this.

It would be great to have all of this built into apollo-server, as it is crucial for any public GraphQL server.
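
A rough sketch of the "always require a limiting argument" check as a graphql-js validation rule — the accepted argument names are an assumption, and it only looks at the outermost list type:

const { GraphQLError, getNullableType, isListType } = require('graphql');

const LIMIT_ARGS = ['first', 'last', 'limit']; // assumed naming convention

// Reject list fields that are queried without a limiting argument.
function requireListLimitRule(context) {
  return {
    Field(node) {
      const type = context.getType();
      if (!type || !isListType(getNullableType(type))) return;

      const args = node.arguments || [];
      const hasLimit = args.some((arg) => LIMIT_ARGS.includes(arg.name.value));
      if (!hasLimit) {
        context.reportError(
          new GraphQLError(`List field "${node.name.value}" must provide one of: ${LIMIT_ARGS.join(', ')}`)
        );
      }
    },
  };
}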

taion commented 5 years ago

If you're doing the cost controls in your GraphQL server (as opposed to in some proxy layer upstream), you ideally want to combine both – assign some fixed query cost budget per time window, and limit on the total cost of queries executed.

The difference of course is that doing rate limiting well requires something like Redis, while complexity validation can be done purely within each instance of your GraphQL server.

Additionally, there's often a good argument for doing the fixed rate limiting upstream of your GraphQL server anyway, in e.g. Nginx or whatever.

But that's definitely a good point.

See also https://github.com/4Catalyzer/graphql-validation-complexity/issues/6#issuecomment-482709475. The extension APIs available in Apollo Server v2 allow a nicer way of incorporating variable values into cost calculations. I haven't had the time to update things to take advantage of that, though.
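
A rough sketch of the cost-budget-per-window idea, kept in memory for brevity — a real deployment would use something like Redis so the budget is shared across instances; the window size and budget here are arbitrary:

// In-memory fixed-window cost budget per client (single instance only).
const WINDOW_MS = 60 * 1000; // 1-minute window (assumption)
const BUDGET = 10000;        // max total query cost per window (assumption)

const windows = new Map();   // clientId -> { start, spent }

function consumeBudget(clientId, queryCost) {
  const now = Date.now();
  let entry = windows.get(clientId);
  if (!entry || now - entry.start >= WINDOW_MS) {
    entry = { start: now, spent: 0 };
    windows.set(clientId, entry);
  }
  if (entry.spent + queryCost > BUDGET) {
    return false; // budget exhausted for this window
  }
  entry.spent += queryCost;
  return true;
}

// Usage: compute queryCost with a complexity rule, then reject the
// request if consumeBudget(clientId, queryCost) returns false.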

jsegaran commented 5 years ago

Hi. Thanks for creating this issue. Apollo Server implements query whitelisting, a method through which developers can protect against malicious queries.

In the future, you will be able to integrate the above heuristics with the request pipeline, which is discussed in the Apollo Server 3.0 Roadmap.

FluorescentHallucinogen commented 4 years ago

To calculate query complexity we need cost values. But where can we find them?

Today I came across the blog post https://blog.apollographql.com/the-new-trace-view-in-apollo-engine-566b25bdfdb0 and an idea came to my mind: what about automatically getting data from Apollo Managed Graph via its API and using it as cost values in the query cost/complexity validation/analysis npm packages?

Is anyone interested in integrating Apollo Managed Graph API?

I've found the API URL: https://engine-graphql.apollographql.com/api/graphql.

FluorescentHallucinogen commented 4 years ago

@taion @ivome @pa-bru @ravangen @kirkness cc.

taion commented 4 years ago

sorta neat. i think you'd have to pull down the values ahead of time as part of the build process, though. you wouldn't want the validation e.g. making API/db calls to get that value. but that would be a really cool idea

FluorescentHallucinogen commented 4 years ago

@taion Just remembered that Apollo VSCode extension shows field stats:

(screenshots: apollo-vscode, engine-stats)

After a little research, I've found this:

https://github.com/apollographql/apollo-tooling/blob/master/packages/apollo-language-server/src/engine/operations/schemaTagsAndFieldStats.ts

FluorescentHallucinogen commented 4 years ago

@taion Agree, caching the stats data and periodically updating the cache would be a great idea!

What about collaborating on implementing this idea into your npm package? I'll help as much as I can. :wink:

taion commented 4 years ago

so... how much do you care about list queries? without https://github.com/4Catalyzer/graphql-validation-complexity/issues/6, we're not going to be able to meaningfully account for query args

graphql-cost-analysis has some better handling for variables, but the API they show doesn't work with batched queries, which we at least use extensively: https://github.com/pa-bru/graphql-cost-analysis#simple-setup

so to have something really good, someone is going to have to make something that (i think) works as an apollo extension, so we can observe the variable values when running the cost analysis

normally i'd say that it'd be better to have a base library that can integrate into this sort of timing/cost data, but... if we need to assume apollo anyway for things to work for batched queries in apollo server, then maybe the cost tracing stats should be built in

thoughts?

taion commented 4 years ago

cc @pa-bru – i don't mean to misrepresent what graphql-cost-analysis can do, in case there is batch support that i just happened to miss

ivome commented 4 years ago

This should work with slicknode/graphql-query-complexity without changes to the core lib.

You can just create a custom estimator that consumes and processes the latencies: https://github.com/slicknode/graphql-query-complexity#creating-custom-estimators This could also be combined with fallback values etc.

This also works as an apollo plugin as implemented in type-graphql: https://github.com/slicknode/graphql-query-complexity/issues/7#issuecomment-510209573
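
Something along these lines (a sketch only — latencyByField and loadFieldLatencies are hypothetical, built from Graph Manager field stats, and schema, parsedQuery, and variables are assumed to be in scope; check the graphql-query-complexity docs for the exact estimator signature):

const { getComplexity, simpleEstimator } = require('graphql-query-complexity');

// Hypothetical map built from Graph Manager field stats, in ms:
// e.g. { 'Query.listNodes': 12.4, 'ExpensiveType.complexField': 240.1 }
const latencyByField = loadFieldLatencies();

// Custom estimator: derive a field's cost from its observed latency.
const latencyEstimator = (options) => {
  const key = `${options.type.name}.${options.field.name}`;
  const latency = latencyByField[key];
  if (latency === undefined) return; // fall through to the next estimator
  return Math.ceil(latency) + options.childComplexity;
};

const complexity = getComplexity({
  schema,
  query: parsedQuery,
  variables,
  estimators: [latencyEstimator, simpleEstimator({ defaultComplexity: 1 })],
});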

taion commented 4 years ago

how do you get the variables when running a batch in apollo?

FluorescentHallucinogen commented 4 years ago

After additional research, I've found the following for development purposes:

apollo-engine-graphql-api

FluorescentHallucinogen commented 4 years ago

A more interesting question: should we use the duration in milliseconds as cost values "as is"? Is this correct?

ravangen commented 4 years ago

It is exciting to see some more ideas shared around this topic! 😄

If Apollo Graph Manager data were to be used to inform a complexity analysis mechanism, it is probably reasonable (at least initially) to have a solution that targets Apollo Server, with setup documentation.

One notable thing with Apollo Server >= 2.4 is its use of an in-memory store that caches (LRU) successfully parsed and validated documents for future requests (https://github.com/apollographql/apollo-server/pull/2111). See the warning at https://github.com/pa-bru/graphql-cost-analysis/issues/32.

In https://github.com/withspectrum/spectrum/pull/2874#issuecomment-381711121, Lee Byron notes that the rules tend to apply in both server and client (tooling) scenarios. He suggests not implementing rate limiting as a validation rule, but instead as its own step. This echoes the graphql.org validation docs' introduction, which focuses validation on the type system and conformity to the specification, not on runtime checks (such as min/max values).

I would recommend exploring the usage of Apollo Server Plugins instead of a validation rule.

I am open to collaborating as I am not happy with how I currently apply graphql-cost-analysis with Apollo Server. As a plugin, graphql-query-complexity looks appealing too.
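
For reference, the rough shape of such a plugin (a sketch under the Apollo Server 2 plugin API; estimateCost is a hypothetical stand-in for whichever cost calculation you pick, and maxCost is arbitrary):

// didResolveOperation runs after parsing/validation and before execution.
const costLimitPlugin = (maxCost) => ({
  requestDidStart: () => ({
    didResolveOperation({ document, request }) {
      // estimateCost: hypothetical — plug in graphql-query-complexity,
      // graphql-cost-analysis, or your own calculation here.
      const cost = estimateCost(document, request.variables);
      if (cost > maxCost) {
        throw new Error(`Operation cost ${cost} exceeds the allowed maximum of ${maxCost}`);
      }
    },
  }),
});

// new ApolloServer({ schema, plugins: [costLimitPlugin(1000)] })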

FluorescentHallucinogen commented 4 years ago

A more interesting question: should we use the duration in milliseconds as cost values "as is"? Is this correct?

I mean the values from Apollo Graph Manager may be less than 1 (e.g. 0.01926215781919789 ms). Should we normalize these values somehow? :thinking:

FluorescentHallucinogen commented 4 years ago

@ivome @ravangen Thoughts?

ivome commented 4 years ago

I see a few problems with the approach of using the latencies directly:

I'd rather use the latencies to make an informed decision about how to implement the maximum query complexity, which I would do on a worst-case basis to have stable query validation.

taion commented 4 years ago

@FluorescentHallucinogen If the timings are in e.g. fractional seconds, then it seems reasonable to impose a maximum cost in terms of e.g. total amount of time a query is allowed to take, rather than in arbitrary cost units.

@ivome I think the idea is to use some suitable long-running average to set per-field costs more intelligently, though. It's a nice way to, say, measure which objects require extra REST calls or whatever.
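
One simple way to turn sub-millisecond averages into usable integer costs is to pick a cost unit (say, 0.1 ms) and round up — a sketch under that assumption:

// Convert per-field average latencies (ms) into integer cost units.
const COST_UNIT_MS = 0.1; // 1 cost point per 0.1 ms of average latency (assumption)

function latencyToCost(avgLatencyMs) {
  return Math.max(1, Math.ceil(avgLatencyMs / COST_UNIT_MS));
}

latencyToCost(0.01926215781919789); // -> 1
latencyToCost(12.4);                // -> 124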