confluentinc / ksql

The database purpose-built for stream processing applications.
https://ksqldb.io

Pull Query: Disable Pull Queries when auth validation is needed on `/query` endpoint #3863

Closed vinothchandar closed 4 years ago

vinothchandar commented 4 years ago

Now that pull queries have moved to the `/query` endpoint, we took a look at what validation happens on this code path.

We are invoking `KsqlAuthorizationValidator#checkAuthorization()` for every request (push or pull) when either `ksql.access.validator.enable=on`, or `ksql.access.validator.enable=auto` and the Kafka cluster has a non-empty value for `authorizer.class.name`.
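To make the enablement rule above concrete, here is a minimal sketch of that decision logic. The class and method names are illustrative only, not the actual ksqlDB implementation:

```java
// Hypothetical sketch of when access validation kicks in, per the
// description above. Not ksqlDB code; names are illustrative.
final class AccessValidatorGate {

  // enableSetting mirrors ksql.access.validator.enable: "on", "off", or "auto".
  // authorizerClassName mirrors the broker's authorizer.class.name config.
  static boolean validationEnabled(
      final String enableSetting,
      final String authorizerClassName
  ) {
    if ("on".equals(enableSetting)) {
      return true;
    }
    if ("auto".equals(enableSetting)) {
      // In "auto" mode, validate only when the Kafka cluster
      // actually has an authorizer configured.
      return authorizerClassName != null && !authorizerClassName.isEmpty();
    }
    return false; // "off" or unrecognized value
  }
}
```

The key point is that in the `auto` + authorizer case, every request (push or pull) goes down the validation path.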

Even though #3696 avoids eager instantiation of the admin client (and other clients) for every request, it only memoizes them per `ServiceContext`, which is still created for every request. When real auth validation is involved, the call path ultimately leads to `KsqlAuthorizationValidatorImpl#checkAccess`, which describes a topic by talking to Kafka:

 private void checkAccess(
      final ServiceContext serviceContext,
      final String topicName,
      final AclOperation operation
  ) {
    final Set<AclOperation> authorizedOperations = serviceContext.getTopicClient()
        .describeTopic(topicName).authorizedOperations();
    ... 

We need to either fail pull queries or skip validation, to avoid creating clients per request again.

vinothchandar commented 4 years ago

cc @spena @vpapavas

big-andy-coates commented 4 years ago

We need to fail pull queries or skip validation, to avoid creating clients per request again.

If I read this right: because we have a known performance hit from checking that the user has the rights to access the data the pull query is using, you're saying we should either:

a) fail the pull query, or
b) skip validation.

This seems a very strange stance to take.

Why would we fail the pull query? What if the user is happy to take the perf hit and just wants the pull query to work?

If we skip validation we're introducing a security hole (it would reintroduce https://github.com/confluentinc/ksql/issues/3772). It's a security hole with the current pull query implementation, as you'd be allowing users access to data in KSQL that they don't have permission to access in Kafka. (I think it helps to think of the materialized state in KSQL as a cache of the data in Kafka: KSQL could rebuild the state from Kafka if it wanted, so using the pre-materialized state is, in essence, just an optimisation.) And it's certainly a security hole as we move forward with pull queries that will actually access Kafka topics, i.e. pull queries against non-materialized state.

Surely the fix here is to introduce some level of caching for the auth responses so that we don't need to check auth on every request.
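One way to picture the caching suggestion is a memoized authorization check keyed on (user, topic, operation) with a TTL, so repeated requests skip the `describeTopic` round trip. This is a hedged sketch under assumed names; the `AuthChecker` interface and `AuthCache` class are hypothetical, not ksqlDB's API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of caching auth results across requests.
// Not ksqlDB code; names and the delegate interface are hypothetical.
final class AuthCache {

  // Stand-in for the real validator that talks to Kafka.
  interface AuthChecker {
    boolean check(String user, String topic, String operation);
  }

  private final Map<String, CachedResult> cache = new ConcurrentHashMap<>();
  private final AuthChecker delegate;
  private final long ttlMillis;

  AuthCache(final AuthChecker delegate, final long ttlMillis) {
    this.delegate = delegate;
    this.ttlMillis = ttlMillis;
  }

  boolean isAuthorized(final String user, final String topic, final String operation) {
    final String key = user + "|" + topic + "|" + operation;
    final long now = System.currentTimeMillis();
    final CachedResult hit = cache.get(key);
    if (hit != null && now - hit.at < ttlMillis) {
      return hit.allowed; // served from cache: no Kafka call
    }
    final boolean allowed = delegate.check(user, topic, operation); // expensive path
    cache.put(key, new CachedResult(allowed, now));
    return allowed;
  }

  private static final class CachedResult {
    final boolean allowed;
    final long at;
    CachedResult(final boolean allowed, final long at) {
      this.allowed = allowed;
      this.at = at;
    }
  }
}
```

The TTL is the trade-off knob here: a revoked permission stays honored until the entry expires, so the window has to be short enough to be acceptable from a security standpoint.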

vinothchandar commented 4 years ago

What if the user is happy to take the perf hit and just wants the pull query to work?

If we create an admin client on every request and make a network call, we're also creating a socket per request; even at a reasonable volume of requests/sec this will connection-storm Kafka and cause stability pains (see https://kafka.apache.org/documentation/ and search for "connection storms"). Ultimately, the user will end up turning off pull queries anyway, so why give them headaches? This is why we must do this before releasing. On the perf front the hit is significant: it drops throughput by ~60%, to under 1000 qps/box or so, and I have explicit guidance to make the perf reasonable at launch.

Surely the fix here is to introduce some level of caching for the auth responses so that we don't need to check auth on every request.

Great. Looks like we've hit the last option from your earlier list of things to try: https://github.com/confluentinc/ksql/issues/3663#issuecomment-546340321.

apurvam commented 4 years ago

As @vinothchandar said, Kafka is known to be very sensitive to connection storms. Internal experience running Kafka in the cloud shows this to be empirically true.

Creating a new connection on each pull query, when we can reasonably expect high volume, is going to cause instability in Kafka. Further, the admin calls to describe the ACLs only go to the controller, which isn't sharded; this creates another hot spot that won't scale to a high volume of queries. And since the availability of the controller is vital to the availability of the cluster, instability there is even worse.

I think we should fix this properly (cached authorization results shared across requests), but in the meantime make sure that we don't destabilize existing Kafka installations: the latter would be worse for adoption, IMO.

big-andy-coates commented 4 years ago

Personally, I still think it's a shame to disable something that may be of use to people who could run it for low-volume / testing workloads without issue.

We could call this out in the release notes, or ship it disabled in the default config with a warning in the comments, etc. This would be better than disabling it outright, IMHO.

Why not introduce a second config that controls whether pull queries are enabled when a custom auth handler is installed?
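That suggestion could look something like the following sketch, where the second config lets operators explicitly opt in to pull queries despite the validation cost. The config name and class are hypothetical, invented for illustration:

```java
// Sketch of the proposed second config. Not ksqlDB code; the setting name
// ("ksql.pull.queries.enable.with.auth", say) and class are hypothetical.
final class PullQueryConfig {

  // authValidationActive: true when the access validator is in effect
  // (see the enablement discussion earlier in the thread).
  // allowWithAuth: the proposed opt-in flag.
  static boolean pullQueriesAllowed(
      final boolean authValidationActive,
      final boolean allowWithAuth
  ) {
    if (!authValidationActive) {
      // No per-request validation cost, so no reason to block pull queries.
      return true;
    }
    // Operator explicitly accepts the perf hit / connection load.
    return allowWithAuth;
  }
}
```

Defaulting the flag to off would address the connection-storm concern for existing installations while still letting low-volume or test deployments keep pull queries working.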