
Audit Logging on the IO Thread Appears to Cause Instability #39658

Open original-brownbear opened 5 years ago

original-brownbear commented 5 years ago

While investigating some test slowness/instability today I found that we are running audit logging (including flushing logs to disk) on the transport/IO thread (the one(s) where the network .select calls happen). This caused the select loop(s) to get blocked for extended periods every now and then, leading to stretches where network IO was frozen. This is especially visible when running large bulk requests (in the test in question, a bulk request of size 10k; https://github.com/elastic/elasticsearch/issues/39575).

I would argue that this is a bug: blocking disk IO shouldn't happen on the network IO thread, as it can lead to unpredictable latency. One side effect (which we observed once) is that the introduced latency can block the IO thread long enough to make SSL handshakes fail. It seems this was indirectly raised in https://github.com/elastic/elasticsearch/issues/34321 but not investigated.
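To illustrate why this matters, here is a stripped-down sketch of a generic NIO event loop (not Elasticsearch's actual transport code): any blocking write+flush issued from the selector thread delays every other channel served by that selector.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Illustrative only: a minimal selector loop showing how a blocking flush on the
// IO thread stalls all channels registered with that selector.
public class SelectorLoopSketch {
    public static void runLoop(Selector selector, FileWriter auditLog) throws IOException {
        while (selector.isOpen()) {
            selector.select();                    // wakes up when any channel is ready
            for (SelectionKey key : selector.selectedKeys()) {
                handleReadyChannel(key);          // decode request, dispatch handler, ...
                auditLog.write("audit event\n");  // problem: blocking disk IO on the IO thread;
                auditLog.flush();                 // every other channel on this selector waits here
            }
            selector.selectedKeys().clear();
        }
    }

    private static void handleReadyChannel(SelectionKey key) {
        // read/write on key.channel() -- omitted in this sketch
    }
}
```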

Stack trace from YourKit: Stacks.txt

cc @DaveCTurner @dimitris-athanasiou

elasticmachine commented 5 years ago

Pinging @elastic/es-security

albertzaharovits commented 5 years ago

Good find @original-brownbear !

We haven't taken any precautions about which thread does the audit write. But we did look into making sure the write is blocking, so that IO errors bubble up if the audit trail cannot be appended.

It seems this was indirectly raised in #34321 but not investigated.

That issue was raised by Yogesh to investigate whether the log4j settings for the audit log flush events after every write. Since that was already the default setting, the issue was closed.

This is especially visible when running large bulk requests (in the test in question a bulk request of size 10k)

https://github.com/elastic/elasticsearch/pull/36245 (from 7.0 onward, I think) changed auditing for bulk requests so that every operation comprising the bulk is audited independently, which made this problem observable.

As a remediation, I think we should switch off the network thread as soon as possible, just after completing the request and headers read, but I'm guessing there are technical motivations behind the current implementation (keeping authentication and authz on the network thread)...

original-brownbear commented 5 years ago

As a remediation, I think we should switch off the network thread as soon as possible, just after completing the request and headers read, but I'm guessing there are technical motivations behind the current implementation (keeping authentication and authz on the network thread)...

Yeah, this is pretty much without alternative imo if we want to write+flush to disk here. The motivation to keep things on the network thread is probably just to avoid the context switch and the performance penalty that comes with it, but as soon as you start doing something blocking like flushing to disk you just have to go to some other thread pool.
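For illustration, the general shape of that fix (hypothetical names, plain java.util.concurrent rather than the Elasticsearch thread pool infrastructure): the network thread only enqueues the audit work and returns to its select loop, while a separate executor absorbs the blocking write+flush.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: auditExecutor and writeAuditEntry are placeholders, not
// the actual security code. The network thread never blocks on disk IO.
public class AuditOffloadSketch {
    private final ExecutorService auditExecutor = Executors.newSingleThreadExecutor();

    // Called from the network thread; must not block.
    public void onRequest(String auditEntry) {
        auditExecutor.execute(() -> writeAuditEntry(auditEntry)); // context switch happens here
    }

    private void writeAuditEntry(String entry) {
        // the blocking write + flush to the audit log would live here
    }
}
```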

jaymode commented 5 years ago

I'm guessing there are technical motivations behind the current implementation (keeping authentication and authz on the network thread)...

I think it was an oversight, to be honest. Early on in 5.0 we moved everything to a different thread before authc/authz, but that was because everything was blocking and causing other issues. We can absolutely move this to a different thread. The question becomes: which thread pool do we use, and should we have limits?

original-brownbear commented 5 years ago

@jaymode

We can absolutely move this to a different thread. The question becomes: which thread pool do we use, and should we have limits?

Maybe just the generic thread pool is fine here? It seems limits probably aren't needed, since we indirectly get the limiting of the write pool in this scenario anyway?

tvernum commented 5 years ago

The only concern I have is that in a typical deployment, most requests will authenticate successfully using cached credentials, and won't be audited on authentication_success.

If we simply switch to a different thread before authentication, it effectively means that every REST request will incur an unnecessary thread switch (which, to the best of my knowledge, we try to avoid in the general case).

I don't have a specific alternative, but I'd like us to make sure we're considering what that context switch costs and how we can minimise it.

jaymode commented 5 years ago

I think there are a few things that could be done to handle more cases, but not necessarily all:

  1. No thread switch if auditing is disabled
  2. If auditing is enabled and authentication success events are being emitted, we switch immediately. I think the same goes for access_granted
  3. Make auditing methods asynchronous and switch only if on a network thread and auditing (see the sketch below), similar to some work done for LDAP by Albert
  4. There are also connection granted and denied events, which we should fork to a new thread to avoid the blocking I/O
  5. Custom async logger for auditing only?

The complexity increases as all of these considerations are added.
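As an illustration of option 3 above, here is a minimal, hypothetical sketch (not the actual security code; auditingEnabled(), isNetworkThread(), and the executor are placeholders) of forking only when we are on a network thread and an audit event would actually be written:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of option 3: only pay for a thread switch when we are on a
// network thread and an audit event will actually be emitted.
public class ConditionalAuditFork {
    private final ExecutorService auditPool = Executors.newCachedThreadPool();

    public void audit(Runnable auditEvent) {
        if (auditingEnabled() == false) {
            return; // option 1: no thread switch (or any work) if auditing is disabled
        }
        if (isNetworkThread()) {
            auditPool.execute(auditEvent); // fork before the blocking write+flush
        } else {
            auditEvent.run(); // already on a worker thread; write inline
        }
    }

    private boolean auditingEnabled() {
        return true; // placeholder for the real audit settings check
    }

    private boolean isNetworkThread() {
        // placeholder heuristic; a real check would identify the transport worker threads
        return Thread.currentThread().getName().contains("transport");
    }
}
```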

original-brownbear commented 5 years ago

I'm wondering if we can't just switch off of the transport pool for bulk requests in general, regardless of the audit logging issue here. I opened https://github.com/elastic/elasticsearch/pull/40866, which does this, and added some reasoning for why I think it's ok.

jaymode commented 5 years ago

Given the discussion in #40866 and the direction that PR is headed, bulk requests have their own reason for moving off of a network thread that is separate from auditing, so the discussion about what to do for auditing should come back to this issue. @jasontedor suggested using a non-blocking queue for audit events. One consideration is that the most security-conscious users would want a guarantee that everything is audited, and some would want a failure mode that prevents a request from executing on audit failure. Another option worth considering is the async logging functionality within log4j2.
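For illustration, a minimal, hypothetical sketch of the non-blocking queue idea (plain java.util.concurrent, not an actual implementation proposal): producers call offer(), which never blocks; a single drainer thread does the blocking write+flush; and whether a full queue means dropping the event or failing the request is exactly the policy question raised above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: request threads enqueue audit events without blocking,
// a single daemon thread drains the queue and does the blocking disk IO.
public class AuditQueueSketch {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

    public AuditQueueSketch() {
        Thread drainer = new Thread(() -> {
            try {
                while (true) {
                    String event = queue.take(); // blocks only the drainer thread
                    writeAndFlush(event);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "audit-drainer");
        drainer.setDaemon(true);
        drainer.start();
    }

    /** Returns false when the queue is full; callers decide whether to drop the event or fail the request. */
    public boolean enqueue(String event) {
        return queue.offer(event); // non-blocking from the caller's perspective
    }

    private void writeAndFlush(String event) {
        // the blocking audit log write would live here
    }
}
```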