ansible-community / ara

ARA Records Ansible and makes it easier to understand and troubleshoot.
https://ara.recordsansible.org
GNU General Public License v3.0

ARA authentication slows down ansible #283

Closed pescobar closed 3 years ago

pescobar commented 3 years ago

I have deployed ARA and I have noticed that the ansible execution time increases a lot when I enable ARA authentication.

The deployment is based on docker, using the official ARA image, postgres as the db backend and nginx as a reverse proxy in front. I am using this compose file, slightly modified: https://gist.github.com/pescobar/c4668bfc4f8bec86bdab06cb3eae3428

The ARA server is located in a different machine so the ansible controller is pushing the data over the network using https.

All the tests are done executing the same playbook.

This is the ansible execution time without ARA:

real    6m29.206s
user    2m30.816s
sys     0m44.413s

This is the ansible execution time using ara WITHOUT authentication:

real    7m34.077s
user    2m38.239s
sys     0m48.011s

This is the ansible execution time using ara WITH authentication:

real    13m6.823s
user    2m38.622s
sys     0m50.371s

Is there any workaround for this? Any setting I can tweak to improve performance when ara authentication is enabled?
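To put the three runs side by side, here is a small sketch that converts the `real` values above to seconds and computes the overhead relative to the run without ARA. The helper name `parse_real` and the run labels are made up for this example; they are not part of ARA or Ansible.

```python
import re


def parse_real(value: str) -> float:
    """Convert a `time`-style duration such as '13m6.823s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", value)
    if match is None:
        raise ValueError(f"unexpected duration format: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Wall-clock ("real") times reported above.
runs = {
    "no ARA": "6m29.206s",
    "ARA, no auth": "7m34.077s",
    "ARA, auth": "13m6.823s",
}

baseline = parse_real(runs["no ARA"])
overhead = {}
for label, value in runs.items():
    seconds = parse_real(value)
    overhead[label] = seconds - baseline
    print(f"{label:>13}: {seconds:7.1f}s  (+{overhead[label]:.1f}s, x{seconds / baseline:.2f})")
```

With these numbers the ARA callback alone adds roughly 65 seconds, while enabling authentication adds roughly 398 seconds over the baseline. The `user` and `sys` times barely move between the runs, which suggests the controller is mostly waiting rather than computing.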

dmsimard commented 3 years ago

Hi @pescobar and thanks for the issue.

I'll preface this with a blog post that I have written a while back to calculate the performance overhead of the callback: https://ara.recordsansible.org/blog/2020/11/01/benchmarking-ansible-and-ara-for-fun-and-science/

It turns out that I had not benchmarked with and without authentication so thanks for providing some metrics.

I am not well informed about the potential overhead of supplying authentication via python's requests library and how it might impact django performance but we can learn together :P

First, some questions:

pescobar commented 3 years ago

Thanks for your reply @dmsimard, I had missed the option ARA_CALLBACK_THREADS

I am using django as authentication backend. I added the users to django as described in the docs.

I did some tests increasing the number of threads and these are the numbers:

without authentication and 4 threads

real    6m56.702s
user    2m42.072s
sys     0m52.565s

without authentication and 8 threads

real    6m58.796s
user    2m43.526s
sys     0m54.924s

with authentication and 8 threads

real    9m44.719s
user    2m40.849s
sys     0m55.699s

There is an improvement, but the performance is still significantly worse when using authentication.

The test playbook is only being applied to 5 hosts so maybe that's the reason why the difference between 4 and 8 threads is so small?

On a side note, I have noticed that once I disabled authentication I was not able to enable it again until I did a redeploy from scratch but maybe this is something for a different issue. I will do some more testing about this and will report back.

dmsimard commented 3 years ago

Thanks for the info.

The test playbook is only being applied to 5 hosts so maybe that's the reason why the difference between 4 and 8 threads is so small?

That's because the callback doesn't allow more than four threads for the time being: https://github.com/ansible-community/ara/blob/31f090e4119d1ac55a9807b9e605082cdd3db1b9/ara/plugins/callback/ara_default.py#L233-L238

The difference in performance from single thread to four threads is most noticeable against a larger amount of hosts.

On a side note, I have noticed that once I disabled authentication I was not able to enable it again until I did a redeploy from scratch but maybe this is something for a different issue. I will do some more testing about this and will report back.

That doesn't ring a bell but it can be a number of things -- if you are running in containers, make sure to modify the configuration file (usually mounted via a volume) and then restart the wsgi server (or in your case, the container) so it picks up the configuration change. Let me know if you find anything and we can investigate.

Otherwise, I am not sure if there is a workaround for the authentication performance overhead yet. If you would like to look, it goes a bit like this:

Something that comes to mind is that maybe there is a notion of cache or cookie ? Is it re-authenticating every time ? Should it authenticate once and then be done with it ?
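On the question of re-authenticating every time: assuming the callback's HTTP client sends HTTP Basic credentials (an assumption here, not something confirmed in this thread), there is no login step and no cookie. Below is a minimal sketch of what such a client attaches to each request, using only the standard library. The function name `basic_auth_header` and the credentials are made up for illustration.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header defined by HTTP Basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(credentials).decode("ascii")}


# Every request carries the same encoded username and password.
headers_per_request = [basic_auth_header("ara", "s3cret") for _ in range(3)]
for headers in headers_per_request:
    print(headers["Authorization"])
```

The header only encodes the credentials, so the server has to verify the password again on each request unless it keeps some form of session.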

hille721 commented 3 years ago

We are also using the django authentication for ara in our company but I never noticed a loss of performance.

Yes, there is a loss of performance by using ara, which is logical because of the data transfers, but no difference if ara is running with or without authentication.

If I have time, I will also run some tests.

dmsimard commented 3 years ago

Yes, there is a loss of performance by using ara, which is logical because of the data transfers, but no difference if ara is running with or without authentication.

If I have time, I will also run some tests.

pescobar's tests indicate that there does seem to be a difference in performance based on whether or not authentication is enabled. It would certainly be useful to know if you are able to reproduce the same results.

Thanks !

hille721 commented 3 years ago

As promised, here are my test results:

hosts: 311
forks: 100
callback_threads: 4

Ara running on Openshift with external MySQL DB. For the authentication, the Django authentication is used (write authentication only)

No ara: 1m33.176s
Ara without auth: 1m55.447s
Ara with auth: 5m5.025s

Honestly, I am really shocked... When we started with ara a couple of months ago, we also did some performance tests, but did not notice that effect. I always thought the overhead simply came from using ara itself. But in fact it is not ara that is causing the overhead, it is really the authentication...

hille721 commented 3 years ago

I guess the reason is kind of logical: using the authentication means that for every API call ara has to do a DB call to verify the username and password. But it is surprising that this makes so much difference.

Maybe the performance would be better by using token based authentication instead of username, password...

pescobar commented 3 years ago

would it make sense to try to reduce the number of connections that ARA does to the API?

I don't know the internals of ARA so maybe what I suggest doesn't make sense, but would it work if ara kept a cache of data that needs to be pushed to the api and made fewer connections pushing more data, instead of making a new connection for each executed task? The cache size could even be an ara config option.

This won't solve the authentication overhead but it should provide a general performance improvement, shouldn't it?
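To make the idea concrete, here is a rough sketch of such a buffer. Everything in it is hypothetical: `BatchingBuffer`, the `send` callable and the `batch_size` option do not exist in ARA, and the sketch assumes the API could accept several results in one request.

```python
from typing import Callable, Dict, List


class BatchingBuffer:
    """Hypothetical buffer: collect results and send them in groups of batch_size."""

    def __init__(self, send: Callable[[List[Dict]], None], batch_size: int) -> None:
        self._send = send
        self._batch_size = batch_size
        self._pending: List[Dict] = []

    def add(self, result: Dict) -> None:
        self._pending.append(result)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is pending; must also be called once at the end of the run."""
        if self._pending:
            self._send(self._pending)
            self._pending = []


# Stand-in for an HTTP call: record how many results each request would carry.
requests_made: List[int] = []
buffer = BatchingBuffer(send=lambda batch: requests_made.append(len(batch)), batch_size=50)
for task_id in range(120):
    buffer.add({"task": task_id, "status": "ok"})
buffer.flush()
print(requests_made)
```

In this toy run 120 results go out in 3 requests instead of 120. The trade-off is that buffered results are not visible on the server until the next flush and would be lost if the process died before it.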

dmsimard commented 3 years ago

would it make sense to try to reduce the number of connections that ARA does to the API?

This is already optimized to some extent. I don't want to say there are no other improvement opportunities but many unnecessary calls to the API were already removed and local caching was added to avoid needing to do additional calls to fetch IDs and such.

I don't know the internals of ARA so maybe what I suggest doesn't make sense, but would it work if ara kept a cache of data that needs to be pushed to the api and made fewer connections pushing more data, instead of making a new connection for each executed task? The cache size could even be an ara config option.

This is not simple due to the synchronous nature of the callback throughout the execution of the playbook. There are pros and cons to this approach, of course, but it's worked generally well so far.

This won't solve the authentication overhead but it should provide a general performance improvement, shouldn't it?

There have been discussions in the past to make the ingestion of events optionally asynchronous with something like a message bus (e.g., rabbitmq) but the need has never been sufficient to justify the increase in complexity and no one has been interested enough to work on it.

I like to say that simplicity is a feature in ara so we have to be wary of the tradeoffs or sacrifices in simplicity to make for the benefit of performance.

Threading did not always exist and enabling it yields significant performance benefits. It's good that we know about the authentication overhead now so we can find an approach that works better to improve things.

Maybe it means switching to a different authentication mechanism like tokens or have it managed by apache/nginx instead.

In fact, when I have time (or if someone beats me to it), I would like to test what the performance of a simple apache htaccess/htpasswd using ARA_EXTERNAL_AUTH looks like in comparison.

pescobar commented 3 years ago

I am trying to do some tests using nginx basic auth but I think I am hitting this problem when I enable basic auth in my nginx reverse proxy https://stackoverflow.com/a/22663390

Is there any way to configure this option in ARA when using the official docker image?

REST_FRAMEWORK = {
    'DEFAULT_AUTHENTICATION_CLASSES': []
}

dmsimard commented 3 years ago

Hi @pescobar,

When the authentication is managed by a webserver in front of django (like apache or nginx), READ_LOGIN_REQUIRED and WRITE_LOGIN_REQUIRED in your server's settings.yaml should both be set to false and EXTERNAL_AUTH to true.

I've just tried it with apache and it works, in essence what I did was:

1) Run a container on port 8000 (podman run --name ara --detach --tty --volume ~/.ara/server:/opt/ara:z -p 8000:8000 quay.io/recordsansible/ara-api:latest)
2) Ensure EXTERNAL_AUTH is true in ~/.ara/server/settings.yaml and restart the container to reload the config
3) Set up apache authentication file with htpasswd -c -m /etc/httpd/.htpasswd testing (type password when prompted)
4) Set up a basic apache vhost with the following:

<VirtualHost *:80>
  ServerName ara.example.org
  ProxyPass / http://127.0.0.1:8000/
  ProxyPassReverse / http://127.0.0.1:8000/

  <Location />
    Deny from all
    AuthUserFile /etc/httpd/.htpasswd
    AuthName "Restricted Area"
    AuthType Basic
    Satisfy Any
    require valid-user
  </Location>
</VirtualHost>

That's it.

If I skip step 2, I still get the authentication prompt but even if I type in the right credentials, it sends me right back to the authentication prompt. Setting EXTERNAL_AUTH addresses that and maybe that is the issue you are seeing with nginx ? There should be no need for a workaround.

dmsimard commented 3 years ago

By the way, I am planning to do a formal benchmark (like here and here) of:

I'm sure the data will be interesting and it might also help find improvement opportunities.

dmsimard commented 3 years ago

I've reproduced the performance degradation when using django's authentication -- thanks for reporting the issue :+1:

When using the benchmark playbook with 50 tasks and 100 hosts (5000 results):

There is definitely something going on with django's authentication -- maybe it has to do with a database lookup for every call ? It doesn't add a lot of time but even a few milliseconds adds up quickly when there can be thousands of calls throughout the duration of a playbook.
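As a back-of-the-envelope illustration of how a small per-call cost adds up, here is a sketch that assumes one API call per result and an even spread of the work across callback threads. Both are simplifications, and the per-call costs are hypothetical values, not measurements.

```python
def added_seconds(calls: int, per_call_ms: float, threads: int = 1) -> float:
    """Extra wall-clock time if every call pays per_call_ms, spread evenly over threads."""
    return calls * per_call_ms / 1000 / threads


calls = 5000  # 50 tasks x 100 hosts, as in the benchmark above
for per_call_ms in (5, 20, 50):  # hypothetical per-call costs
    single = added_seconds(calls, per_call_ms)
    pooled = added_seconds(calls, per_call_ms, threads=4)
    print(f"{per_call_ms:>3} ms/call: {single:6.1f}s single-threaded, {pooled:6.1f}s with 4 threads")
```

Under these assumptions, 20 ms of extra work per call already costs 100 seconds on a single thread, or 25 seconds when perfectly spread over four threads.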

In order to make sure it wasn't a fluke, I added nginx in front of django while letting django handle the authentication and got the same (slow) result.

In comparison, webservers have a flat text file that is surely loaded into memory which is faster.

I might not go as far as to call this a bug but it's certainly worth documenting the behavior and improving the docs around EXTERNAL_AUTH, maybe even recommend it. We have a section in the troubleshooting documentation about performance: https://ara.readthedocs.io/en/latest/troubleshooting.html#degraded-playbook-execution-performance

dmsimard commented 3 years ago

Hi,

I've sent a PR to document the fact that there is a performance overhead when using django authentication and in fact recommend using a server in front to handle authentication for now: https://github.com/ansible-community/ara/pull/319

It includes instructions on how to set up EXTERNAL_AUTH with nginx and apache2.

I still plan on doing more benchmarking in the future but once that PR lands I will consider the issue closed unless someone wants to investigate further.

dmsimard commented 3 years ago

The updated documentation is up at https://ara.readthedocs.io/en/latest/api-security.html#authentication-and-user-management

Specifically the summary portion about performance: [Screenshot from 2021-08-04 11-44-16]

Thanks for finding out about this and creating the issue, much appreciated.

dmsimard commented 3 years ago

By the way, integration testing with authentication was a gap in the CI pending on https://github.com/ansible-community/ara/issues/39 and https://github.com/ansible-community/ara-collection/issues/4.

I'm happy to report that the following PR adds support AND integration tests for authentication: https://github.com/ansible-community/ara-collection/pull/38

Frazew commented 2 months ago

Hi! Sorry for digging this issue up, feel free to tell me if you'd prefer I open another one.

I believe the issue is that Django uses PBKDF2 for password hashing, which is intentionally very expensive. Since there's no session management, every single request needs to be checked again against the database, which means every request implies running the password through the PBKDF2 function.
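The cost difference can be measured outside of Django with the standard library alone. This is a rough sketch, not Django's own hasher code: the iteration count is an assumed value (Django's default differs by release), and `time_hash` is a helper made up for this example.

```python
import hashlib
import os
import time


def time_hash(func, repeats: int) -> float:
    """Return the average seconds per call of func over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    return (time.perf_counter() - start) / repeats


password = b"correct horse battery staple"
salt = os.urandom(16)
iterations = 260_000  # assumed value; Django's default iteration count differs by release

pbkdf2 = time_hash(lambda: hashlib.pbkdf2_hmac("sha256", password, salt, iterations), repeats=5)
md5 = time_hash(lambda: hashlib.md5(salt + password).hexdigest(), repeats=10_000)

print(f"PBKDF2-SHA256 ({iterations} iterations): {pbkdf2 * 1000:.1f} ms per check")
print(f"MD5: {md5 * 1_000_000:.2f} us per check")
print(f"5000 authenticated calls: ~{5000 * pbkdf2:.0f}s of PBKDF2 work on the server")
```

The absolute numbers depend on the CPU, but the gap between the two is typically several orders of magnitude, which matches the slowdown seen when every API call has to re-verify the password.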

Unless I'm somehow mistaken in my tests, this can be validated by telling Django to use MD5:

diff --git a/ara/server/settings.py b/ara/server/settings.py
index 3a0aaac..a859e88 100644
--- a/ara/server/settings.py
+++ b/ara/server/settings.py
@@ -116,6 +116,10 @@ DATABASE_PORT = settings.get("DATABASE_PORT", None)
 DATABASE_CONN_MAX_AGE = settings.get("DATABASE_CONN_MAX_AGE", 0)
 DATABASE_OPTIONS = settings.get("DATABASE_OPTIONS", {})

+PASSWORD_HASHERS = [
+    "django.contrib.auth.hashers.MD5PasswordHasher",
+]
+
 DATABASES = {
     "default": {
         "ENGINE": "ara.server.db.backends.distributed_sqlite" if DISTRIBUTED_SQLITE else DATABASE_ENGINE,

And then creating a superadmin user and using it to capture playbooks. Using ARA with the builtin Django authentication is then orders of magnitude faster (since MD5 is basically free).

I'm not sure what the best path forward would be:

dmsimard commented 2 months ago

Hi @Frazew,

This is a good find and I believe you have a good understanding of the issue.

In regards to your suggested options:

  • implement session management (i.e. the client logs in when the playbook starts and then uses a session cookie to authenticate all subsequent requests?)
  • intentionally weaken the password hashing setting to something less expensive (not great, but very quick solution)
  • something else?
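For illustration only, the first option can be pictured with a toy in-memory sketch: pay for the expensive password hash once at login, then authenticate later requests with a random token. `ToySessionAuth` is hypothetical and is not how ARA or Django implement anything; the iteration count is an assumed value.

```python
import hashlib
import hmac
import os
import secrets
from typing import Optional

ITERATIONS = 260_000  # assumed PBKDF2 cost, not taken from ARA


class ToySessionAuth:
    """Hypothetical in-memory auth: slow password check once, cheap token checks after."""

    def __init__(self, username: str, password: str) -> None:
        self._salt = os.urandom(16)
        self._username = username
        self._stored = self._hash(password)
        self._tokens = set()
        self.slow_checks = 0

    def _hash(self, password: str) -> bytes:
        return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), self._salt, ITERATIONS)

    def login(self, username: str, password: str) -> Optional[str]:
        """Run the expensive hash once and hand back a random session token."""
        candidate = self._hash(password)
        self.slow_checks += 1
        if username != self._username or not hmac.compare_digest(candidate, self._stored):
            return None
        token = secrets.token_urlsafe(32)
        self._tokens.add(token)
        return token

    def check(self, token: str) -> bool:
        """Cheap per-request check: a set lookup, no password hashing."""
        return token in self._tokens


auth = ToySessionAuth("ara", "s3cret")
token = auth.login("ara", "s3cret")
results = [auth.check(token) for _ in range(5000)]
print(all(results), auth.slow_checks)
```

In this sketch 5000 requests cost a single slow password check. A real implementation would also need token expiry, storage shared between server workers and client-side handling, which is the maintenance burden discussed in this thread.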

I would not personally encourage users to configure a weaker hash for authentication but if this is something you do not mind, by all means you are free to run with a local patch on your side if that works for you. We can point users to this issue if they are in search of a workaround leveraging django authentication.

From a project standpoint I am not particularly excited in pursuing the development (and maintenance) of session management client/server side with django. This is in large part because web servers and proxies are doing a very good job at handling authentication and I would rather encourage users in this direction since in a production setting they should probably already be running one in front of the server.

For the sake of simplicity there is little to no code that has to deal with authentication in ara: there is no formal RBAC other than read/write, either you have access or you don't and I am OK with that.

Frazew commented 2 months ago

Hi!

Thank you for the detailed response, I understand and fully agree with your stance. My main concern was that some people could discard ARA because of performance issues without realizing that it's actually due to authentication.

If that's ok I'll prepare a tiny PR to update the Troubleshooting documentation section to clarify the following point:

When enabling authentication, consider using EXTERNAL_AUTH instead of the Django built-in user management to avoid a database lookup performance hit on every query

My initial understanding of this sentence when troubleshooting performance was that Django-to-database latency was the culprit, which led me to initially disregard it (as I knew database latency was fine). I think mentioning password hashing here would clarify that Django auth itself really is the issue.

Thank you again!