DoS attack on audit log

The audit log can be rendered unusable if sufficiently many events have been recorded in a short period of time. We noticed on the attack system that the audit log failed to load. The error message is:

2021-02-11 15:58:01.239 [error] Postgrex.Protocol (#PID<0.2682.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.4650.0> timed out because it queued and checked out the connection for longer than 15000ms
#PID<0.4650.0> was at location:
    (stdlib 3.13.2) gen.erl:208: :gen.do_call/4
    (stdlib 3.13.2) gen_statem.erl:671: :gen_statem.call_dirty/4
    (ssl 10.1) ssl_connection.erl:2576: :ssl_connection.call/2
    (postgrex 0.15.7) lib/postgrex/protocol.ex:2838: Postgrex.Protocol.msg_recv/4
    (postgrex 0.15.7) lib/postgrex/protocol.ex:1880: Postgrex.Protocol.recv_bind/3
    (postgrex 0.15.7) lib/postgrex/protocol.ex:1772: Postgrex.Protocol.bind_execute/4
    (db_connection 2.3.0) lib/db_connection/holder.ex:316: DBConnection.Holder.holder_apply/4
    (db_connection 2.3.0) lib/db_connection.ex:1254: DBConnection.run_execute/5
2021-02-11 15:58:01.240 [error] #PID<0.4650.0> running AirWeb.Endpoint (connection #PID<0.4649.0>, stream id 1) terminated
Server: attack.aircloak.com:80 (http)
Request: GET /admin/audit_log
** (exit) an exception was raised:
    ** (DBConnection.ConnectionError) ssl recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)
        (ecto_sql 3.5.3) lib/ecto/adapters/sql.ex:751: Ecto.Adapters.SQL.raise_sql_call_error/1
        (ecto_sql 3.5.3) lib/ecto/adapters/sql.ex:684: Ecto.Adapters.SQL.execute/5
        (ecto 3.5.4) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
        (ecto 3.5.4) lib/ecto/repo/queryable.ex:17: Ecto.Repo.Queryable.all/3
        (air 21.1.0) lib/air/service/audit_log.ex:250: Air.Service.AuditLog.fetch_paginated/2
        (air 21.1.0) lib/air/service/audit_log.ex:118: Air.Service.AuditLog.grouped_for/1
        (air 21.1.0) lib/air_web/live/admin/audit_log_live/index.ex:36: AirWeb.Admin.AuditLogLive.Index.handle_params/3
        (phoenix_live_view 0.13.3) lib/phoenix_live_view/static.ex:303: Phoenix.LiveView.Static.call_mount_and_handle_params!/5

The number of audit log events within the default time window we load events for (1 month) was close to 2 million (which doesn't seem excessive...):

attack_air=# SELECT count(*) FROM audit_logs where inserted_at > NOW() - interval '1 month';
  count
---------
 1937008
(1 row)

Verdict:

as the audit log grows, the database query we execute on load takes too long to execute. It is terminated and an error page is shown instead
I believe this is the offending query: https://github.com/Aircloak/aircloak/blob/master/air/lib/air/service/audit_log.ex#L261-L296. The complex part is the window function which groups together similar events

Potential solutions:

Cap the number of records we load as part of the database request to prevent timeout (worse user experience as the audit log potentially doesn't show as much data as it could), or
Do the aggregation in Elixir instead. This has the benefit that we can stream the data from the database and keep it in memory. As we are showing the audit log as a phoenix live view we could therefore immediately render the screen, and update it as we receive more data. This way the screen will update for a while as the user views it, but we will not run into the database timeout problem (a problem that might arise in this case (as we keep all the data in memory) is that the server might run low on memory if the audit log is truly large.

Aircloak / aircloak

DoS attack on audit log #4886