The audit log can be rendered unusable if sufficiently many events have been recorded in a short period of time.
We noticed on the attack system that the audit log failed to load.
The error message is:
2021-02-11 15:58:01.239 [error] Postgrex.Protocol (#PID<0.2682.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.4650.0> timed out because it queued and checked out the connection for longer than 15000ms
#PID<0.4650.0> was at location:
(stdlib 3.13.2) gen.erl:208: :gen.do_call/4
(stdlib 3.13.2) gen_statem.erl:671: :gen_statem.call_dirty/4
(ssl 10.1) ssl_connection.erl:2576: :ssl_connection.call/2
(postgrex 0.15.7) lib/postgrex/protocol.ex:2838: Postgrex.Protocol.msg_recv/4
(postgrex 0.15.7) lib/postgrex/protocol.ex:1880: Postgrex.Protocol.recv_bind/3
(postgrex 0.15.7) lib/postgrex/protocol.ex:1772: Postgrex.Protocol.bind_execute/4
(db_connection 2.3.0) lib/db_connection/holder.ex:316: DBConnection.Holder.holder_apply/4
(db_connection 2.3.0) lib/db_connection.ex:1254: DBConnection.run_execute/5
2021-02-11 15:58:01.240 [error] #PID<0.4650.0> running AirWeb.Endpoint (connection #PID<0.4649.0>, stream id 1) terminated
Server: attack.aircloak.com:80 (http)
Request: GET /admin/audit_log
** (exit) an exception was raised:
** (DBConnection.ConnectionError) ssl recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)
(ecto_sql 3.5.3) lib/ecto/adapters/sql.ex:751: Ecto.Adapters.SQL.raise_sql_call_error/1
(ecto_sql 3.5.3) lib/ecto/adapters/sql.ex:684: Ecto.Adapters.SQL.execute/5
(ecto 3.5.4) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
(ecto 3.5.4) lib/ecto/repo/queryable.ex:17: Ecto.Repo.Queryable.all/3
(air 21.1.0) lib/air/service/audit_log.ex:250: Air.Service.AuditLog.fetch_paginated/2
(air 21.1.0) lib/air/service/audit_log.ex:118: Air.Service.AuditLog.grouped_for/1
(air 21.1.0) lib/air_web/live/admin/audit_log_live/index.ex:36: AirWeb.Admin.AuditLogLive.Index.handle_params/3
(phoenix_live_view 0.13.3) lib/phoenix_live_view/static.ex:303: Phoenix.LiveView.Static.call_mount_and_handle_params!/5
The number of audit log events within the default time window we load events for (1 month) was close to 2 million (which doesn't seem excessive...):
attack_air=# SELECT count(*) FROM audit_logs where inserted_at > NOW() - interval '1 month';
count
---------
1937008
(1 row)
Verdict:
as the audit log grows, the database query we execute on load takes too long to execute. It is terminated and an error page is shown instead
Cap the number of records we load as part of the database request to prevent timeout (worse user experience as the audit log potentially doesn't show as much data as it could), or
Do the aggregation in Elixir instead. This has the benefit that we can stream the data from the database and keep it in memory. As we are showing the audit log as a phoenix live view we could therefore immediately render the screen, and update it as we receive more data. This way the screen will update for a while as the user views it, but we will not run into the database timeout problem (a problem that might arise in this case (as we keep all the data in memory) is that the server might run low on memory if the audit log is truly large.
We now show an error screen if there's too much data to load. The user can narrow the filters down to get only what they need. This may be considered as solved.
The audit log can be rendered unusable if sufficiently many events have been recorded in a short period of time. We noticed on the attack system that the audit log failed to load. The error message is:
The number of audit log events within the default time window we load events for (1 month) was close to 2 million (which doesn't seem excessive...):
Verdict:
Potential solutions: