2ndQuadrant / audit-trigger

Simple, easily customised trigger-based auditing for PostgreSQL (Postgres). See also pgaudit.

Normalize client query-level columns to avoid redundancy #45

Closed: manderswy closed this issue 2 years ago

manderswy commented 2 years ago

We implemented an auditing system per the code provided here. One issue we've had, though, is that client-level info (client_query, client_addr, client_port, action_tstamp_tx, etc.) is repeated for each row updated by a given transaction. Under certain usage patterns this wouldn't matter much, but in our case it has caused our audit.logged_actions table to become extremely bloated. As an example of the kind of transaction that has caused serious bloat, a backend user may update a bunch of rows with something like the following:

UPDATE foo SET bar1='updated bar 1', bar2 = ... WHERE bar_id = 1;
UPDATE foo SET bar1='updated bar 2', bar2 = ... WHERE bar_id = 2;
...
UPDATE foo SET bar1='updated bar 100000', bar2 =... WHERE bar_id = 100000;

Run as a single transaction, this leads to 100,000 rows in the logged_actions table, each carrying the same values for client_query (i.e., the full multi-line statement shown above), plus repeated client_addr, client_port, transaction_id, action_tstamp_tx, etc. The redundancy doesn't matter much for most of those columns, but in the example above the client_query column repeats the entire 100,000-line statement in every row. When the rows being updated are themselves large (e.g., many varchar columns), that adds up to a lot of disk space spent on redundant client_query text.
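
If it helps anyone measure the same problem, something along these lines against the stock audit.logged_actions table (column names as in this repo's schema; adjust if yours differ) shows how much of the table is repeated query text:

SELECT count(*)                                       AS audit_rows,
       count(DISTINCT client_query)                   AS distinct_queries,
       pg_size_pretty(sum(octet_length(client_query))) AS client_query_bytes
FROM audit.logged_actions;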

By way of explanation, we aggregate data from other live databases to which we don't have direct access or connection. Sometimes those databases modify records on their side, so we periodically (e.g., annually) update our records to reflect those changes, using a unique ID to compare and update data. Our older workflows involved users constructing update statements using Excel or some other means outside the database, then running these via pgAdmin.

I am trying to move backend users away from doing things in that manner, so that each update or delete statement is run as its own transaction and the client_query recorded for each audited row is just that one short statement. But we're stuck dealing with the existing records in a logged_actions table that is around 56 GB (for comparison, almost half the total size of our database). Even with more normal/sane usage patterns, there's a significant DRY violation here. Where updates are done by referencing a temporary table after a CSV import, for example, the client_query (as in the example below) still gets unnecessarily duplicated for every row that's updated as a result:

UPDATE foo
    SET
        col_1 = update_table.col_1,
        col_2 = update_table.col_2,
        col_3 = update_table.col_3,
        ...
        col_n = update_table.col_n
    FROM update_table
    WHERE update_table.id = foo.id;

I'm wondering whether you've considered refactoring this so that the client-level info (client_query, etc.) is written to one table and the information actually being audited (schema_name, table_name, action, row_data, changed_fields, etc.) is written to another, with the latter referencing a row in the former via foreign key. It seems the trigger function could be modified to write the client-level information to that table if it isn't already there, return its primary key either way, and then use that key as the FK in an "event"-level table that tracks the actual changes.

manderswy commented 2 years ago

In case anyone is interested, I normalized our audit tables as I suggested above. The two new tables and the revised auditing function are provided below for reference. For our particular database, this took the auditing system from around 70 GB to around 14 GB.

My main concern was whether this would cause performance issues, since the trigger function has to check whether the client_query already exists so it isn't written more than once when a single client query modifies multiple rows. I was pleasantly surprised that it seemed to have no noticeable impact on speed, even with half a million rows to compare against in the client_query table. Note that I did use a little trick that likely helps: I store the md5() hash of the client_query in that table and compare it against the hashed client_query in the trigger function, with an index on client_query.client_query_hash to make that comparison fast. I went with that approach given how massive some of the client_query values were.

I'll close this, but wanted to provide the solution I came up with, in case it's of any use to others, or if it's something that might be considered for future development on this repo.

-- Sequences backing the surrogate keys (referenced by the column DEFAULTs below)
CREATE SEQUENCE IF NOT EXISTS audit.client_query_query_id_seq;
CREATE SEQUENCE IF NOT EXISTS audit.event_event_id_seq;

CREATE TABLE IF NOT EXISTS audit.client_query
(
    query_id bigint NOT NULL DEFAULT nextval('audit.client_query_query_id_seq'::regclass),
    client_query text,
    client_query_hash varchar(32),
    CONSTRAINT query_id PRIMARY KEY (query_id)
);
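
The index on client_query_hash mentioned above isn't part of the table definition itself; a minimal version (the index name here is just illustrative) would be:

CREATE INDEX IF NOT EXISTS client_query_hash_idx
    ON audit.client_query (client_query_hash);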

CREATE TABLE IF NOT EXISTS audit.event
(
    event_id bigint NOT NULL DEFAULT nextval('audit.event_event_id_seq'::regclass),
    schema_name text COLLATE pg_catalog."default" NOT NULL,
    table_name text COLLATE pg_catalog."default" NOT NULL,
    relid oid NOT NULL,
    session_user_name text COLLATE pg_catalog."default",
    action_tstamp_tx timestamp with time zone NOT NULL,
    action_tstamp_stm timestamp with time zone NOT NULL,
    action_tstamp_clk timestamp with time zone NOT NULL,
    transaction_id bigint,
    application_name text COLLATE pg_catalog."default",
    client_addr inet,
    client_port integer,
    action text COLLATE pg_catalog."default" NOT NULL,
    row_data hstore,
    changed_fields hstore,
    statement_only boolean NOT NULL,
    query_id bigint REFERENCES audit.client_query(query_id),
    CONSTRAINT event_pkey PRIMARY KEY (event_id),
    CONSTRAINT event_action_check CHECK (action = ANY (ARRAY['I'::text, 'D'::text, 'U'::text, 'T'::text]))
);
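
For ad-hoc querying, a view along these lines (the name is illustrative) joins the two tables back into the old flat logged_actions shape; the LEFT JOIN keeps events whose query text wasn't logged:

CREATE OR REPLACE VIEW audit.logged_actions_view AS
SELECT e.*, q.client_query
FROM audit.event e
LEFT JOIN audit.client_query q USING (query_id);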

CREATE OR REPLACE FUNCTION audit.if_modified_func()
    RETURNS trigger
    LANGUAGE 'plpgsql'
    COST 100
    VOLATILE NOT LEAKPROOF SECURITY DEFINER
    SET search_path=pg_catalog, public
AS $BODY$
DECLARE
    client_query_row audit.client_query;
    event_row audit.event;
    include_values boolean;
    log_diffs boolean;
    hash_of_client_query varchar(32);
    h_old hstore;
    h_new hstore;
    excluded_cols text[] = ARRAY[]::text[];
BEGIN
    IF TG_WHEN <> 'AFTER' THEN
        RAISE EXCEPTION 'audit.if_modified_func() may only run as an AFTER trigger';
    END IF;

    hash_of_client_query = md5(current_query());

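    -- Build a candidate client_query row up front; the nextval() here only ends up
    -- in an actual audit.client_query row if this query hasn't been logged yet.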
    client_query_row = ROW(
        nextval('audit.client_query_query_id_seq'),
        current_query(),
        hash_of_client_query
    );

    event_row = ROW(
        nextval('audit.event_event_id_seq'),          -- event_id
        TG_TABLE_SCHEMA::text,                        -- schema_name
        TG_TABLE_NAME::text,                          -- table_name
        TG_RELID,                                     -- relation OID for much quicker searches
        session_user::text,                           -- session_user_name
        current_timestamp,                            -- action_tstamp_tx
        statement_timestamp(),                        -- action_tstamp_stm
        clock_timestamp(),                            -- action_tstamp_clk
        txid_current(),                               -- transaction ID
        current_setting('application_name'),          -- client application
        inet_client_addr(),                           -- client_addr
        inet_client_port(),                           -- client_port
        substring(TG_OP,1,1),                         -- action
        NULL, NULL,                                   -- row_data, changed_fields
        'f',                                          -- statement_only
        NULL::bigint                                  -- query_id, linked below
        );

    IF NOT TG_ARGV[0]::boolean IS DISTINCT FROM 'f'::boolean THEN
        -- Query-text logging is disabled for this trigger: leave event_row.query_id
        -- NULL and skip the audit.client_query lookup/insert below.
        hash_of_client_query = NULL;
    END IF;

    IF TG_ARGV[1] IS NOT NULL THEN
        excluded_cols = TG_ARGV[1]::text[];
    END IF;

    IF (TG_OP = 'UPDATE' AND TG_LEVEL = 'ROW') THEN
        event_row.row_data = hstore(OLD.*) - excluded_cols;
        event_row.changed_fields =  (hstore(NEW.*) - event_row.row_data) - excluded_cols;
        IF event_row.changed_fields = hstore('') THEN
            -- All changed fields are ignored. Skip this update.
            RETURN NULL;
        END IF;
    ELSIF (TG_OP = 'DELETE' AND TG_LEVEL = 'ROW') THEN
        event_row.row_data = hstore(OLD.*) - excluded_cols;
    ELSIF (TG_OP = 'INSERT' AND TG_LEVEL = 'ROW') THEN
        event_row.row_data = hstore(NEW.*) - excluded_cols;
    ELSIF (TG_LEVEL = 'STATEMENT' AND TG_OP IN ('INSERT','UPDATE','DELETE','TRUNCATE')) THEN
        event_row.statement_only = 't';
    ELSE
        RAISE EXCEPTION '[audit.if_modified_func] - Trigger func added as trigger for unhandled case: %, %',TG_OP, TG_LEVEL;
        RETURN NULL;
    END IF;
    -- Link to the existing audit.client_query row if this exact query text has
    -- already been logged (matched on its md5 hash); otherwise write it and
    -- capture the new query_id.
    IF hash_of_client_query IS NOT NULL THEN
        SELECT query_id INTO event_row.query_id
            FROM audit.client_query
            WHERE client_query_hash = hash_of_client_query;
        IF NOT FOUND THEN
            INSERT INTO audit.client_query VALUES (client_query_row.*)
                RETURNING query_id INTO event_row.query_id;
        END IF;
    END IF;
    INSERT INTO audit.event VALUES (event_row.*);
    RETURN NULL;
END;
$BODY$;
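
Attaching the revised function to a table works the same way as with the stock setup, either via the repo's audit.audit_table() helper or with triggers like the following (foo is just a placeholder table name; the first argument turns query-text logging on):

CREATE TRIGGER audit_trigger_row
    AFTER INSERT OR UPDATE OR DELETE ON foo
    FOR EACH ROW EXECUTE PROCEDURE audit.if_modified_func('true');

CREATE TRIGGER audit_trigger_stm
    AFTER TRUNCATE ON foo
    FOR EACH STATEMENT EXECUTE PROCEDURE audit.if_modified_func('true');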