Wrong diff when storing old data

Hi, first of all, great package.

My use case: I am storing the old data, to make it easier to revert to something. I am storing transaction_id in context, so I can revert all related changes in a request easily. (At least this is the goal)

If I am using the following settings:

PGHISTORY_DEFAULT_TRACKERS = (
    pghistory.InsertEvent(),
    pghistory.UpdateEvent(row=pghistory.Old),
    pghistory.DeleteEvent(row=pghistory.Old),
)

the pgh_diff is wrong, as it fetch the current and previous data of the event. I have changed the raw sql to reach my goal to the this:

WITH pghistory_events AS (SELECT CONCAT('sample.ExampleModelEvent', ':', _pgh_obj_event.pgh_id) AS pgh_slug,
                                 _pgh_obj_event.pgh_id,
                                 _pgh_obj_event.pgh_created_at,
                                 _pgh_obj_event.pgh_label,
                                 _pgh_obj_event.pgh_obj_id,
                                 'sample.ExampleModelEvent'                                     AS pgh_model,
                                 'sample.ExampleModel'                                          AS pgh_obj_model,
                                 (SELECT JSONB_OBJECT_AGG(filtered.key, filtered.value)
                                  FROM (SELECT key,
                                               value
                                        FROM JSONB_EACH(_pgh_obj_event._curr_data::JSONB)) filtered
                                  WHERE filtered.key NOT LIKE 'pgh_%')                          AS pgh_data,
                                 (SELECT JSONB_OBJECT_AGG(filtered.key, filtered.value)
                                  FROM (SELECT key,
                                               value
                                        FROM JSONB_EACH(_pgh_obj_event._prev_data::JSONB)) filtered
                                  WHERE filtered.key NOT LIKE 'pgh_%')                          AS pgh_old_data,
                                 (SELECT JSONB_OBJECT_AGG(curr.key, array [prev.value, curr.value])
                                  FROM (SELECT key,
                                               value
                                        FROM JSONB_EACH(_pgh_obj_event._curr_data::JSONB)) curr
                                           LEFT OUTER JOIN (SELECT key,
                                                                   value
                                                            FROM JSONB_EACH(_pgh_obj_event._prev_data::JSONB)) prev
                                                           ON curr.key = prev.key
                                  WHERE curr.key NOT LIKE 'pgh_%'
                                    AND curr.value != prev.value
                                    AND prev IS NOT NULL)                                       AS pgh_diff,
                                 (SELECT JSONB_OBJECT_AGG(curr.key, array [prev.value, curr.value])
                                  FROM (SELECT key,
                                               value
                                        FROM JSONB_EACH(_pgh_obj_event._real_data::JSONB)) curr
                                           LEFT OUTER JOIN (SELECT key,
                                                                   value
                                                            FROM JSONB_EACH(_pgh_obj_event._prev_data::JSONB)) prev
                                                           ON curr.key = prev.key
                                  WHERE curr.key NOT LIKE 'pgh_%'
                                    AND curr.value != prev.value
                                    AND prev IS NOT NULL)                                       AS pgh_diff_to_real,
                                 _pgh_obj_event.pgh_context_id,
                                 _pgh_obj_event.pgh_context
                          FROM (SELECT pgh_id,
                                       pgh_created_at,
                                       pgh_label,
                                       row_to_json(_event)               AS _prev_data,
                                       COALESCE(
                                                       LEAD(row_to_json(_event))
                                                       OVER (PARTITION BY _event.pgh_obj_id ORDER BY _event.pgh_id),
                                                       (SELECT row_to_json(em)
                                                        FROM sample_examplemodel em
                                                        WHERE em.id = _event.pgh_obj_id)
                                       )                                 AS _curr_data,
                                       (SELECT row_to_json(em)
                                        FROM sample_examplemodel em
                                        WHERE em.id = _event.pgh_obj_id) AS _real_data,
                                       pgh_context_id,
                                       pgh_context,
                                       pgh_obj_id::TEXT
                                FROM sample_examplemodelevent _event
                                ORDER BY _event.pgh_id) _pgh_obj_event)
SELECT "pghistory_events"."pgh_slug",
       "pghistory_events"."pgh_model",
       "pghistory_events"."pgh_id",
       "pghistory_events"."pgh_created_at",
       "pghistory_events"."pgh_label",
       "pghistory_events"."pgh_data",
       "pghistory_events"."pgh_old_data",
       "pghistory_events"."pgh_diff",
       "pghistory_events"."pgh_diff_to_real",
       "pghistory_events"."pgh_context_id",
       "pghistory_events"."pgh_context",
       "pghistory_events"."pgh_obj_model",
       "pghistory_events"."pgh_obj_id"
FROM "pghistory_events"
WHERE "pghistory_events"."pgh_slug" = 'sample.ExampleModelEvent:17'
LIMIT 21

Important part is:

row_to_json(_event)  AS _prev_data,
COALESCE(
    LEAD(row_to_json(_event))
         OVER (PARTITION BY _event.pgh_obj_id ORDER BY _event.pgh_id),
        (SELECT row_to_json(em) FROM sample_examplemodel em WHERE em.id = _event.pgh_obj_id)
    )  AS _curr_data,
(SELECT row_to_json(em) FROM sample_examplemodel em WHERE em.id = _event.pgh_obj_id) AS _real_data,

Also added pgh_old_data, so it is easier to debug. In this setup:

pgh_old_data is the current event data, with Old, this is the previous object state
pgh_data is the next or the real data
phg_diff is the diff between the current and next event data or the real object, depends on existence -> this is the real diff that should be implemented with pghistory.Old
pgh_diff_to_real is the diff between the current state and the change prior to the event
sample_examplemodel is the tracked model
sample_examplemodelevent is the event model

Can we add changes like this optionally? As I understand, all these affect only the admin, and relevant only when using the pghistory.Old

The phg_diff is important, but the diff_to_real is also nice to have, as with that, we can show the diffs between the revert and current state before the user click on the revert.

EDIT: Added important sql sections separately for easier reading

I have move forward a bit with the actual implementation:

    def _get_select(self, event_model):
        where_clause = self._get_where_clause(event_model)

        (
            final_context_columns_clause,
            context_column_clause,
            context_id_column_clause,
            context_join_clause,
            annotated_context_columns_clause,
        ) = self._get_context_clauses(event_model)

        pgh_obj_model = event_model._meta.get_field("pgh_obj").remote_field.model
        pgh_obj_table = pgh_obj_model._meta.db_table
        update_trackers = [
            tracker for tracker in event_model.pgh_trackers if tracker.label == "update"
        ]

        curr_data_clause = "row_to_json(_event) AS _curr_data"
        prev_data_clause = """
            LAG(row_to_json(_event))
              OVER (
                PARTITION BY _event.pgh_obj_id
                ORDER BY _event.pgh_id
              ) AS _prev_data
        """
        pgh_obj_id_column_clause = "pgh_obj_id::TEXT"
        if not hasattr(event_model, "pgh_obj_id"):
            prev_data_clause = "NULL::JSONB AS _prev_data"
            pgh_obj_id_column_clause = "NULL::TEXT AS pgh_obj_id"
        elif update_trackers and update_trackers[0].row == "OLD":  # Check if Old or New
            curr_data_clause = f"""
                COALESCE(
                    LEAD(row_to_json(_event)) OVER (PARTITION BY _event.pgh_obj_id ORDER BY _event.pgh_id),
                    (SELECT row_to_json(m) FROM {pgh_obj_table} m WHERE m.id = _event.pgh_obj_id)
                ) AS _curr_data
            """
            prev_data_clause = "row_to_json(_event) AS _prev_data"

        event_table = event_model._meta.db_table
        return f"""
            SELECT
              CONCAT('{event_model._meta.label}', ':', _pgh_obj_event.pgh_id) AS pgh_slug,
              _pgh_obj_event.pgh_id,
              _pgh_obj_event.pgh_created_at,
              _pgh_obj_event.pgh_label,
              {final_context_columns_clause}
              _pgh_obj_event.pgh_obj_id,
              '{event_model._meta.label}' AS pgh_model,
              '{event_model.pgh_tracked_model._meta.label}' AS pgh_obj_model,
              (
                  SELECT JSONB_OBJECT_AGG(filtered.key, filtered.value)
                  FROM
                    (
                        SELECT key, value
                        FROM JSONB_EACH(_pgh_obj_event._curr_data::JSONB)
                    ) filtered
                  WHERE filtered.key NOT LIKE 'pgh_%%'
              ) AS pgh_data,
              (
                SELECT JSONB_OBJECT_AGG(curr.key, array[prev.value, curr.value])
                FROM
                  (
                    SELECT key, value
                    FROM JSONB_EACH(_pgh_obj_event._curr_data::JSONB)
                  ) curr
                  LEFT OUTER JOIN
                  (
                    SELECT key, value
                    FROM JSONB_EACH(_pgh_obj_event._prev_data::JSONB)
                  ) prev
                  ON curr.key = prev.key
                WHERE curr.key NOT LIKE 'pgh_%%'
                  AND curr.value != prev.value
                  AND prev IS NOT NULL
              ) AS pgh_diff,
              _pgh_obj_event.pgh_context_id,
              _pgh_obj_event.pgh_context
            FROM (
              SELECT
                pgh_id,
                pgh_created_at,
                pgh_label,
                {curr_data_clause},
                {annotated_context_columns_clause}
                {prev_data_clause},
                {context_id_column_clause},
                {context_column_clause},
                {pgh_obj_id_column_clause}
              FROM {event_table} _event
              {context_join_clause}
              {where_clause}
              ORDER BY _event.pgh_id
            ) _pgh_obj_event
        """

I can open a PR and update to align with project scopes as needed.

This part is not quite clear for me

if not hasattr(event_model, "pgh_obj_id"):
            prev_data_clause = "NULL::JSONB AS _prev_data"
            pgh_obj_id_column_clause = "NULL::TEXT AS pgh_obj_id"

@bgervan I'm not following what is "wrong". What is the error? Can you boil it down to a working example to illustrate what results you are seeing and what you are expecting to see?

The aggregate event model is just a way to quickly query diffs, mostly for the admin, but some people rely on this utility in their application code. If you are wanting to fundamentally alter behavior to match the scope of a project you're doing, it would probably be best to inherit the model and override it accordingly unless the changes you want are broadly applicable to many use cases.

If you can clarify if this is a bug or a feature request, that would be helpful, thanks!

I created a sample model with

PGHISTORY_DEFAULT_TRACKERS = (
    pghistory.InsertEvent(),
    pghistory.UpdateEvent(row=pghistory.Old),
    pghistory.DeleteEvent(row=pghistory.Old),
)

Insert: name: Sample Update 1: name: Sample (update1) Update. 2: name: Sample (update2)

With the above changes, the diffs are the following respectively:

null

null

{
    "name": [
        "Sample",
        "Sample (update1)"
    ],
    "updated_at": [
        "2024-10-10T05:21:29.032954+00:00",
        "2024-10-10T05:21:38.175114+00:00"
    ]
}

So when I updated to Sample (update2), I have the diff of the previous change, which is a bug with the pghistory.Old settings. For update1, I receive null, when there is a change.

With my update, the diffs are the following:

In [3]: MyEvents.objects.tracks(example)[0].pgh_diff

In [4]: MyEvents.objects.tracks(example)[1].pgh_diff
Out[4]: 
{
   'name': ['Sample', 'Sample (update1)'],
   'updated_at': [
       '2024-10-10T05:21:29.032954+00:00',
       '2024-10-10T05:21:38.175114+00:00'
   ]
}

In [5]: MyEvents.objects.tracks(example)[2].pgh_diff
Out[5]: 
{
  'name': ['Sample (update1)', 'Sample (update2)'],
  'updated_at': [
       '2024-10-10T05:21:38.175114+00:00',
       '2024-10-10T05:21:43.575616+00:00'
   ]
}

So for insert it is still NULL (correct) and for the updates, it shows the correct values.

The fix is simply shifts the comparison objects from last and current, to current and next

You are seeing these results because you overrode the default trackers.

If you are tracking inserts and updates of the OLD row, this means the first two tracked rows in the history table will be identical, which is why you see two null values first - there are no changes yet.

For example, say you have a row with col_1.

INSERT of col_1=a will produce tracked history of col_1=a
UPDATE of col_1=b will track the OLD row before the change and produce a tracked history of col_1=a
UPDATE of col_1=c will track the OLD row before the change and produce a tracked history of col_1=b

So, in the diff, you'll see null, null, and then col_1=a->b

The aggregated event utility is just looking at changes in the history rows. It doesn't understand the context of the trackers that have been set up

I understood what happens, that is what my fix trying to update, checking the tracker if its NEW or OLD and return the diff accordingly.

You don't consider it as a bug when the user uses the row=pghistory.Old? I found a couple of issue with my fix, but fixed since the first implementation.

If you are tracking OLD rows on update, your snapshot table lags behind the current table. The aggregate queryset that shows diffs only examines snapshot rows, so what you're seeing is expected behavior.

One alternative approach is this:

PGHISTORY_DEFAULT_TRACKERS = (
    pghistory.InsertEvent(),
    pghistory.UpdateEvent(row=pghistory.New),
    pghistory.DeleteEvent(row=pghistory.Old),
)

If you do this, the example from my previous comment now looks like this:

INSERT of col_1=a will produce a snapshot of col_1=a
UPDATE of col_1=b will track the NEW row after the change and produce a snapshot of col_1=b
UPDATE of col_1=c will track the NEW row after the change and produce a snapshot of col_1=c

I.e. your snapshot table will actually show differences in the first two snapshot entries.

Another possibility is to filter out inserts from the snapshots table or not track them at all, but your diffs will always lag behind the current version of the row

To summarize, the aggregate diff utility is just showing diffs from the snapshot table. What's been reported here is expected behavior. Altering it would introduce regressions for others and require an API break. I'm open to altering the behavior, but this is pretty fundamental to how things are architected. If we try to pull in current values not from snapshot tables, it opens up a can of worms for other use cases

Opus10 / django-pghistory

Wrong diff when storing old data #171