adamcharnock / django-hordak

Double entry accounting in Django
http://django-hordak.readthedocs.io
MIT License
238 stars 55 forks

Replace use of django-sql-utils with a TransactionView database view #127

Closed adamcharnock closed 3 months ago

adamcharnock commented 3 months ago

Removing use of django-sql-utils (sql_util), as the admin transaction listing UI now completes in a constant number of database queries, i.e. O(1) queries (see test).
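
Roughly the property the test checks, as a sketch only (the real test may be structured differently; `client` here is assumed to be a test client logged in as a superuser):

from django.db import connection
from django.test.utils import CaptureQueriesContext
from django.urls import reverse

# Sketch: the admin transaction changelist should run a fixed number of
# queries however many transactions/legs exist (i.e. no per-row queries).
with CaptureQueriesContext(connection) as ctx:
    client.get(reverse("admin:hordak_transaction_changelist"))

query_count = len(ctx.captured_queries)  # should stay constant as the data grows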

adamcharnock commented 3 months ago

@PetrDlouhy – I'm not entirely sure why django-sql-utils is no longer providing a performance boost, so please let me know if I am missing something obvious.

PetrDlouhy commented 3 months ago

@adamcharnock django-sql-utils didn't reduce the number of DB queries, but it made them faster. SubquerySum fetches the related records from the DB in a subquery rather than by joining tables, which can sometimes be faster.

This would need to be tested on a large dataset to show the speed benefit. If you want, I can dig deeper and provide timing data from my database.
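
Roughly the difference, as a sketch (assuming the reverse relation from Transaction to Leg is named `legs`, and ignoring the multi-currency wrinkle discussed later):

from django.db.models import Sum
from sql_util.utils import SubquerySum  # provided by django-sql-utils

from hordak.models import Transaction

# Plain aggregate: Django JOINs hordak_leg onto hordak_transaction and groups,
# which can get expensive on large tables.
with_join = Transaction.objects.annotate(total=Sum("legs__amount"))

# SubquerySum: the same total computed as a correlated subquery per transaction,
# avoiding the join; this is often (though not always) faster.
with_subquery = Transaction.objects.annotate(total=SubquerySum("legs__amount"))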

adamcharnock commented 3 months ago

Ah ok, that makes sense. I've created the create_benchmark_transactions command, and with that I have created 300k transactions / 600k legs. With those records I get a load time of around 350ms on the Transactions list admin page.
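
For reference, the data was generated with that command; a minimal invocation (assuming it needs no arguments — check its --help for the real options) would be:

from django.core.management import call_command

# Generate benchmark data (this produced ~300k transactions / ~600k legs for me);
# the command may accept options controlling the volume — see its --help.
call_command("create_benchmark_transactions")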

What kind of numbers of transactions/legs are you dealing with on your side?

Update: Just seen your comment on another PR:

> Our service has ~1M users (x2 Hordak accounts) and ~13M transactions (x2 legs). Some of the accounts have a much larger number of transactions than others; I would expect the maximum can reach 1M transactions per account. But I had to solve the performance issues a few years back, so I expect we had ~10x less data then.
>
> We are using PostgreSQL, so a PostgreSQL-only feature is not a problem for us. I am not sure if DB functions would be fast enough, but I can do some tests.

adamcharnock commented 3 months ago

So, given the above, it looks like I need to scale up by a couple orders of magnitude. I'll come back to this and give it some more thought.

adamcharnock commented 3 months ago

Ok, I've scaled up by 10x, and what previously took around 300ms now takes around 3 seconds. So at your ~100x scale this is certainly an issue.

I'll see if I can figure out something.

adamcharnock commented 3 months ago

Hi @PetrDlouhy, would you have a moment to run this query against your database?

SELECT
    T.*,
    jsonb_agg(jsonb_build_object('amount', L.amount, 'currency', L.currency)) as balance
FROM
    hordak_transaction T
INNER JOIN LATERAL (
    SELECT SUM(amount) AS amount, amount_currency AS currency
    FROM hordak_leg L
    WHERE L.transaction_id = T.id AND L.amount > 0
    GROUP BY amount_currency
    ) L ON True
GROUP BY T.id, T.uuid, T.timestamp, T.date, T.description;

I'm getting a result in 144ms on my benchmark DB (3M transactions, 6M legs) for 500 transactions. If I run the query over all 3M transactions it returns in around 4-5 seconds.

If this works well then I'm thinking this would make a good database view, which the admin listing could then pull from. We could also put any other useful aggregate fields in there too.
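
On the Django side, the admin could then read from an unmanaged model over that view, along these lines (a sketch only; the view/table name and field list here are placeholders):

from django.db import models


class TransactionView(models.Model):
    # Unmanaged model backed by the proposed database view; Django never
    # creates or migrates it, it only reads from it.
    uuid = models.UUIDField()
    timestamp = models.DateTimeField()
    date = models.DateField()
    description = models.TextField()
    balance = models.JSONField()  # the jsonb_agg(...) column from the query above

    class Meta:
        managed = False
        db_table = "hordak_transaction_view"  # placeholder name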

PetrDlouhy commented 3 months ago

If I run it for 10,000 transactions it is super quick:

explain analyze SELECT
    T.*,
    jsonb_agg(jsonb_build_object('amount', L.amount, 'currency', L.currency)) as balance
FROM
    hordak_transaction T
INNER JOIN LATERAL (
    SELECT SUM(amount) AS amount, amount_currency AS currency
    FROM hordak_leg L
    WHERE L.transaction_id = T.id AND L.amount > 0
    GROUP BY amount_currency
    ) L ON True
GROUP BY T.id, T.uuid, T.timestamp, T.date, T.description limit 10000;
                                                                                   QUERY PLAN                                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4.81..47559.33 rows=10000 width=93) (actual time=0.070..77.373 rows=10000 loops=1)
   ->  GroupAggregate  (cost=4.81..61231339.06 rows=12876028 width=93) (actual time=0.069..76.287 rows=10000 loops=1)
         Group Key: t.id
         ->  Nested Loop  (cost=4.81..61166958.92 rows=12876028 width=97) (actual time=0.040..43.418 rows=10001 loops=1)
               ->  Index Scan using hordak_transaction_pkey on hordak_transaction t  (cost=0.09..282032.19 rows=12876028 width=61) (actual time=0.016..2.520 rows=10001 loops=1)
               ->  HashAggregate  (cost=4.72..4.72 rows=1 width=36) (actual time=0.004..0.004 rows=1 loops=10001)
                     Group Key: l.amount_currency
                     Batches: 1  Memory Usage: 24kB
                     ->  Index Scan using idx_hordak_leg_transaction_id on hordak_leg l  (cost=0.09..4.70 rows=16 width=10) (actual time=0.002..0.003 rows=1 loops=10001)
                           Index Cond: (transaction_id = t.id)
                           Filter: (amount > '0'::numeric)
                           Rows Removed by Filter: 2
 Planning Time: 0.232 ms
 Execution Time: 78.118 ms
(14 rows)

If I run it for the whole database, it is much slower:

explain analyze SELECT
    T.*,
    jsonb_agg(jsonb_build_object('amount', L.amount, 'currency', L.currency)) as balance
FROM
    hordak_transaction T
INNER JOIN LATERAL (
    SELECT SUM(amount) AS amount, amount_currency AS currency
    FROM hordak_leg L
    WHERE L.transaction_id = T.id AND L.amount > 0
    GROUP BY amount_currency
    ) L ON True
GROUP BY T.id, T.uuid, T.timestamp, T.date, T.description;
                                                                                   QUERY PLAN                                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=4.81..61231339.06 rows=12876028 width=93) (actual time=203.820..132172.565 rows=12916116 loops=1)
   Group Key: t.id
   ->  Nested Loop  (cost=4.81..61166958.92 rows=12876028 width=97) (actual time=203.788..88950.936 rows=12916116 loops=1)
         ->  Index Scan using hordak_transaction_pkey on hordak_transaction t  (cost=0.09..282032.19 rows=12876028 width=61) (actual time=0.015..4760.019 rows=12916119 loops=1)
         ->  HashAggregate  (cost=4.72..4.72 rows=1 width=36) (actual time=0.006..0.006 rows=1 loops=12916119)
               Group Key: l.amount_currency
               Batches: 1  Memory Usage: 24kB
               ->  Index Scan using idx_hordak_leg_transaction_id on hordak_leg l  (cost=0.09..4.70 rows=16 width=10) (actual time=0.003..0.005 rows=1 loops=12916119)
                     Index Cond: (transaction_id = t.id)
                     Filter: (amount > '0'::numeric)
                     Rows Removed by Filter: 3
 Planning Time: 0.191 ms
 JIT:
   Functions: 17
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 2.298 ms, Inlining 36.786 ms, Optimization 105.760 ms, Emission 61.021 ms, Total 205.866 ms
 Execution Time: 132947.094 ms
(17 rows)

PetrDlouhy commented 3 months ago

But I expect that is fine, at least for me. If I remember correctly, I don't need to access all balances at once. I usually only need the balance for one account; the maximum is an admin view with 100 accounts.

adamcharnock commented 3 months ago

Great! And it sounds like it is mostly scaling linearly with the number of rows: roughly 1,000x the rows takes roughly 1,000x longer (78ms for 10,000 rows vs ~133s for ~12.9M rows), which I feel OK about.

Ok, I've already got the migration in place locally to create the view. I'll get the admin pulling data from it, then see how things look.
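
Roughly what the migration looks like — a sketch, not the committed version (view name and migration dependency are placeholders):

from django.db import migrations

CREATE_VIEW_SQL = """
CREATE VIEW hordak_transaction_view AS  -- placeholder view name
SELECT
    T.*,
    jsonb_agg(jsonb_build_object('amount', L.amount, 'currency', L.currency)) AS balance
FROM hordak_transaction T
INNER JOIN LATERAL (
    SELECT SUM(amount) AS amount, amount_currency AS currency
    FROM hordak_leg L
    WHERE L.transaction_id = T.id AND L.amount > 0
    GROUP BY amount_currency
) L ON True
GROUP BY T.id, T.uuid, T.timestamp, T.date, T.description;
"""


class Migration(migrations.Migration):
    dependencies = [("hordak", "0001_initial")]  # placeholder; point at the latest hordak migration

    operations = [
        migrations.RunSQL(
            CREATE_VIEW_SQL,
            reverse_sql="DROP VIEW hordak_transaction_view;",
        ),
    ]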

adamcharnock commented 3 months ago

This is looking ready now. Notes:

  1. We don't calculate multi-currency transaction totals in MySQL (if someone can put together a PR for this then great, but it was not straightforward for me to implement). In Postgres it works fine.
  2. This is a very efficient way to calculate transaction totals, although it will likely still take multiple minutes for 10M+ transactions.
  3. The reason it is a little more complex than one might expect is multi-currency transactions (see the example below).
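
For illustration, a single transaction with legs in two currencies ends up with a per-currency balance rather than one total, something like:

# Hypothetical `balance` value for one multi-currency transaction row in the view:
# one entry per currency, rather than a single summed amount.
balance = [
    {"amount": "100.00", "currency": "EUR"},
    {"amount": "110.00", "currency": "USD"},
]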

(Will update comment as I think of more)