Open PetrDlouhy opened 1 year ago
Attention: 11 lines in your changes are missing coverage. Please review.
Comparison is base (40a4be4) 92.97% compared to head (fefe20a) 93.18%.
I have now updated this PR to a working state. It works, but there are some problems:
@PetrDlouhy I was thinking about this and would recommend a different approach if you're open to it.
To speed up `Balance` calculations with a large number of `Leg`s, a different solution would be to use an `AccountStatement` record. This comes directly from the StackOverflow post discussed in https://github.com/adamcharnock/django-hordak/issues/44#issuecomment-1476010105:
In the SO post, it's called `LedgerStatement`, but for Hordak, it would be called `AccountStatement`.
Key Columns of the `AccountStatement`
- `balance` - money
- `balance_currency`
- `account_id`
- `calculated_to_leg_id` - Points to the last leg it includes as part of its balance
- `account_statement_id` - Previous time this account created an `AccountStatement`.

Functionality
- `balance` calculation only needs to sum `Leg`s from the last `AccountStatement` for a given currency. `Account`s would only sum the `Leg`s that are directly attached.
- An `AccountStatement` is created every X number of `Leg`s for each currency for that account since the last `AccountStatement`.
- An `AccountStatementAuditor` or similar would double-check the accuracy of each `AccountStatement`.

Differences from StackOverflow (SO)
The SO diagram uses `date`, while this hard-references `Leg`s.
Dates are common for things like "Monthly Statements", which is great for humans, but that's not what we care about here. Obviously, if we used date/time, you could end up with 0 `Leg`s between statements, or 1 billion.
We want predictable speed, so by hard-linking via `calculated_to_leg_id` instead of time, we can use "number of legs between each statement" and tune it to 10,000, 100k, 1m? Whatever has a good balance between speed and the number of `AccountStatement`s created for that DB instance's CPU+HD speed.
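The trade-off being tuned here can be put in rough numbers. A minimal sketch (the 13M total is an assumed dataset size for illustration only): with N legs per statement, a worst-case balance query sums at most N legs on top of one checkpoint row, at the cost of storing total/N statements.

```python
# Assumed dataset size for illustration; not a Hordak constant.
TOTAL_LEGS = 13_000_000

for legs_per_statement in (10_000, 100_000, 1_000_000):
    statements = TOTAL_LEGS // legs_per_statement
    # Worst case, a balance query sums every leg since the newest checkpoint,
    # i.e. up to legs_per_statement rows.
    print(f"N={legs_per_statement:>9,}: {statements:>5,} statements stored, "
          f"<= {legs_per_statement:,} legs summed per query")
```

Larger N means fewer checkpoint rows but slower worst-case reads; the right value depends on the instance's CPU and disk speed, as noted above.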
Benefits
- If an `AccountStatement` is never generated (i.e. the background job is never run), then the query includes all `Leg`s.

Drawbacks
- `Leg`s being updated. In terms of accounting, you should never update historicals, but should instead add adjustments/transactions/legs for corrections.
- On `update`: invalidate `AccountStatement`s equal to or older than the `Leg` being updated.

Thoughts?
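The checkpoint scheme proposed above can be sketched in a few lines. This is an illustration only, not Hordak code: the table layout and the `get_balance` helper are assumptions, and sqlite3 stands in for the real database.

```python
import sqlite3
from decimal import Decimal

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE leg (
    id INTEGER PRIMARY KEY, account_id INTEGER, currency TEXT, amount TEXT);
CREATE TABLE account_statement (  -- the proposed checkpoint table
    id INTEGER PRIMARY KEY, account_id INTEGER, currency TEXT,
    balance TEXT, calculated_to_leg_id INTEGER);
""")

# Ten legs of 1.00 EUR each on account 1.
for _ in range(10):
    conn.execute(
        "INSERT INTO leg (account_id, currency, amount) VALUES (1, 'EUR', '1.00')")

# Pretend a background job checkpointed the balance up to leg 8.
conn.execute(
    "INSERT INTO account_statement (account_id, currency, balance, calculated_to_leg_id)"
    " VALUES (1, 'EUR', '8.00', 8)")

def get_balance(account_id, currency):
    """Balance = newest checkpoint + only the legs created after it."""
    row = conn.execute(
        "SELECT balance, calculated_to_leg_id FROM account_statement"
        " WHERE account_id = ? AND currency = ?"
        " ORDER BY calculated_to_leg_id DESC LIMIT 1",
        (account_id, currency)).fetchone()
    base, last_leg_id = (Decimal(row[0]), row[1]) if row else (Decimal("0"), 0)
    legs = conn.execute(
        "SELECT amount FROM leg WHERE account_id = ? AND currency = ? AND id > ?",
        (account_id, currency, last_leg_id)).fetchall()
    return base + sum((Decimal(a) for (a,) in legs), Decimal("0"))

print(get_balance(1, "EUR"))  # sums only legs 9 and 10 on top of the checkpoint
```

The query cost now scales with legs-since-last-checkpoint rather than total legs, which is the whole point of the proposal.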
Some quick thoughts:
@PetrDlouhy thanks for the feedback:
- `AccountStatement` is a confusing name. Maybe `BalanceStatement`?

Reservations with the current RunningTotals
Agreed that signals make me pause a bit, but I don't think that's a blocker for this feature.
In my eyes, the biggest concern with the current architecture is that `RunningTotal`s is, at its core, a differential cache. These are hard to get right at scale. Edge cases take forever to crop up and are many times insanely difficult to debug.
One example: while it's been correctly identified that a table `lock` is helpful, adding a `sleep(2)` to simulate connections holding onto the `lock` too long (a DB network issue) results in weird `balance`s.
Separately, looking at `django-concurrent-test-helper`, there's a note about the methodology of process forking: it uses threads instead of processes. I'm not familiar enough with psycopg2, Django, and Python to know whether the threading is isolated enough that the concurrent test is doing exactly what we expect it to do.
Recommendations
- Move `update_running_totals()` to a DB function/trigger, resulting in guarantees on the values.

As it stands, any organization like our own would not be able to adopt this feature, because we rely on `Balance` being 100% correct as we move money in bank accounts based on the current value of the balance. If a `Transaction`+`Leg`s fail to insert because `RunningTotal` could not be updated, that's okay/handleable. It's when there's a failure but it acts like it worked that we have real trouble.
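The property this recommendation asks for can be demonstrated outside Postgres. The sketch below uses sqlite3 and made-up table names, purely to show the idea: when the database itself maintains the total via a trigger, the leg insert and the total update live in the same transaction, so an application-level failure cannot leave them out of sync the way a skipped Django signal can.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE leg (id INTEGER PRIMARY KEY, account_id INTEGER, amount REAL);
CREATE TABLE running_total (
    account_id INTEGER PRIMARY KEY, balance REAL NOT NULL DEFAULT 0);

-- The database, not application code, keeps the total current: the trigger
-- fires inside the same transaction as the INSERT, so both commit or fail together.
CREATE TRIGGER leg_insert AFTER INSERT ON leg
BEGIN
    INSERT OR IGNORE INTO running_total (account_id) VALUES (NEW.account_id);
    UPDATE running_total SET balance = balance + NEW.amount
    WHERE account_id = NEW.account_id;
END;
""")

conn.execute("INSERT INTO leg (account_id, amount) VALUES (1, 10.0)")
conn.execute("INSERT INTO leg (account_id, amount) VALUES (1, 2.5)")
print(conn.execute(
    "SELECT balance FROM running_total WHERE account_id = 1").fetchone()[0])
# 12.5
```

In Hordak's case the real implementation would be a PostgreSQL function/trigger, but the guarantee being sought is the same.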
@nitsujri Seems very reasonable.
I am not sure if `BalanceStatement`s are 100% immune to rounding errors, but the clean thread safety seems like a big benefit. It makes the design more of an integral part of the application and less of a caching tool.
I am not sure, though, whether I will find enough time for this in the near future.
This discussion sounds interesting and good to me. I have been aware of this possibly becoming a performance issue.
Question: What kind of scale are you seeing this become an issue at? (i.e. roughly how many legs do you have on a big account)
I'm very much in favour of this being done in-database rather than in-django. I think there are a couple of sides of this:
- Creating `BalanceStatement` records (Is this done online upon insert, or offline as a batch job?)
- Invalidating `BalanceStatement` records should a `Leg` be updated.

I could also see this functionality being provided rather easily by a PostgreSQL materialised view, but that would be PostgreSQL-only (and `BalanceStatement`s would be updated as a batch job). How would people feel about that?
I just had a little play with the SQL required to generate balances for all accounts (which would be useful for implementing this as a materialised view). Not sure if this will be useful, but I need to run, so I'll leave it here in case it is:
UPDATE: Ignore this old implementation, instead see the new version in source.
```sql
SELECT
    A.id as account_id,
    L.*
FROM hordak_account A
INNER JOIN LATERAL
    (
        SELECT
            L2.amount_currency as balance_currency,
            COALESCE(SUM(L2.amount), 0.0) as balance,
            MAX(L2.id) calculated_to_leg_id
        FROM hordak_account A2
        INNER JOIN public.hordak_leg L2 on L2.account_id = A2.id
        WHERE A2.lft >= A.lft AND A2.rght <= A.rght AND A.tree_id = A2.tree_id
        GROUP BY L2.amount_currency
    ) L ON True;
```
The results will be unique on `(account_id, balance_currency)`.
EDIT 1: Related: Calculating the balances for a list of accounts is also slow, because this is all calculated in Python/Django and not directly in the database. If we had a database function for account balance calculations (`ACCOUNT_BALANCE(account_id)`) then we could pull balances from the database along with their accounts (relates to #52). For example: `Account.objects.annotate(balance=Balance(F('account_id')))`
EDIT 2: A question occurs to me: Is a `BalanceStatement` supposed to be used as an internal caching tool only, or is this also a user-facing concept in some way? I suspect it is the former.
@adamcharnock Our service has ~1M users (x2 Hordak accounts), ~13M transactions (x2 legs). Some of the accounts have a much larger number of transactions than others; I would expect that the maximum can reach 1M transactions per account. But I had to solve the performance issues a few years back, so that was with ~10x less data.
We are using PostgreSQL, so PSQL-only is not a problem for us. I am not sure if DB functions would be fast enough, but I can do some tests.
That SQL function doesn't run very quickly for me. This is `explain analyze` for 1 account:
```sql
explain analyze SELECT
    A.id as account_id,
    L.*
FROM hordak_account A
INNER JOIN LATERAL
    (
        SELECT
            L2.amount_currency as balance_currency,
            COALESCE(SUM(L2.amount), 0.0) as balance,
            MAX(L2.id) calculated_to_leg_id
        FROM hordak_account A2
        INNER JOIN public.hordak_leg L2 on L2.account_id = A2.id
        WHERE A2.lft >= A.lft AND A2.rght <= A.rght AND A.tree_id = A2.tree_id
        GROUP BY L2.amount_currency
    ) L ON True where A.id=1;
```

```
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=15610.28..15614.34 rows=1 width=44) (actual time=31348.592..31348.600 rows=1 loops=1)
  ->  Index Scan using hordak_account_pkey on hordak_account a  (cost=0.09..4.09 rows=1 width=16) (actual time=0.010..0.015 rows=1 loops=1)
        Index Cond: (id = 1)
  ->  GroupAggregate  (cost=15610.20..15610.24 rows=1 width=40) (actual time=31348.579..31348.581 rows=1 loops=1)
        Group Key: l2.amount_currency
        ->  Sort  (cost=15610.20..15610.21 rows=19 width=14) (actual time=25666.045..28243.838 rows=12684935 loops=1)
              Sort Key: l2.amount_currency
              Sort Method: external merge  Disk: 322928kB
              ->  Nested Loop  (cost=0.17..15610.12 rows=19 width=14) (actual time=0.392..20596.186 rows=12684935 loops=1)
                    ->  Index Scan using idx_hordak_account_on_tree_id_lft_id on hordak_account a2  (cost=0.09..4.09 rows=1 width=4) (actual time=0.020..0.022 rows=1 loops=1)
                          Index Cond: ((tree_id = a.tree_id) AND (lft >= a.lft))
                          Filter: (rght <= a.rght)
                    ->  Index Scan using hordak_leg_account_id on hordak_leg l2  (cost=0.09..15581.32 rows=8237 width=18) (actual time=0.370..18445.563 rows=12684935 loops=1)
                          Index Cond: (account_id = a2.id)
Planning Time: 1.824 ms
Execution Time: 31401.735 ms
(16 rows)
```
Oof, 31 seconds. Ok, I'll take another look and see what I can do. Off-the-cuff thoughts:
I don't use any child accounts. I just have 2 accounts for every user, and then 3 internal accounts for the inbound/outbound transactions.
Ah. Yes, I see you are (reasonably) using the function from the comment above. I've improved this now and you can find the better version here:
Once you've run this SQL, you should be able to do this:

```
postgres.public> select get_balance(7)
[2024-06-28 20:25:18] 1 row retrieved starting from 1 in 430 ms (execution: 420 ms, fetching: 10 ms)
```

That's 420ms for an account with 1 million legs. What do you see on your side?
UPDATE: This is on an M2 Macbook. Also, if I just calculate the balance for the one account (not including children) then it shaves about 30-40% off the execution time. This is a decent win, but I think we'll get bigger gains from adding running totals, as per this PR.
I'm copying this comment to the #126 PR too.
With a huge number of transactions, I started to have problems with the performance of computing account totals. I tried to start computing running totals for the accounts, but I realized that the task is more complicated than I anticipated. I ended up optimizing the performance of the server instead, but I will probably need to implement this functionality eventually.
I am leaving the work-in-progress code here in case anyone is interested in finishing it.