dessalines opened this issue 1 year ago
I would intuitively say that most of the reason for the slow response is not the triggers but rather the inline activity sending (#3493). The reason is that the database updates lock only single rows and are thus perfectly serializable. Basically, your vote only has to wait for other votes on the same exact post/comment, so if the vote count changes by 2 when you press the button, it only had to wait for one other vote, each of which should take less than 10 ms to process in the database.
I don't have good numbers though so I might be wrong. You can track these timings by enabling the pg_stat_statements.track='all' option.
Edit: I have actually not looked at the non-simple-increment-decrement triggers at all, so possibly disregard the above depending on when exactly those fire.
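For reference, a minimal sketch of enabling this (assuming the `pg_stat_statements` extension is available; the column names below are for PostgreSQL 13+, where older versions use `total_time`/`mean_time`):

```sql
-- postgresql.conf (or ALTER SYSTEM ... then restart):
--   shared_preload_libraries = 'pg_stat_statements'
--   pg_stat_statements.track = 'all'   -- 'all' also tracks statements run
--                                      -- inside functions and trigger bodies
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Find the most expensive statements, including trigger internals:
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```

Without `track = 'all'`, only top-level statements show up, which is why trigger cost is easy to miss.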
My suggestion would be as follows:

1. There are triggers (e.g. `person_aggregates_comment_count`) that do aggregating queries where I'm not sure why; it seems like they should be able to do it with a simple increment/decrement instead. O(1) triggers should be fine imo.
2. For all other triggers, if any remain, instead of directly updating they should set a `dirty: boolean` flag on the target row. There should be an index on `WHERE dirty` for those tables, which are then updated in a scheduled task.

Here's a full table of all triggers. :heavy_check_mark: means this trigger is probably ok, :interrobang: means this trigger is buggy and should be fixed, :x: means this trigger is too expensive and should be replaced.
trigger source | trigger name | trigger target tables | run time |
---|---|---|---|
comment | comment_aggregates_comment | comment_aggregates | :heavy_check_mark: constant, no fetches |
comment | community_aggregates_comment_count | community_aggregates | :interrobang: for some reason this trigger includes comment in its FROM query; I don't think it should. |
comment | person_aggregates_comment_count | person_aggregates | :x: O(n), fetches all comment likes of person, should do a constant incremental update instead |
comment | post_aggregates_comment_count | post_aggregates | :heavy_check_mark: constant, fetches relevant post |
comment | site_aggregates_comment_delete | site_aggregates | :interrobang: constant, no fetches, but buggy: a missing `WHERE id = 1` means it updates all aggregate rows, not one (#3704) |
comment | site_aggregates_comment_insert | site_aggregates | :interrobang: constant, no fetches, but buggy because of the missing `WHERE id = 1` (#3704) |
comment_like | comment_aggregates_score | comment_aggregates | :heavy_check_mark: constant, fetches comment |
comment_like | person_aggregates_comment_score | person_aggregates | :heavy_check_mark: constant, fetches comment |
community | community_aggregates_community | community_aggregates | :heavy_check_mark: constant, no fetches |
community | site_aggregates_community_delete | site_aggregates | :interrobang: constant, no fetches, but buggy because of the missing WHERE (#3704) |
community | site_aggregates_community_insert | site_aggregates | :interrobang: constant, no fetches, but buggy because of the missing WHERE (#3704) |
community_follower | community_aggregates_subscriber_count | community_aggregates | :heavy_check_mark: constant, no fetches |
person | person_aggregates_person | person_aggregates | :heavy_check_mark: constant, no fetches |
person | site_aggregates_person_delete | site_aggregates | :interrobang: #3704 |
person | site_aggregates_person_insert | site_aggregates | :interrobang: #3704 |
post | community_aggregates_post_count | community_aggregates | :x: O(n): fetches all comments and posts in the community |
post | person_aggregates_post_count | person_aggregates | :x: O(n): fetches all other posts and likes on posts by the person as well |
post | post_aggregates_featured_community | post_aggregates | :heavy_check_mark: constant, no fetches |
post | post_aggregates_featured_local | post_aggregates | :heavy_check_mark: constant, no fetches |
post | post_aggregates_post | post_aggregates | :heavy_check_mark: constant, no fetches |
post | site_aggregates_post_delete | site_aggregates | :interrobang: #3704 |
post | site_aggregates_post_insert | site_aggregates | :interrobang: #3704 |
post_like | person_aggregates_post_score | person_aggregates | :heavy_check_mark: constant, fetches post |
post_like | post_aggregates_score | post_aggregates | :heavy_check_mark: constant, fetches post |
site | site_aggregates_site | site_aggregates | :heavy_check_mark: constant, no fetches |
IMO all triggers on the same table could be merged. That way it should be easier to get rid of those O(n) triggers. I don't think any scheduled tasks are actually necessary.
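As a sketch of what the simple increment/decrement replacement could look like (PostgreSQL 11+ syntax; the table and column names here are illustrative, not necessarily Lemmy's actual schema):

```sql
-- Hypothetical O(1) counter trigger: increment/decrement instead of recounting.
CREATE OR REPLACE FUNCTION person_aggregates_comment_count()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE person_aggregates
           SET comment_count = comment_count + 1
         WHERE person_id = NEW.creator_id;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE person_aggregates
           SET comment_count = comment_count - 1
         WHERE person_id = OLD.creator_id;
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END $$;

CREATE TRIGGER person_aggregates_comment_count
AFTER INSERT OR DELETE ON comment
FOR EACH ROW EXECUTE FUNCTION person_aggregates_comment_count();
```

Each fire touches exactly one row of the aggregate table, so cost stays constant regardless of how many comments the person has.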
> I don't have good numbers though so I might be wrong. You can track these timings by enabling the pg_stat_statements.track='all' option.

I wish I had caught this sentence 3 weeks ago when you commented it. Without that option the triggers were hidden from pg_stat_statements output; with it enabled, the problems with these triggers would have been front and center.
The other issue not raised in this issue's comments about these triggers is local vs. remote data. The trigger for site data limits itself to posts/comments with local=true, but person_aggregates and community_aggregates include federated data, which I think raises a number of concerns about profile and community stats for remote communities. The home instance of a person or community is going to have all the data (assuming no purges going on), but remote instances will almost always underrepresent it. And, more to the point of the performance crisis/emergency with crashing servers, it is more work to do this counting for persons and communities that are not your own. I think replication of aggregates is an evolutionary path to consider: a profile already has to be copied over from the home server of a person or community, so why not the aggregate counts too?
> I don't think any scheduled tasks are actually necessary.

> These are a performance concern because all SQL triggers are synchronous.
The triggers are also per-row, not per-statement. When this issue was first opened on July 7, I viewed it as more about moving to batch processing vs. handling every single individual item. The scheduled part was less important to me than having the ability in code to turn off individual triggers and shift toward batch processing these statistic counts.

I think it is only a matter of time until a major world social event happens that many will recall in Internet history. The first was September 11, 2001, which brought down so many of the social media / messaging sites of the time that relied on real-time data; POTS networks failed in many places too. I think it should be a performance choice (parameters) for a server operator to do federation inserts in batch and, as mentioned above, local-only counting, etc. And mass data operations, for both social and performance reasons, should I think be done non-concurrently and avoid real-time API results: https://github.com/LemmyNet/lemmy/issues/3697
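PostgreSQL (10+) does support batching at the statement level via transition tables, so one bulk INSERT fires the trigger once rather than once per row. A sketch, with hypothetical table/column names:

```sql
-- Statement-level trigger using a transition table: executes once per
-- statement and aggregates all inserted rows at once (names illustrative).
CREATE OR REPLACE FUNCTION post_aggregates_comment_count_stmt()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    UPDATE post_aggregates pa
       SET comments = pa.comments + n.cnt
      FROM (SELECT post_id, count(*) AS cnt
              FROM new_comments
             GROUP BY post_id) n
     WHERE pa.post_id = n.post_id;
    RETURN NULL;
END $$;

CREATE TRIGGER post_aggregates_comment_count_stmt
AFTER INSERT ON comment
REFERENCING NEW TABLE AS new_comments
FOR EACH STATEMENT EXECUTE FUNCTION post_aggregates_comment_count_stmt();
```

For a federation batch inserting hundreds of comments, this turns hundreds of trigger executions into one grouped update per statement.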
Given the emergency/crisis of the entire Lemmy platform crashing constantly, is anyone working right now on pull requests to further optimize these triggers? I've shared code, but nobody has commented on the actual code. It is tricky to test and understand all these design issues, and I'm also beefing up testing of counts to show where they are not being accounted for. (Deleted accounts do not seem to unsubscribe, leaving server operators with no way to turn off federation? This needs confirmation and is already in a pending pull request. Also, right now the trigger for account deletion assumes a SQL DELETE when it is only an UPDATE? Again, this needs validation.)
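If account deletion is indeed a soft delete (an UPDATE of a deleted flag rather than a SQL DELETE), the decrement logic would need an UPDATE trigger along these lines. This is a sketch with assumed column names, pending the validation mentioned above:

```sql
-- Hypothetical: fire the person-delete aggregate logic on soft delete,
-- i.e. only when person.deleted flips from false to true.
CREATE TRIGGER site_aggregates_person_soft_delete
AFTER UPDATE OF deleted ON person
FOR EACH ROW
WHEN (NEW.deleted AND NOT OLD.deleted)
EXECUTE FUNCTION site_aggregates_person_delete();
```

The `WHEN` clause keeps the trigger from firing on unrelated profile updates, and from double-decrementing if the flag is written twice.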
P.S. Please excuse the quality of my writing of late, my brain is in extreme pain with communications / language processing. I am making mistakes and often highlight related issues I have not had the time to fully describe - or draw attention to previous discussions made on Lemmy communities.
> 2. For all other triggers, if any remain, instead of directly updating they should set a `dirty: boolean` flag on the target row. There should be an index on `WHERE dirty` for those tables, which are then updated in a scheduled task.
Comments are going to be huge in major social event situations. It is one massive table for the entire site, and doing updates on such a table can be expensive. An alternative is to focus on the updated/published timestamp fields: almost all activity on a Reddit-like site is concentrated on recent data, and the need for the freshest comments/posts can often be organized around the edit timestamp.
I would also point out that the two separate fields for deleted and removed could be consolidated into a single flag with multiple meanings. Right now, WHERE clauses have to include conditions for both, when a smallint, optimized enum, or character could hold values such as: a = open, d = deleted, r = removed, e = deleted + removed, f = spam filtered pending analysis, p = pending mod approval for a troubled user, q = pending mod approval for all (more discussion on Lemmy). I think there is also an opportunity to optimize the local flag and replace it with just an instance_id value (always assuming instance_id = 1 is local). There are some other performance issues regarding these triggers too: the comments table lacks denormalization of 1) the community that owns the comment, 2) the instance that is either the home of the community or originated the comment, and 3) the instance of the person and some other queried person details of the comment (bot accounts). A JOIN is often done via post to find community and person details. That said, ap_id on a comment can be parsed, and I have actually found PostgreSQL to be performant doing so, to get the instance of the comment creator (but not the home of the community / replication source).
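A sketch of the consolidated-flag idea (the type and column names here are hypothetical, not Lemmy's schema):

```sql
-- Hypothetical: one moderation-state column instead of the two booleans
-- deleted and removed.
CREATE TYPE content_state AS ENUM
    ('open', 'deleted', 'removed', 'deleted_removed',
     'spam_filtered', 'pending_mod_user', 'pending_mod_all');

ALTER TABLE comment
    ADD COLUMN state content_state NOT NULL DEFAULT 'open';

-- WHERE clauses then test a single column, e.g.:
--   SELECT ... FROM comment WHERE state = 'open';
```

A Postgres enum is stored as 4 bytes and compares cheaply, so the single-column test replaces the `NOT deleted AND NOT removed` pair everywhere.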
There are also issues lurking with accumulation of data. Moving to a batch processing system might want to consider that some instance operators may only wish to retain 60 days of fresh content vs. keeping the entire history of a community's content for search engines and local searching. The difference in performance is huge, which is why popular Lemmy servers have been crashing constantly: the amount of data in the tables entirely changes the performance characteristics. continued...
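The dirty-flag suggestion quoted above could be sketched like this (hypothetical schema; the recompute would run from a scheduled task):

```sql
-- Hypothetical dirty-flag approach for community_aggregates.
ALTER TABLE community_aggregates
    ADD COLUMN dirty boolean NOT NULL DEFAULT false;

-- Partial index so the scheduled task only scans rows needing recompute:
CREATE INDEX community_aggregates_dirty_idx
    ON community_aggregates (community_id) WHERE dirty;

-- Inside the (now cheap) trigger function, just mark the row:
--   UPDATE community_aggregates SET dirty = true
--    WHERE community_id = NEW.community_id;

-- Scheduled task: recompute flagged rows and clear the flag.
UPDATE community_aggregates ca
   SET posts = (SELECT count(*) FROM post p
                 WHERE p.community_id = ca.community_id),
       dirty = false
 WHERE ca.dirty;
```

The expensive O(n) count moves out of the write path into the periodic task, and the partial index keeps that task proportional to the number of dirty rows, not the table size.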
@phiresky nice work, shouldn't be too bad to work from that table to fix the triggers. Hopefully 🤞 we already have unit tests to make sure those aggregates are the correct values, but if there aren't, they should be added as part of a PR to fix the triggers.
> Hopefully 🤞 we already have unit tests to make sure those aggregates are the correct values
My pending pull request with emergency/urgent changes to fix crashing servers has additional testing of aggregates ready to go. Two identified problems are noted in comments in the source code: 1) deleting an account does not decrement the site's user count; 2) deleting an account does not unsubscribe from communities (decrement community follower counts).
A proposed consolidated trigger that tackles head-on the performance issues of Lemmy API delete of a post, delete of an account, etc: https://github.com/LemmyNet/lemmy/pull/3727
Here's an updated table after 0.18.3. The only remaining expensive trigger is community_aggregates_post_count, which is not expensive on INSERT, but is expensive on restore and delete.
trigger source | trigger name | trigger target tables | run time |
---|---|---|---|
comment | comment_aggregates_comment | comment_aggregates | :heavy_check_mark: constant, no fetches |
comment | community_aggregates_comment_count | community_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3739) |
comment | person_aggregates_comment_count | person_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3739) |
comment | post_aggregates_comment_count | post_aggregates | :heavy_check_mark: constant, fetches relevant post |
comment | site_aggregates_comment_delete | site_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3732) |
comment | site_aggregates_comment_insert | site_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3732) |
comment_like | comment_aggregates_score | comment_aggregates | :heavy_check_mark: constant, fetches comment |
comment_like | person_aggregates_comment_score | person_aggregates | :heavy_check_mark: constant, fetches comment |
community | community_aggregates_community | community_aggregates | :heavy_check_mark: constant, no fetches |
community | site_aggregates_community_delete | site_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3732) |
community | site_aggregates_community_insert | site_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3732) |
community_follower | community_aggregates_subscriber_count | community_aggregates | :heavy_check_mark: constant, no fetches |
person | person_aggregates_person | person_aggregates | :heavy_check_mark: constant, no fetches |
person | site_aggregates_person_delete | site_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3732) |
person | site_aggregates_person_insert | site_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3732) |
post | community_aggregates_post_count | community_aggregates | :x: O(n): fetches all comments and posts in the community |
post | person_aggregates_post_count | person_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3739) |
post | post_aggregates_featured_community | post_aggregates | :heavy_check_mark: constant, no fetches |
post | post_aggregates_featured_local | post_aggregates | :heavy_check_mark: constant, no fetches |
post | post_aggregates_post | post_aggregates | :heavy_check_mark: constant, no fetches |
post | site_aggregates_post_delete | site_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3732) |
post | site_aggregates_post_insert | site_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3732) |
post_like | person_aggregates_post_score | person_aggregates | :heavy_check_mark: constant, fetches post |
post_like | post_aggregates_score | post_aggregates | :heavy_check_mark: constant, fetches post |
site | site_aggregates_site | site_aggregates | :heavy_check_mark: constant, no fetches |
Requirements
Summary
The next best performance improvement can come from going through every table trigger, finding anything that's more complicated than a simple increment / decrement (especially those that require functions), and moving it to periodic scheduled tasks.
These are a performance concern because all SQL triggers are synchronous.
Things like community, site and person aggregates, comment counts, user counts, etc do not need to be immediately live.
Context #3188
Steps to Reproduce
See above
Technical Details
See above
Version
main
Lemmy Instance URL
main