LemmyNet / lemmy

🐀 A link aggregator and forum for the fediverse
https://join-lemmy.org
GNU Affero General Public License v3.0

Move expensive DB triggers to scheduled jobs. #3528

Open dessalines opened 1 year ago

dessalines commented 1 year ago

Requirements

Summary

The next big performance improvement can come from going through every table trigger, finding anything more complicated than a simple increment / decrement (especially those that require functions), and moving it to periodic scheduled tasks.

These are a performance concern because all SQL triggers are synchronous.

Things like community, site and person aggregates, comment counts, user counts, etc do not need to be immediately live.
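As a rough sketch of the idea, a scheduled job could recount the site aggregates in one pass instead of relying on per-row triggers (table and column names below are assumptions based on the Lemmy schema, not an actual implementation):

```sql
-- Hypothetical scheduled-task recount instead of per-row triggers.
-- Assumes site_aggregates(site_id, users, posts, comments) and local flags
-- on person/post/comment; not the actual Lemmy implementation.
UPDATE site_aggregates sa
SET users    = (SELECT count(*) FROM person  pe WHERE pe.local),
    posts    = (SELECT count(*) FROM post    p  WHERE p.local),
    comments = (SELECT count(*) FROM comment c  WHERE c.local)
WHERE sa.site_id = 1;
```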

Context #3188

Steps to Reproduce

See above

Technical Details

See above

Version

main

Lemmy Instance URL

main

phiresky commented 1 year ago

I would intuitively say that most of the reason for the slow response is not the triggers but rather the inline activity sending (#3493). The reason is that the database updates lock only single rows and are thus perfectly serializable. Basically, your vote only has to wait for other votes on the same exact post/comment, so if the vote count changes by 2 when you press the button, it only had to wait for one other vote, each of which should take less than 10 ms for the database to process.

I don't have good numbers though so I might be wrong. You can track these timings by enabling the pg_stat_statements.track='all' option.

Edit: I actually haven't looked at the non-simple-increment/decrement triggers at all, so possibly disregard the above depending on when exactly those trigger.
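For reference, a minimal sketch of enabling that tracking and reading the results (this assumes the pg_stat_statements extension is already loaded via shared_preload_libraries; the *_exec_time column names are those of PostgreSQL 13+):

```sql
-- Track statements executed inside functions/triggers, not just top-level ones.
ALTER SYSTEM SET pg_stat_statements.track = 'all';
SELECT pg_reload_conf();

-- Then look for trigger bodies among the most expensive statements.
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```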

phiresky commented 11 months ago

My suggestion would be as follows:

  1. Replace all expensive triggers with simple ones where possible. There are some triggers (e.g. person_aggregates_comment_count) that do aggregating queries and I'm not sure why; it seems like they should be able to do it with a simple increment/decrement instead. O(1) triggers should be fine imo.
  2. For all other triggers, if any remain, instead of directly updating they should set a dirty boolean flag on the target row. There should be a partial index on WHERE dirty for those tables, which are then updated in a scheduled task (see the sketch below).
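A minimal sketch of the dirty-flag approach for one aggregates table (the dirty column, the index name, and the recount query are illustrative assumptions, not the actual Lemmy schema or implementation):

```sql
-- Mark-dirty instead of recounting inside the trigger (illustrative sketch).
ALTER TABLE community_aggregates ADD COLUMN dirty boolean NOT NULL DEFAULT false;

-- Partial index so the scheduled task only scans rows that need recounting.
CREATE INDEX community_aggregates_dirty_idx
    ON community_aggregates (community_id) WHERE dirty;

-- Inside the trigger: O(1) work, just flag the row.
-- UPDATE community_aggregates SET dirty = true WHERE community_id = NEW.community_id;

-- Scheduled task: recompute only the flagged rows, then clear the flag.
UPDATE community_aggregates ca
SET posts = (SELECT count(*) FROM post p
             WHERE p.community_id = ca.community_id
               AND NOT p.deleted AND NOT p.removed),
    dirty = false
WHERE ca.dirty;
```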
phiresky commented 11 months ago

Here's a full table of all triggers. :heavy_check_mark: means this trigger is probably ok, :interrobang: means this trigger is buggy and should be fixed, :x: means this trigger is too expensive and should be replaced.

| trigger source | trigger name | trigger target table | run time |
| --- | --- | --- | --- |
| comment | comment_aggregates_comment | comment_aggregates | :heavy_check_mark: constant, no fetches |
| comment | community_aggregates_comment_count | community_aggregates | :interrobang: for some reason this trigger includes comment in its FROM query; I don't think it should |
| comment | person_aggregates_comment_count | person_aggregates | :x: O(n), fetches all comment likes of the person; should do a constant incremental update instead |
| comment | post_aggregates_comment_count | post_aggregates | :heavy_check_mark: constant, fetches relevant post |
| comment | site_aggregates_comment_delete | site_aggregates | :interrobang: constant, no fetches, but buggy: a missing WHERE id = 1 means it updates all aggregate rows, not one (#3704) |
| comment | site_aggregates_comment_insert | site_aggregates | :interrobang: constant, no fetches, but buggy because of the missing WHERE id = 1 (#3704) |
| comment_like | comment_aggregates_score | comment_aggregates | :heavy_check_mark: constant, fetches comment |
| comment_like | person_aggregates_comment_score | person_aggregates | :heavy_check_mark: constant, fetches comment |
| community | community_aggregates_community | community_aggregates | :heavy_check_mark: constant, no fetches |
| community | site_aggregates_community_delete | site_aggregates | :interrobang: constant, no fetches, but buggy because of the missing WHERE (#3704) |
| community | site_aggregates_community_insert | site_aggregates | :interrobang: constant, no fetches, but buggy because of the missing WHERE (#3704) |
| community_follower | community_aggregates_subscriber_count | community_aggregates | :heavy_check_mark: constant, no fetches |
| person | person_aggregates_person | person_aggregates | :heavy_check_mark: constant, no fetches |
| person | site_aggregates_person_delete | site_aggregates | :interrobang: #3704 |
| person | site_aggregates_person_insert | site_aggregates | :interrobang: #3704 |
| post | community_aggregates_post_count | community_aggregates | :x: O(n): fetches all comments and posts in the community |
| post | person_aggregates_post_count | person_aggregates | :x: O(n): fetches all other posts by the person, as well as likes on those posts |
| post | post_aggregates_featured_community | post_aggregates | :heavy_check_mark: constant, no fetches |
| post | post_aggregates_featured_local | post_aggregates | :heavy_check_mark: constant, no fetches |
| post | post_aggregates_post | post_aggregates | :heavy_check_mark: constant, no fetches |
| post | site_aggregates_post_delete | site_aggregates | :interrobang: #3704 |
| post | site_aggregates_post_insert | site_aggregates | :interrobang: #3704 |
| post_like | person_aggregates_post_score | person_aggregates | :heavy_check_mark: constant, fetches post |
| post_like | post_aggregates_score | post_aggregates | :heavy_check_mark: constant, fetches post |
| site | site_aggregates_site | site_aggregates | :heavy_check_mark: constant, no fetches |

IMO all triggers on the same table could be merged. That way it should be easier to get rid of those O(n) triggers. I don't think any scheduled tasks are actually necessary.
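As a sketch of what point 1 (a simple increment/decrement instead of an aggregating query) could look like for the person_aggregates_comment_count case flagged above (table and column names are assumptions based on the Lemmy schema; this is not the actual fix that later landed):

```sql
-- Sketch: keep person_aggregates.comment_count up to date with O(1) work per row.
CREATE OR REPLACE FUNCTION person_aggregates_comment_count()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    UPDATE person_aggregates
    SET comment_count = comment_count + 1
    WHERE person_id = NEW.creator_id;
  ELSIF TG_OP = 'DELETE' THEN
    UPDATE person_aggregates
    SET comment_count = comment_count - 1
    WHERE person_id = OLD.creator_id;
  END IF;
  RETURN NULL;
END
$$;

CREATE TRIGGER person_aggregates_comment_count
AFTER INSERT OR DELETE ON comment
FOR EACH ROW EXECUTE FUNCTION person_aggregates_comment_count();
```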

RocketDerp commented 11 months ago

> I don't have good numbers though so I might be wrong. You can track these timings by enabling the pg_stat_statements.track='all' option.

I wish I had caught this sentence three weeks ago when you commented it. Without that setting, the triggers were hidden from the pg_stat_statements output; with it, the problems with these triggers would have been front and center.

The other issue not raised in these comments about the triggers is local vs. remote data. The trigger for site data limits itself to posts/comments with local=true, but person_aggregates and community_aggregates include federated data, which I think raises a number of concerns about the profile and community stats shown for remote communities. The home instance of a person or community is going to have all the data (assuming no purges), but remote instances will almost always underrepresent it. And, more to the point of the performance crisis with crashing servers, it is more work to do this counting for persons and communities that are not the instance's own. I think replication of aggregates is an evolutionary path to consider: a profile already has to be copied over from the home server of a person or community, so why not the aggregate counts too?

> I don't think any scheduled tasks are actually necessary.

> These are a performance concern because all SQL triggers are synchronous.

The triggers are also per-row, not per-statement. When this issue was first opened on July 7, I viewed it more as being about moving to batch processing vs. handling every single individual item. The scheduled part was less important to me than having the ability in code to turn off individual triggers and shift toward batch processing of these statistic counts. I think it is only a matter of time until a major world social event happens that many will recall in Internet history, as September 11, 2001 did when it brought down many social media / messaging sites of the time that relied on real-time data; POTS networks failed in many places too. I think it should be a performance choice (parameters) for a server operator to do federation inserts in batch and, as mentioned above, local-only counting, etc. And mass data operations, for both social and performance reasons, should be done non-concurrently and kept out of real-time API results: https://github.com/LemmyNet/lemmy/issues/3697

Given the emergency of the entire Lemmy platform crashing constantly, is anyone working right now on pull requests to further optimize these triggers? I've shared code, but nobody has commented on the actual code. It is tricky to test and understand all these design issues, and I'm also beefing up testing of counts to show where they are not being accounted for. (Deleted accounts do not seem to unsubscribe, leaving server operators with no way to turn off their federation? Needs confirmation; already covered in a pending pull request. Also, right now the trigger for account deletion assumes a SQL DELETE when it is only an UPDATE? Again, needs validation.)

P.S. Please excuse the quality of my writing of late; my brain is in extreme pain with communications / language processing. I am making mistakes, and I often highlight related issues I have not had the time to fully describe, or draw attention to previous discussions on Lemmy communities.

RocketDerp commented 11 months ago

> 2. For all other triggers, if any remain, instead of directly updating they should set a dirty boolean flag on the target row. There should be a partial index on WHERE dirty for those tables, which are then updated in a scheduled task.

Comments are going to be huge in major social-event situations. It is one massive table for the entire site, and doing updates on such a table can be expensive. An alternative is to focus on the updated/published timestamp fields: almost all activity on a Reddit-like site is concentrated on recent data, and the need for the freshest comments/posts can often be organized around edit timestamps.
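A rough sketch of that timestamp-scoped idea, limiting a batch recount to recently active rows (this assumes the post_aggregates.newest_comment_time column; purely illustrative, not a proposed implementation):

```sql
-- Only recount comment totals for posts that saw activity in the last day.
UPDATE post_aggregates pa
SET comments = (SELECT count(*) FROM comment c
                WHERE c.post_id = pa.post_id
                  AND NOT c.deleted AND NOT c.removed)
WHERE pa.newest_comment_time > now() - interval '1 day';
```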

I would also point out that the two separate deleted and removed fields could be consolidated into one flag with multiple meanings. Right now WHERE clauses have to include conditions for both, when a smallint, optimized enum, or character could hold values such as: a = open, d = deleted, r = removed, e = deleted + removed, f = spam filtered pending analysis, p = pending mod approval for a troubled user, q = pending mod approval for all (more discussion on Lemmy). I think there is also an opportunity to optimize the local flag and replace it with just an instance_id value (always assuming instance_id = 1 is local). There are some other performance issues around these triggers too: the comment table lacks denormalization of 1) the community that owns the comment, 2) the instance that is either the home of the community or originated the comment, and 3) the instance of the person plus some other frequently queried person details (e.g. bot accounts). A JOIN through post is often needed to find community and person details. That said, ap_id on a comment can be parsed, and I have actually found PostgreSQL to be performant doing it, to get the instance of the comment creator (but not the home of the community / replication source).
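A sketch of the consolidated-flag idea (the enum type, its labels, and the status column are hypothetical and not part of the Lemmy schema):

```sql
-- One status column instead of separate deleted/removed booleans (hypothetical).
CREATE TYPE content_status AS ENUM
  ('open', 'deleted', 'removed', 'deleted_removed',
   'spam_filtered', 'pending_mod_troubled_user', 'pending_mod_all');

ALTER TABLE comment ADD COLUMN status content_status NOT NULL DEFAULT 'open';

-- WHERE clauses then need a single condition:
--   WHERE status = 'open'
-- instead of
--   WHERE NOT deleted AND NOT removed
```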

There are also issues lurking with the accumulation of data. A move to batch processing might want to consider that some instance operators may only wish to retain 60 days of fresh content, rather than keeping every piece of a community's history for search engines and local searching. The difference in performance is huge, which is part of why popular Lemmy servers have been crashing constantly: the amount of data in the tables entirely changes the performance characteristics. continued...

dessalines commented 11 months ago

@phiresky nice work; it shouldn't be too bad to work from that table to fix the triggers. Hopefully 🤞 we already have unit tests to make sure those aggregates have the correct values; if there aren't any, they should be added as part of a PR to fix the triggers.

RocketDerp commented 11 months ago

> Hopefully 🤞 we already have unit tests to make sure those aggregates have the correct values

My pending pull request with emergency/urgent changes to fix crashing servers has additional testing of aggregates ready to go. Two identified problems are noted in comments in the source code: 1) deleting an account does not decrement the site's user count; 2) deleting an account does not unsubscribe it from communities (decrement community follower counts).

RocketDerp commented 11 months ago

A proposed consolidated trigger that tackles head-on the performance issues of the Lemmy API deleting a post, deleting an account, etc.: https://github.com/LemmyNet/lemmy/pull/3727

phiresky commented 11 months ago

Here's an updated table after 0.18.3. The only remaining expensive trigger is community_aggregates_post_count, which is not expensive on INSERT, but is expensive on restore and delete.

| trigger source | trigger name | trigger target table | run time |
| --- | --- | --- | --- |
| comment | comment_aggregates_comment | comment_aggregates | :heavy_check_mark: constant, no fetches |
| comment | community_aggregates_comment_count | community_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3739) |
| comment | person_aggregates_comment_count | person_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3739) |
| comment | post_aggregates_comment_count | post_aggregates | :heavy_check_mark: constant, fetches relevant post |
| comment | site_aggregates_comment_delete | site_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3732) |
| comment | site_aggregates_comment_insert | site_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3732) |
| comment_like | comment_aggregates_score | comment_aggregates | :heavy_check_mark: constant, fetches comment |
| comment_like | person_aggregates_comment_score | person_aggregates | :heavy_check_mark: constant, fetches comment |
| community | community_aggregates_community | community_aggregates | :heavy_check_mark: constant, no fetches |
| community | site_aggregates_community_delete | site_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3732) |
| community | site_aggregates_community_insert | site_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3732) |
| community_follower | community_aggregates_subscriber_count | community_aggregates | :heavy_check_mark: constant, no fetches |
| person | person_aggregates_person | person_aggregates | :heavy_check_mark: constant, no fetches |
| person | site_aggregates_person_delete | site_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3732) |
| person | site_aggregates_person_insert | site_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3732) |
| post | community_aggregates_post_count | community_aggregates | :x: O(n): fetches all comments and posts in the community |
| post | person_aggregates_post_count | person_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3739) |
| post | post_aggregates_featured_community | post_aggregates | :heavy_check_mark: constant, no fetches |
| post | post_aggregates_featured_local | post_aggregates | :heavy_check_mark: constant, no fetches |
| post | post_aggregates_post | post_aggregates | :heavy_check_mark: constant, no fetches |
| post | site_aggregates_post_delete | site_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3732) |
| post | site_aggregates_post_insert | site_aggregates | :heavy_check_mark: fixed in 0.18.3 (#3732) |
| post_like | person_aggregates_post_score | person_aggregates | :heavy_check_mark: constant, fetches post |
| post_like | post_aggregates_score | post_aggregates | :heavy_check_mark: constant, fetches post |
| site | site_aggregates_site | site_aggregates | :heavy_check_mark: constant, no fetches |
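One hedged sketch of a direction for that remaining trigger (purely illustrative, not a proposed or actual Lemmy change, and the table/column names are assumptions): handle post delete/restore (the soft-delete flags) incrementally by reusing the per-post comment count already cached in post_aggregates, instead of recounting the whole community.

```sql
-- Hypothetical incremental handling of post delete/restore via the
-- deleted/removed flags. Not the real Lemmy trigger; names are assumptions.
CREATE OR REPLACE FUNCTION community_aggregates_post_count_visibility()
RETURNS trigger LANGUAGE plpgsql AS $$
DECLARE
  was_visible boolean := NOT (OLD.deleted OR OLD.removed);
  is_visible  boolean := NOT (NEW.deleted OR NEW.removed);
  sign        integer;
BEGIN
  IF was_visible = is_visible THEN
    RETURN NULL;  -- visibility unchanged, nothing to do
  END IF;
  sign := CASE WHEN is_visible THEN 1 ELSE -1 END;
  UPDATE community_aggregates ca
  SET posts    = ca.posts + sign,
      -- reuse the cached per-post comment count instead of counting comments
      comments = ca.comments + sign * coalesce(
        (SELECT pa.comments FROM post_aggregates pa WHERE pa.post_id = NEW.id), 0)
  WHERE ca.community_id = NEW.community_id;
  RETURN NULL;
END
$$;

CREATE TRIGGER community_aggregates_post_count_visibility
AFTER UPDATE OF deleted, removed ON post
FOR EACH ROW EXECUTE FUNCTION community_aggregates_post_count_visibility();
```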