LemmyNet / lemmy

🐀 A link aggregator and forum for the fediverse
https://join-lemmy.org
GNU Affero General Public License v3.0
13.15k stars 870 forks

Serverside Tree/Thread Order Sorting and Pageing Proposal #949

Closed ryexandrite closed 6 months ago

ryexandrite commented 4 years ago

Rationale

The current design has the server return all comments for a post thread. They may be sorted by post date or rank, but to display comments and their replies in thread order they must be sorted client-side. If the paging parameters are used, it is possible to retrieve a comment whose parent (or earlier ancestors) is not in the page. This leaves open the question of whether such comments should be displayed as top-level comments or left orphaned and not displayed.

This design leads to performance problems client-side with large threads. Our load testing produced a comment thread thousands of comments long, which surfaced a large number of issues to resolve and questions about how to handle them. Work was done to implement paging client-side by only rendering part of the tree at once, but a large amount of time is still spent sorting the comments into thread order, and we also encountered garbage collection problems. These large threads also have long loading times, as the full list of comments is retrieved and then sorted.

These performance issues are heavily dependent on the client device and its available memory. A desktop browser may not have many issues besides a slight slowdown on load, but a mobile device may well encounter memory pressure or full-load and heat-related slowdowns.

To solve these performance problems, and to enable other ways of implementing clients, it would be best to allow for tree/thread-order depth-first sorting server-side.

Solutions

To solve this, either the sorting needs to be done in Rust or in the database. The database will most likely be far more efficient at this, so I will present options for that approach and the arguments for and against them.

Recursive CTE

This is a complicated one so I'll break it down into the multiple ways this could be used.

An example query of this type is here:

WITH RECURSIVE 
node AS (
  SELECT * FROM comment_aggregates_fast c 
), tree AS (
  -- anchor: top-level comments
  SELECT
    n.*,
    1 AS depth,
    ARRAY[n.id] AS tree_path
  FROM node n
  WHERE n.parent_id IS NULL
  UNION ALL
  -- recurse: extend the parent's path with each child's own id
  SELECT
    n.*,
    t.depth + 1 AS depth,
    t.tree_path || n.id AS tree_path
  FROM tree t
  JOIN node n ON n.parent_id = t.id
)
SELECT * FROM tree
WHERE post_id = 49
ORDER BY tree_path, published;

Sadly, as of PostgreSQL 12 the query planner still can't push WHERE predicates down into WITH RECURSIVE statements, even with AS NOT MATERIALIZED specified on the clause. As such, the recursive union is done on ALL comments, not just those of the post, and this is slow, very slow. If, however, that first clause looks like ...

WITH RECURSIVE 
node AS (
  SELECT * FROM comment_aggregates_fast c WHERE post_id = 322
), ...

it can be fast. EXPLAIN ANALYZE reported ~1.5ms planning time and ~400ms execution time for our mega-thread, and 5ms or less execution time for a more reasonable thread with 300 comments.

What slowdown remains in our testing can be attributed to the fact that our mega-thread was one long chain with a depth well over 2k. Adding WHERE depth < 20 to the recursive union (which, if parameterized, would allow clients to specify a depth to retrieve) results in speedups: the same planning time as before, but ~35ms execution time as opposed to ~400ms.

One more thing: the WHERE n.parent_id IS NULL in the upper part of the union could be extended with LIMIT and OFFSET to page on top-level comments. To take this further, n.parent_id IS NULL could be exchanged for n.id = ? to retrieve a specific comment thread.
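Putting those pieces together, a parameterized variant could look something like the sketch below. This is illustrative only: the post id, page size, and depth limit are placeholder values, and the top-level page is selected in a derived table because PostgreSQL does not allow LIMIT directly on a branch of the recursive union.

```sql
WITH RECURSIVE
node AS (
  -- restrict to one post up front so the recursion only walks its comments
  SELECT * FROM comment_aggregates_fast WHERE post_id = 322
), tree AS (
  SELECT
    n.*,
    1 AS depth,
    ARRAY[n.id] AS tree_path
  FROM (
    -- page on top-level comments only
    SELECT * FROM node
    WHERE parent_id IS NULL
    ORDER BY published
    LIMIT 20 OFFSET 0
  ) n
  UNION ALL
  SELECT
    n.*,
    t.depth + 1 AS depth,
    t.tree_path || n.id AS tree_path
  FROM tree t
  JOIN node n ON n.parent_id = t.id
  WHERE t.depth < 20  -- client-specified depth limit
)
SELECT * FROM tree
ORDER BY tree_path, published;
```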

Now to discuss implementation...

In a VIEW

The current practice has been to place queries like this in views. This allows the schema to be worked with inside Diesel's ORM and the queries to be type-checked. However, views cannot be parameterized, and the query planner will not push these limiting parameters down, so the view route is a very significant performance hit. The CTE should not be placed into a view.

In a Function or Procedure

Specifying it as a function or procedure means the function either needs to be defined Rust-side, including its return type, via sql_function!() (specifying the types of all the table rows), or requires the use of .sql_query(&str), which means the table needs to be specified Rust-side as well. Any time this return-type specification is needed, it has to be kept in sync with the schema of the tables and any views.

It is potentially possible to have the function be something like

DROP FUNCTION IF EXISTS comment_fast_tree_view;
CREATE OR REPLACE FUNCTION comment_fast_tree_view(post_for int, top_limit int, top_offset int, depth_limit int)
  RETURNS TABLE (
    id int,
    depth int,
    tree_path int[]
    )
AS
$body$
WITH RECURSIVE tree AS (
  SELECT
    c.id,
    1 AS depth,
    ARRAY[c.id] AS tree_path
  FROM (
    -- page on the top-level comments of the requested post
    SELECT * FROM comment_aggregates_fast
    WHERE parent_id IS NULL AND post_id = post_for
    LIMIT top_limit OFFSET top_offset
  ) c
  UNION ALL
  SELECT
    n.id,
    t.depth + 1 AS depth,
    t.tree_path || n.id AS tree_path
  FROM tree t
  JOIN comment_aggregates_fast n ON n.parent_id = t.id
  WHERE t.depth < depth_limit
)
SELECT * FROM tree
$body$
LANGUAGE sql;

and then rejoin the result to the table, allowing the return type of the function to be controlled in a way that should be independent of the schema, so long as parent_id continues to be the column name. But the full type still needs a properly defined struct to be deserialized into.
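That rejoin could look something like this (a sketch, assuming the function signature above; the argument values are placeholders):

```sql
SELECT c.*, t.depth, t.tree_path
FROM comment_fast_tree_view(322, 20, 0, 20) t
JOIN comment_aggregates_fast c ON c.id = t.id
ORDER BY t.tree_path, c.published;
```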

As raw SQL Rust side

The next alternative is to go straight for .sql_query() Rust-side and enter the whole thing there. This allows for the binding of parameters, and potentially some slight edits to the query at runtime. It carries the same task of specifying the full return type in order to deserialize the result.

Using the Diesel QueryBuilder

The more complicated route, and the one I have spent nearly a full day hurting my brain researching, is to use the Diesel QueryBuilder via the QueryFragment trait. This would require mirroring some of how Diesel internally implements joins, but would allow the extra depth and tree_path columns to be "dynamically" (as in, appended to the static type definition) added to the table definition (though I haven't figured out the specifics of how that is done yet; it might require a macro to explicitly define a new table based on the old one). This is the only route that stays fully within Rust's type safety and would catch potential problems at compile time. It also means the code could be reused if comments or another tree-like structure gets implemented in another feature (perhaps comments on future wiki pages? who knows).

Add a path column directly to the table

These options would add the tree path as a column in the table, making it available for sorting. They avoid the need for an expensive CTE and remove the need to extend Diesel's query builder.

This obviously means a modification to an existing schema and, in some sense, constitutes denormalization of the table. Both options presented here lock the database backend to PostgreSQL, but I'm not sure that is even an issue.

Using an Array

Presuming an int[] column in the table where a comment has a path constructed of its parent IDs (i.e. [433, 478, 489] would specify that its immediate parent is comment 489, which is a child of 478, which is a child of 433, which is presumably a top-level comment on the post), this would allow ORDER BY path to sort the table into depth-first order.

While this would avoid the need for a recursive CTE, resulting in faster queries, the problem is that variable-size columns mean PostgreSQL has a more difficult time creating efficient indexes and storage ops for the table. Additionally, if you want to retrieve threads by rank, you still have a complicated query to deal with.
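As a sketch of what this option might look like (the column and index names here are illustrative, not part of the current schema; a GIN index covers containment lookups but not the sort itself):

```sql
-- denormalized path column holding ancestor ids
ALTER TABLE comment ADD COLUMN path int[] NOT NULL DEFAULT '{}';
CREATE INDEX comment_path_idx ON comment USING GIN (path);

-- depth-first thread order for one post, no recursion needed
SELECT * FROM comment
WHERE post_id = 49
ORDER BY path, id;
```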

Using An LTree

To quote a suggestion I received on this proposal from another member of our team:

One possibility for adding a path column to the table is to use Postgres's ltree extension (it comes by default, so no work needed on the Docker image) and make the path column an ltree data type. This will alleviate the indexing concerns of using an int[]-type column, since the ltree data type can use GiST. It'll also enable querying by parent/ancestor and make "moving" a branch from one place to another pretty easy; in general it has flexible querying options built in. Since comments are hierarchical trees it's probably a good fit, and limiting the depth an initial view reaches can be done fairly easily; getting "sub tree" links as a "view more" reddit-style link is also pretty easy.

For example, if our default view wants all posts and the first 2 comment levels: select * from comments where post_id = 1 and nlevel(path) < 3

And then getting the continued thread: select * from comments where id = :comment_id and 'post_id.comment_id' @> path

I believe with something like that we can avoid recursive queries. The downside is locking the database engine to Postgres, but I'm not sure that should be a concern; using ltree will have an added learning curve and require additional api/fe code changes to use properly. It should play nice with Rust if, when returning values, we cast the tree to a string or something else.

Official docs: https://www.postgresql.org/docs/current/ltree.html

My first take on a preference would be an ltree path column and reworking the queries into fragments/raw SQL. For the sake of a quick turnaround, though, the function option with the CTE also looks good. The use of views as-is should probably not be considered; I spent the better part of last weekend trying to just alter how the views work and kept hitting brick walls.

This idea may well be the "best" option.

Diesel support for the ltree datatype can be obtained from https://crates.io/crates/diesel_ltree and from there the work can be done inside ORM with relative ease.

Since comments should rarely, if ever, change their tree path, the path can be assigned on creation of the comment as simply an extension of its parent's path, with no need for triggers to maintain it.

Individual labels in a path have a limit of 256 characters, so it's entirely possible to at some point switch from serial IDs to a form of UUID with no friction. Top-level comments could just have a path of top or similar.
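As a sketch of path assignment at creation time (assuming a comment table with an ltree path column; the table shape and literal ids are placeholders — in practice the new comment's id comes from the sequence, so this would really be a trigger or a two-step insert):

```sql
-- top-level comment: path is a fixed root label plus its own id
INSERT INTO comment (post_id, parent_id, content, path)
VALUES (49, NULL, 'hello', text2ltree('top.512'));

-- reply: extend the parent's path with the new comment's id
INSERT INTO comment (post_id, parent_id, content, path)
SELECT 49, p.id, 'a reply', p.path || '513'
FROM comment p WHERE p.id = 512;
```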

In Summary

The end result of this work is that paging of comments can be done properly server-side while preserving tree integrity. A side effect is that comment threads could potentially be retrieved per comment, like Reddit currently does in some cases.

I hope I have presented the need for this type of sorting to be done efficiently server-side; if not, please ask for clarification.

I present this information and the potential solutions I've worked on because I'm ready and willing to work on the implementation, but I do not know which approach would be most welcome for incorporation into Lemmy. I look forward to your input.

dessalines commented 4 years ago

Before I get into some of your specifics I want to point out my preference for how tree-based paging should / could work. I also want to say that I thought about working all this out before launching, but figured we'd reach this point as lemmy grows organically, and that an adjacency list was good enough until then... obvi now we're reaching the point where tree-paging is gonna have to get added.

SQL

I have some previous work done on creating views, breadcrumbs, etc for comment_trees, just let me know if you want the views (but be aware I'm pretty bad at SQL and they might not be that great).

Back end

Front end


I see you're trying to do the recursive CTE with the adjacency list that lemmy currently has, and trust me, it's just not worth it to try to make that efficient in SQL.

As such the recursive union is done on ALL comments and not just those of the post, this is slow, very slow.

This is the thing I hate about CTEs: we can't even take advantage of indexes when joining back. With a bridge table, these wouldn't be necessary, nor would postgresql extensions like the one you mentioned above.

Also yes I do far prefer keeping the join / SQL logic away from diesel, and in views. If procedures are absolutely necessary, I'd be fine with that, but I doubt you'll need them when using a bridge table.

Also keep in mind that we also need a child_count (this number might possibly mean all children in that tree leaf) for the front end, bc an important part of tree paging is that it needs to know if there are more comments past max-depth.

I'll help where I can, but I might not have that much time to devote to this at the moment, bc it is obviously a huge effort, and I have federation and FE work. But let's use this issue to keep this going, because it's definitely a problem we'll need solved if lemmy wants to grow to the scale of 10k comments / post.

Thanks for doing all this work btw.

dessalines commented 4 years ago

BTW here's a better thread about bridge tables: https://stackoverflow.com/questions/192220/what-is-the-most-efficient-elegant-way-to-parse-a-flat-table-into-a-tree/

ryexandrite commented 4 years ago

Thanks for the suggestion, I hadn't actually considered the bridge table option.

I have put together some additional input on the bridge table and the other options presented.


Talking this over with @masterstur there are a few points:

The bridge-table idea in effect duplicates the functionality of Postgres's built-in ltree datatype, but comes with the overhead of table bloat, redundant data, and database triggers that can hide issues and present opportunities for deadlocks via transactions. And we get none of the nice features that come with ltrees, and still have to write complicated queries.

@masterstur [on bridge-tables]:

So, a bridge table has a couple notable downsides for the sake of pure read speed. The table is going to bloat with redundant data very quickly (furthering the uuid issue), and doing any kind of insert/updating is a relative headache. Maintaining triggers to handle all of this without a fulltime DBA is also not a decision I would personally make; they're hard to test and rarely surface bugs in a good way.

[...]

there's no real reason to use it over ltree since the major downsides of it are trivial with an ltree, and idk, PG is giving you a purpose-built data type for trees, just use it imo.

[...]

The one upside is an integer index would be quicker than a gist, but it's a wash when you compare index size after 100k comments.

Better to think of it in terms of relative ease of use in the api: with a bridge you have to manually handle the path length, with ltree you call nlevel() and commit.

@masterstur [on indexes with CTE's]:

If the main issue is index usage, then it's solvable by both ltree and cte.

[...]

It's not a flaw of using CTEs but of how the query is constructed (so, I also disagree with keeping sql away from diesel; pick 1: sound db design, or not using sql).

People kinda get the boogeyman over CTEs because of old hat takes, or from using other engines. They're not my first choice, but I use them in analytics reporting on tables w/ billions of rows and have query costs at 2k, while we're trying to fix query costs of 2mil here. Generally it's only an "optimization fence" if you have an awkward query to begin with.

@masterstur [on triggers]:

as a minor nit, im not a big fan of triggers in this context if all implications aren't really understood. each one creates a new transaction, and if your parent call is also using a transaction and things get long running (because you don't have a FT dba), you can still get deadlocks and that sucks


You do have a point that the path information should probably be its own table however.

create table comment_bridge (
  comment_id (UUID|INT) NOT NULL REFERENCES comment(id),
  tree_path ltree NOT NULL,
  primary key (comment_id)
);
create index comment_path_gist_idx on comment_bridge using GIST (tree_path);

would allow for an easily joinable table. Moving the ltree off table helps keep the heavier work load off the table and can help prevent bloating in the index.
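For illustration, fetching a post's comments in thread order then becomes a plain join against the bridge table (a sketch against the comment_bridge DDL above; the comment table's columns are assumed):

```sql
SELECT c.*, b.tree_path
FROM comment c
JOIN comment_bridge b ON b.comment_id = c.id
WHERE c.post_id = 49
ORDER BY b.tree_path;
```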

@masterstur on this issue:

if you move the ltree out, it helps keep the probably heavy workload off the main table, useful if it's a table with a lot of writes/updates.

particularly, the gotcha is table/idx bloat and fragmentation. one thing I don't like about pg is updates mapping to a delete/insert under the hood, so each update contributes to bloat. if you have a lot of updating/inserting on comments, you naturally bloat the index on path, and the other way around; it helps to keep them apart. or the most common scenario: you have a table that's largely static except 1-2 cols that update constantly.

We're on PG12, so the upside is we have access to REINDEX, which can concurrently rebuild indexes in place, so that makes the index bloat issue pretty trivial to handle. table bloat you kinda just have to live with unless you set up a system or migrations to create new tables and copy the data over.


Notes on acquiring the child_count

With a CTE this gets expensive and is best done for one comment at a time. The process is essentially another join, not depth-limited, grouping by the path and counting.

With an ltree this can be quite simple and quite fast:

select t.comment_id, t.tree_path, ct.child_count
from (
  -- count every descendant of each comment (any depth), excluding the comment itself
  select c.comment_id, count(ch.tree_path) as child_count from comment_bridge c
  left join comment_bridge ch on ch.tree_path <@ c.tree_path and ch.tree_path != c.tree_path
  group by c.comment_id
) ct join comment_bridge t on ct.comment_id = t.comment_id;

would show the number of children for each comment.

dessalines commented 4 years ago

I can get on board with that. ltrees are new to me so I'm unfamiliar with them, but it looks like they fit our case. I'll have to do some learnin on how to query / insert with them.

In this case tho, wouldn't it make sense to just add the ltree column directly to the comment table? I don't think it costs us anything to do so; pushing that column to a whole nother table doesn't make sense to me... one less trigger / API and federated insert to handle.

Nice, I like the child count fetching too. Good thing there's that rust-diesel-ltree, so we can hopefully still use views and select's with diesel.

One question I have is: say we query a view that gives us (comment_id, child_count, ltree, ... everything else), how do we read that ltree column to efficiently build the tree in rust?

build-a-dang-train commented 4 years ago

Hi @dessalines, wanted to let you know that we are going to take a crack at implementing the ltree proposal (in a feature branch based off of the lemmy upstream), because it's functionality that we would like to have in place before we start bolting more things onto comments (such as pinned comments).

dessalines commented 2 years ago

Been working on this for a bit. Some TODOs:

dessalines commented 2 years ago

Fixed by #2362

Nutomic commented 1 year ago

Don't see any reason to reopen a solved issue.

dessalines commented 1 year ago

It's not solved; I'm only correctly paging nested comments, not top-level ones. That still needs a fix, which is the reason for my 300-max comment PR.

Ryex commented 1 year ago

Move the tree path into its own table. We pointed out back when we first made the proposal that keeping it on the comment table would lead to "table/idx bloat and fragmentation". It also puts the heavy workload on a large table.

Use a light table with the comment IDs and the tree path for the main index. Use the tree-path table, and perhaps a second separate light rankings table, to sort and pre-select all the comment ids you need BEFORE joining the heavy data of the comments table. That will significantly reduce the workload and let the PostgreSQL query planner optimize. You will also stop index bloat and solve the comment depth issue.
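A sketch of that shape, following the comment_bridge example from earlier in the thread (the table, column, and path-label names are assumptions, as are the depth and page-size values):

```sql
WITH page AS (
  -- cheap pass over the light path table: pick the comment ids for this page
  SELECT b.comment_id, b.tree_path
  FROM comment_bridge b
  WHERE b.tree_path <@ 'post_49'::ltree
    AND nlevel(b.tree_path) < 20
  ORDER BY b.tree_path
  LIMIT 300
)
-- only now touch the heavy comment table, for just those ids
SELECT c.*, page.tree_path
FROM page
JOIN comment c ON c.id = page.comment_id
ORDER BY page.tree_path;
```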

Nutomic commented 6 months ago

@dessalines It's not clear what else needs to be done to resolve this issue, so please open a new one with any remaining tasks.