Scaling federation - Githubissues

Nutomic commented 1 year ago

Yesterday I posted an announcement telling admins of large instances that they need to increase the "federation worker count". These workers are needed to send outgoing federated actions. Since then I made the same adjustment on lemmy.ml, and had to increase the worker count up to 360.000. Luckily this isn't causing any problems yet, but it points to a scaling limitation which will likely become important in the future.

To understand this limitation it's important to know how federation in Lemmy communities works. Let's say a user from sopuli upvotes a comment in !memes@lemmy.ml. This upvote action is sent via ActivityPub to the lemmy.ml server, which forwards it to all instances where at least one user follows the memes community. The same happens for all other actions like creating or editing posts/comments, mod actions, and so on. The problem is that there are lots of these actions (particularly votes), and they need to be forwarded to lots of different servers. For example, the recent top posts in /c/memes have around 1500 upvotes. Let's assume that users from 100 different instances follow the community. Then federating the votes for this single post requires 1500 * 100 = 150.000 HTTP POST requests to other servers. On top of this are requests to federate comments and comment votes, which likely reach a similar magnitude.
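As a rough illustration of the arithmetic above, the fan-out can be written as one request per action per following instance. The function name is hypothetical and only models the estimate; it is not Lemmy code.

```python
def forwarded_requests(actions: int, follower_instances: int) -> int:
    """Estimate outgoing HTTP POSTs: each action is forwarded once
    to every instance that follows the community."""
    return actions * follower_instances


# Figures from the example: 1500 upvotes, 100 following instances.
print(forwarded_requests(actions=1500, follower_instances=100))  # 150000
```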

Here are some possible workarounds and solutions:

Don't federate votes. This means users would only see votes from other users on their own instance. It's obviously far from ideal, but can be a quick emergency solution to prevent federation from breaking entirely.
Migrate large communities away from lemmy.ml. This requires a lot of effort as users have to subscribe to new communities manually. It is also unlikely to be a permanent solution, as other servers will also get overloaded sooner or later.
Instead of federating individual votes, send the aggregate number of votes during the last hour.

The last option seems to be preferable, but is not easy to implement. Afaik there is no prior example of sending aggregate data over ActivityPub, so it would require an extension which would be incompatible with other platforms. It might also be necessary to rewrite the way post ranking is calculated. On the other hand this could be an improvement for privacy, as other instances don't see which particular user upvoted or downvoted a post.
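A minimal sketch of what such aggregation could look like, assuming votes are collected for some window (for example an hour) and then collapsed into per-post totals. The `Vote` type and `aggregate_votes` function are hypothetical; no such ActivityPub extension exists in the thread.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    post_id: int
    user: str
    score: int  # +1 for an upvote, -1 for a downvote


def aggregate_votes(votes):
    """Collapse individual votes into {post_id: (upvotes, downvotes)}.

    User names are dropped, so a receiving instance only sees totals.
    """
    totals = defaultdict(lambda: [0, 0])
    for vote in votes:
        if vote.score > 0:
            totals[vote.post_id][0] += 1
        elif vote.score < 0:
            totals[vote.post_id][1] += 1
    return {post_id: tuple(counts) for post_id, counts in totals.items()}


window = [Vote(1, "alice", 1), Vote(1, "bob", 1), Vote(1, "carol", -1), Vote(2, "dave", 1)]
print(aggregate_votes(window))  # {1: (2, 1), 2: (1, 0)}
```

With this shape, each following instance would receive one summary per window instead of one request per vote.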

TailyFair commented 1 year ago

Is aggregated votes sending secure? Is it possible that some bad actor instances would send large fake counts?

dadino commented 1 year ago

If sending out the requests is the problem for lemmy.ml, you could send a single request (instead of 100) to a separate worker, on a separate server, that then forwards it to the 100 instances. You could create a load balancer that accepts multiple workers and decides where to send the single request, per request.

simonsan commented 1 year ago

would it be possible to reverse the process? i.e. instead of sending post requests when new votes arrive, to aggregate them and make them available as a subscription on the server, so that they can be retrieved via GET requests from the respective instance?

choucavalier commented 1 year ago

I am not particularly knowledgeable, but could Shared Inbox (part of the ActivityPub protocol) be of help?

Other people seem to have the same issue

dessalines commented 1 year ago

It sounds like a lot of requests, but maybe it isn't a big problem in terms of CPU or network. Rather than aggregating requests, perhaps we just need to make sure our federation job queue is efficient, and if it isn't, possibly use a different one.

rlhennig commented 1 year ago

IMO, aggregation is the best route to take here. This is not actually a new problem, and aggregation is how it's done when disseminating updates between routers running BGP on the Internet. (I'm a network architect, so that's how I think of things.) OSPF, another routing protocol, also uses aggregation. When you have hundreds of thousands of updates, sending each one just isn't efficient. Things are going to tip over at some point if you do it that way. It may be more work to refactor things, but anything else I think is--at best--just buying you time.

calculuschild commented 1 year ago

Are votes being updated immediately across all federated instances? If so, is live updating of vote counts even necessary? Aggregating votes every hour would help, but an hour lag seems like a lot.

Why not just retrieve votes from the hosting instance upon the thread being accessed by each user? Surely updating votes on page view will involve fewer requests than sending out a wave of updates on every vote.

dadino commented 1 year ago

> Are votes being updated immediately across all federated instances? If so, is live updating of vote counts even necessary? Aggregating votes every hour would help, but an hour lag seems like a lot.
>
> Why not just retrieve votes from the hosting instance upon the thread being accessed by each user? Surely updating votes on page view will involve fewer requests than sending out a wave of updates on every vote.

I guess votes are needed to sort posts. Even 1 bundle per minute would drastically reduce the number of requests, while maintaining a quasi-live update of the content.
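To put a rough number on the per-minute bundle idea, here is a small sketch that groups vote timestamps into fixed windows, with one bundle sent per non-empty window. The function name and the evenly spread timestamps are assumptions made for illustration.

```python
def bundle_by_window(timestamps, window_seconds=60):
    """Group event timestamps (in whole seconds) into fixed windows.

    Each non-empty window becomes one bundle, i.e. one request per target.
    """
    bundles = {}
    for ts in timestamps:
        bundles.setdefault(ts // window_seconds, []).append(ts)
    return bundles


# Assumption: 1500 votes spread evenly over 4 hours (one every ~9.6 s).
timestamps = [i * 96 // 10 for i in range(1500)]
bundles = bundle_by_window(timestamps)
print(len(timestamps), "individual requests per target")  # 1500
print(len(bundles), "bundled requests per target")        # 240
```

Under that assumption, each target instance would receive 240 requests instead of 1500 for the same post. Busier posts would gain more, since the bundle count is capped by the number of minutes.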

DomiStyle commented 1 year ago

> The last option seems to be preferable, but is not easy to implement. Afaik there is no prior example of sending aggregate data over ActivityPub, so it would require an extension which would be incompatible with other platforms. It might also be necessary to rewrite the way post ranking is calculated. On the other hand this could be an improvement for privacy, as other instances don't see which particular user upvoted or downvoted a post.

To suggest a fourth option: I don't think aggregation is necessary as much as batching is. You could still send each vote individually in a single request but handle them when there is less server load. That way fake votes are less of an issue.

I'm not entirely sure how ActivityPub works but I assume it is legal to respond with a 429 or a 503 and a Retry-After header?

That way the source server could send updates immediately, if the target is overloaded it sends a Retry-After header and the source server batches all updates for that target server together until the time expires.
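A minimal sketch of that flow, assuming the `Retry-After` value has already been parsed into seconds and that a target accepts a batch of updates in one request. `TargetQueue`, the `send` callback, and the 60 second fallback are all hypothetical and not part of Lemmy or ActivityPub.

```python
class TargetQueue:
    """Per-target outbox that holds updates while the target asks us to back off."""

    DEFAULT_BACKOFF = 60  # assumed fallback when no Retry-After value is given

    def __init__(self):
        self.pending = []
        self.retry_at = 0.0

    def submit(self, activity, now, send):
        """Queue an activity and try to deliver everything pending.

        `send(batch)` must return (status_code, retry_after_seconds_or_None).
        Returns True if the batch was delivered, False if it is still queued.
        """
        self.pending.append(activity)
        if now < self.retry_at:
            return False  # still inside the backoff window, keep batching
        status, retry_after = send(list(self.pending))
        if status in (429, 503):
            delay = retry_after if retry_after is not None else self.DEFAULT_BACKOFF
            self.retry_at = now + delay
            return False
        self.pending.clear()
        return True
```

While the target is healthy, every update goes out immediately. Once it answers 429 or 503, updates accumulate and are delivered together on the first submit after the wait expires. A real implementation would also need a timer to flush the queue when no new activity arrives.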

Could also add prioritization for important events like posts and comments and push votes to a later date when load should be lower. I think having the votes arrive reliably is more important than having them update live. Reddit also does not update votes live.
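One way such prioritization could be sketched is a priority queue in which posts and comments are always dequeued before votes, keeping arrival order within each class. The class name and the priority table are assumptions for illustration only.

```python
import heapq
import itertools

# Assumed priorities: lower number is sent first.
PRIORITY = {"post": 0, "comment": 0, "vote": 1}


class OutgoingQueue:
    """Outgoing activity queue that sends content before votes."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order per priority

    def push(self, kind, payload):
        priority = PRIORITY.get(kind, 1)  # unknown kinds are treated like votes
        heapq.heappush(self._heap, (priority, next(self._order), kind, payload))

    def pop(self):
        _, _, kind, payload = heapq.heappop(self._heap)
        return kind, payload

    def __len__(self):
        return len(self._heap)


queue = OutgoingQueue()
queue.push("vote", "a")
queue.push("post", "b")
queue.push("vote", "c")
queue.push("comment", "d")
print([queue.pop() for _ in range(len(queue))])
# [('post', 'b'), ('comment', 'd'), ('vote', 'a'), ('vote', 'c')]
```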

I just briefly glanced over the ActivityPub spec but there seems to be a collection type for likes, is it possible to use this at least for the votes? https://www.w3.org/TR/activitypub/#likes

Nutomic commented 1 year ago

It seems like this is not really a problem like I thought because these send jobs are very lightweight. Probably the solution is to remove the worker count setting so that unlimited workers can be created on demand.

DomiStyle commented 1 year ago

@Nutomic Is there a different issue to follow that's currently hindering federation then? Federating instances are missing comments and votes.

For example, here's a random post on !technology@lemmy.ml that's 4 hours old: https://lemmy.ml/post/1250165

Here's how it looks on different instances:

Instance	Comments	Votes
lemmy.ml	13	95
beehaw.org	6	22
lemmy.world	11	64
sh.itjust.works	10	45

For comparison, here's a random post from !technology@beehaw.org that's also 4 hours old: https://beehaw.org/post/548636

Instance	Comments	Votes
beehaw.org	70	59
lemmy.ml	63	20
lemmy.world	66	160
sh.itjust.works	62	127

While the votes being different is not such a huge deal, missing comments is absolutely a huge issue.

Kryptortio commented 1 year ago

Does Lemmy utilize some kind of swarm sharing? I'm thinking that scalability could increase a lot if when you ask one instance for an update and it happens to have updates from other instances that you don't have then it could send those as well. If they are aggregated you could check the timestamps to determine if you have collected the entire timeline.

You would need to send more data in the initial request to let the instance know what data you need but the total amount of requests could be greatly reduced and the load of sharing all the data could be distributed across all instances.

Nutomic commented 1 year ago

@DomiStyle I'm not sure what the reason is, but it's certainly worth investigating. A possibility would be instance blocks or user bans. It could also be networking problems, or a software bug. There is also this issue which means that activities will get lost during restart.

@Kryptortio Federation uses POST requests, except for explicit user requests to fetch a remote object (e.g. searching a community URL). So there is no automatic "asking other servers". It might be possible to implement something like that, but it would require major changes to federation logic. I suggest you read the Lemmy federation docs and the ActivityPub standard to get a better understanding of how it all works.

Nutomic commented 1 year ago

Closing in favor of https://github.com/LemmyNet/lemmy/issues/3121

LemmyNet / lemmy

Scaling federation #3062