LemmyNet / lemmy

🐀 A link aggregator and forum for the fediverse
https://join-lemmy.org
GNU Affero General Public License v3.0

[Bug]: Federation throughput in a synchronous manner is limited to network distance #4529

Closed ticoombs closed 1 month ago

ticoombs commented 7 months ago

Requirements

Summary

Problem: Activities are processed sequentially, but many require external data to be validated or queried that doesn't come with the request. Server B -> A says "here is an activity". That request can be a like, a comment, or a new post. For a new post, Server A must query the linked URL in order to show the post metadata (such as subtitle or image).

Every one of these outbound requests made by the receiving server blocks the processing of subsequent activities.

Actual Problem

So every activity that results in a remote fetch delays subsequent activities. If activities arrive at a rate of more than 1 per 0.6 s, servers physically cannot and will never be able to catch up. As such, our decentralised solution to a problem now requires a low-latency network. Without intervention, this effectively forces every server to exist in only one region: EU or NA or APAC (etc.) (or nothing will exist in APAC, and that will make me sad). To combat this we need to streamline activities and how Lemmy handles them.
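To make that arithmetic concrete, here is a back-of-the-envelope model (not Lemmy code; the 0.6 s figure is the NL -> AU round trip from this report, and the 2/s arrival rate is an illustrative assumption):

```rust
// Back-of-the-envelope model: sequential delivery is capped by round-trip time.

/// Maximum activities/second a receiver can process when every activity
/// costs at least one blocking round trip of `rtt_secs`.
fn max_sequential_rate(rtt_secs: f64) -> f64 {
    1.0 / rtt_secs
}

/// Backlog growth per hour when activities arrive at `incoming_rate`/s
/// but are drained at only `max_sequential_rate(rtt_secs)`/s.
fn backlog_growth_per_hour(incoming_rate: f64, rtt_secs: f64) -> f64 {
    (incoming_rate - max_sequential_rate(rtt_secs)).max(0.0) * 3600.0
}

fn main() {
    // With a 0.6 s RTT the ceiling is ~1.67 activities/s, regardless of
    // how fast the server hardware is.
    println!("max rate: {:.2}/s", max_sequential_rate(0.6));
    // At a hypothetical 2 incoming activities/s the queue grows by ~1200
    // activities per hour and the receiver can never catch up.
    println!("backlog/hour: {:.0}", backlog_growth_per_hour(2.0, 0.6));
}
```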

Steps to Reproduce

  1. Have a Lemmy server in NL send activities faster than 1 request every 0.6 seconds to a Lemmy server in Australia.
  2. New Post activities affect activity processing the most / take the longest, so they are the best way to validate the PoC.

Technical Details

Trace 1:

Lemmy has to verify that a user is valid, so it connects to their server for information. AU -> X costs 0.6 s plus the time for the server to respond, 2.28 s in total, and that is all that happened in this span.

- 2.28s receive:verify:verify_person_in_community: activitypub_federation::fetch: Fetching remote object http://server-c/u/user
- request completes and closed connection

Trace 2:

Similar to the previous trace, but after it verified the user, it then had to do another from_json request to the instance itself. (No caching here?) As you can see, of the 0.74 s the server on the other end responded super fast (0.14 s), but the handshake + travel time eats up the rest.

- 2.58s receive:verify:verify_person_in_community: activitypub_federation::fetch: Fetching remote object http://server-b/u/user
- 0.74s receive:verify:verify_person_in_community:from_json: activitypub_federation::fetch: Fetching remote object http://server-b/
- request continues

Trace 3:

Fetching external content. I've seen external servers take upwards of 10 seconds to return data, especially because whenever a fediverse link is shared, every server refreshes its own copy of the metadata. As such, you basically create a mini-DoS whenever you post something.

- inside a request already
- 4.27s receive:receive:from_json:fetch_site_data:fetch_site_metadata: lemmy_api_common::request: Fetching site metadata for url: https://example-tech-news-site/bitcoin-is-crashing-sell-sell-sell-yes-im-making-a-joke-here-but-its-still-a-serious-issue-lemmy-that-is-not-bitcoin
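One common mitigation for this pattern, sketched below with hypothetical names (this is not what Lemmy 0.19 does), is to acknowledge the activity first and run the slow external metadata fetch off the receive path:

```rust
// Sketch (not Lemmy's actual code): persist the post and acknowledge the
// activity first, then fetch slow external metadata on a background thread,
// so a 4+ second upstream site never blocks the federation receive path.
use std::thread;
use std::time::Duration;

/// Stand-in for an HTTP fetch that can take many seconds in the real world.
fn fetch_site_metadata(url: &str) -> String {
    thread::sleep(Duration::from_millis(50));
    format!("metadata for {url}")
}

/// Handle a "new post" activity: store it immediately, enrich it later.
fn receive_new_post(url: String) -> thread::JoinHandle<String> {
    // 1. Persist the post and return 200 to the sender here (omitted).
    // 2. Defer the metadata fetch so it cannot delay the next activity.
    thread::spawn(move || fetch_site_metadata(&url))
}

fn main() {
    let handle = receive_new_post("https://example.com/article".to_string());
    // The receive path is already free; the fetch completes in the background.
    let meta = handle.join().unwrap();
    println!("{meta}");
}
```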

Trace 4:

Sometimes a lemmy server takes a while to respond for comments.

- 1.70s receive:community: activitypub_federation::fetch: Fetching remote object http://server-g/comment/09988776

Version

0.19.3

Lemmy Instance URL

No response

phiresky commented 6 months ago

Maybe the way forward is to purely implement only the fully-parallel sending (limited to N=10 in-flight per instance) and go through every activity type and make sure we can make them commutative (basically just skip activities that have been overwritten already; points 5+6 from my list above).
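A minimal sketch of that bounded-concurrency idea, using plain threads instead of Lemmy's async tasks (the names and queue shape are illustrative, not the actual implementation):

```rust
// Simplified sketch of per-instance parallel sending with a bounded number
// of in-flight requests (the "N=10 in-flight" idea): a fixed pool of workers
// drains a shared queue, so at most `max_inflight` sends run concurrently.
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for an HTTP POST to the remote instance's inbox.
fn send_activity(activity: u64) -> u64 {
    activity
}

/// Deliver `activities` with at most `max_inflight` concurrent sends.
fn deliver_parallel(activities: Vec<u64>, max_inflight: usize) -> Vec<u64> {
    let queue = Arc::new(Mutex::new(activities.into_iter()));
    let (tx, rx) = mpsc::channel();
    let mut workers = Vec::new();
    for _ in 0..max_inflight {
        let queue = Arc::clone(&queue);
        let tx = tx.clone();
        workers.push(thread::spawn(move || loop {
            // Take the next activity while holding the lock only briefly,
            // then release it before the (slow) network send.
            let next = queue.lock().unwrap().next();
            match next {
                Some(a) => tx.send(send_activity(a)).unwrap(),
                None => break,
            }
        }));
    }
    drop(tx); // channel closes once every worker has finished
    let mut done: Vec<u64> = rx.iter().collect();
    for w in workers {
        w.join().unwrap();
    }
    done.sort_unstable(); // completion order is nondeterministic
    done
}

fn main() {
    let delivered = deliver_parallel((0..100).collect(), 10);
    println!("delivered {} activities", delivered.len());
}
```

Note that this shape alone loses ordering guarantees, which is exactly why the commutativity discussion below matters.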

phiresky commented 6 months ago

> We could repeat to split by post_id, but even that may not be enough.

I kinda disagree with this part though. We can easily move to post_id (community_id not needed except for non-post related actions) and it's very unlikely that we will ever need more. Even if we do, using coalesce(comment_thread_id, post_id, community_id) would still be a possible later change.
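That coalesce rule can be sketched as a plain function (illustrative types; a real key would also namespace the three id spaces so a comment_thread_id can't collide with a post_id):

```rust
/// Ordering key for an activity: the most specific scope available wins,
/// mirroring coalesce(comment_thread_id, post_id, community_id). Activities
/// sharing a key must stay sequential; everything else can run in parallel.
fn queue_key(
    comment_thread_id: Option<i64>,
    post_id: Option<i64>,
    community_id: i64,
) -> i64 {
    comment_thread_id.or(post_id).unwrap_or(community_id)
}

fn main() {
    assert_eq!(queue_key(Some(7), Some(3), 1), 7); // comment thread wins
    assert_eq!(queue_key(None, Some(3), 1), 3);    // then the post
    assert_eq!(queue_key(None, None, 1), 1);       // community-level fallback
    println!("coalesce ordering key ok");
}
```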

Nutomic commented 6 months ago

> I kinda agree, and it's what I thought too before I read your PR code and realized all those complexities, and that it's not really simpler. You still need a solution for what happens during a server restart or crash so that no activities are lost. Right now you have the same sequential processing per community due to the per-community queue, and otherwise you have the issue of activities being processed out of order. To fix that you (I think) need the same data structure I described above (it could also be on the receiving side, but that's not really any better).

Not true, my PR doesn't have any per-community queue. And handling restarts or crashes works just like before, using last_activity_id (I still need to find a way to update that correctly). If an instance crashes or restarts, it will resend some of the same activities, but those duplicates already get rejected by Lemmy on receipt, so it's fine. And activities are put in the correct order on the receiving side.

The only problem now is with configuration: having 5 workers per instance could mean 10 instances × 5 workers = 50 workers, or 1000 × 5 = 5000. So a shared worker pool for all instances would be better, so that you can configure exactly 50 workers to be active at any time. Anyway, the PR implementation works and is better than what we have now.

> Maybe the way forward is to purely implement only the fully-parallel sending (limited to N=10 in-flight per instance) and go through every activity type and make sure we can make them commutative (basically just skip activities that have been overwritten already; points 5+6 from my list above).

I agree this is better than going by published timestamp. It doesn't require a complicated queue and works even if an activity is out of order by more than a second.

> I kinda disagree with this part though. We can easily move to post_id (community_id not needed except for non-post related actions) and it's very unlikely that we will ever need more. Even if we do, using coalesce(comment_thread_id, post_id, community_id) would still be a possible later change.

Sure it's possible, but I believe it would result in much more complex code than the above. And I don't see any benefit that would justify the extra complexity.

phiresky commented 6 months ago

> Not true, my PR doesn't have any per-community queue.

Sorry, I misspoke, you need a per-something queue. In your case you added a per-instance queue on the receiving side, which means you solve the network latency problem (this issue, point 2 above) but not the internal latency issue (the one db0 had, point 3 above). If you wanted to solve 3 you'd have to split the receiving queue further while still keeping sequentiality somewhere.

> And handling restarts or crashes works just like before.

I was talking about crashes or restarts on the receiving side. You respond with 200 before actually having processed anything, which means the sender thinks everything has been delivered and will never resend it. If the receiver crashes with many activities only in memory, those are lost. That's why I don't think any of those receiving-side mechanisms should be added.

> Sure it's possible, but I believe it would result in much more complex code than the above. And I don't see any benefit that would justify the extra complexity.

My main point is that you already need that in your PR if you want to solve all the problems we have, since compared to 0.19 you don't remove any sequentiality at all, just network latency. Whether on the sending or the receiving side.

Nutomic commented 6 months ago

Right, I also think that the receiving queue is not such a good idea; it's better to change the individual activity handlers to check timestamps, so that older edits or votes are ignored. Then the sending side should work in parallel without problems.

Nutomic commented 6 months ago

Thinking about this more, I don't think it's possible to handle all activities correctly when they are received in the wrong order. It's fine for post or comment edits, as we can read the existing post and compare timestamps. But imagine you upvote a post and then immediately undo the vote. If another instance receives them in the wrong order, it would see Undo/Vote first and then Vote. The undo would do nothing, as the corresponding vote doesn't exist yet. Then the vote gets received and stored in the database. So there will be one vote counted wrong, and I think this action is rather common. The same would happen if you remove a post and immediately restore it, or add a mod and immediately remove him again, or sticky and then unsticky a post. Handling that for each activity type separately would get too complex, so I don't think there is any other way than a time-based queue for incoming activities.

phiresky commented 6 months ago

> it would be Undo/Vote first and then Vote.

Right now maybe, but those could be changed, right? For example, PostLike::remove currently deletes the row, but it could be changed to update(post_like).where(new_published.gt(published)).set(score.eq(0)).set(published.eq(...)). This would mean unvotes remain in the DB, but that shouldn't be an issue.

The same goes for post remove + restore. A post update can't use the published field, but it can use the updated field (or a new field): a post remove would only go through if the published timestamp of the remove is larger than the updated timestamp of the post.

I would consider these part of what I meant above with making them commutative. It doesn't seem too complex to me, though it would probably require thinking through every case individually in detail. It seems like it should be possible to get through all or most by just comparing timestamps.
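A toy model of that timestamp guard, assuming illustrative types rather than Lemmy's actual schema, shows Vote and Undo/Vote converging to the same state regardless of arrival order:

```rust
// Sketch of the suggestion above: keep unvotes as rows with score 0 and only
// apply a write if its timestamp is newer ("last writer wins"), which makes
// Vote and Undo/Vote commutative. Types are illustrative, not Lemmy's schema.

#[derive(Debug, PartialEq)]
struct PostLike {
    score: i8,      // 1 = upvote, -1 = downvote, 0 = undone
    published: u64, // timestamp of the activity that set this state
}

/// Apply a vote (or an undo, encoded as score 0) only if it is newer than
/// the currently stored state.
fn apply(state: &mut Option<PostLike>, score: i8, published: u64) {
    match state {
        Some(existing) if published <= existing.published => {} // stale, skip
        _ => *state = Some(PostLike { score, published }),
    }
}

fn main() {
    // Out-of-order arrival: the Undo (t=2) arrives before the Vote (t=1).
    let mut out_of_order = None;
    apply(&mut out_of_order, 0, 2); // Undo/Vote first
    apply(&mut out_of_order, 1, 1); // original Vote arrives late: skipped
    // In-order arrival for comparison.
    let mut in_order = None;
    apply(&mut in_order, 1, 1);
    apply(&mut in_order, 0, 2);
    // Both orders converge on "no vote counted".
    assert_eq!(out_of_order, in_order);
    println!("converged: {:?}", out_of_order);
}
```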

Nutomic commented 6 months ago

You're right, that should work.

Fmstrat commented 4 months ago

Two questions (hopefully not dumb):

Nutomic commented 1 month ago

Fixed in https://github.com/LemmyNet/lemmy/pull/4623