LemmyNet / lemmy

🐀 A link aggregator and forum for the fediverse
https://join-lemmy.org
GNU Affero General Public License v3.0
12.95k stars, 859 forks

Relax timeout for sending activities #4864

Closed Nothing4You closed 3 days ago

Nothing4You commented 2 weeks ago

Lemmy treats timeouts during activity sending as retryable errors. Retrying the same activity after a timed-out submission is frequently enough, but allowing the receiving side more time for synchronous processing should reduce the number of retries needed and improve compatibility.

Some ActivityPub software, such as Mastodon, queues received activities for asynchronous processing, which allows it to respond to activity submissions immediately. Other software, such as Lemmy or Hubzilla, processes activities synchronously before returning a response.

ActivityPub does not specify which timeouts to use: https://github.com/w3c/activitypub/issues/365
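
To make the difference between the two processing models concrete, here is a minimal sketch using only the Rust standard library. It is not taken from Mastodon, Lemmy, or Hubzilla: the function names `receive_async` and `receive_sync` and the status codes they return are assumptions chosen for illustration.

```rust
use std::sync::mpsc::Sender;

/// Asynchronous model (hypothetical): enqueue the activity for a background
/// worker and answer right away. The sender gets a response without waiting
/// for any processing. Status codes are assumptions for this sketch.
fn receive_async(queue: &Sender<String>, activity: String) -> u16 {
    match queue.send(activity) {
        Ok(()) => 202, // accepted, will be processed later
        Err(_) => 503, // the worker side of the queue is gone
    }
}

/// Synchronous model (hypothetical): process the activity, including any
/// slow object resolution, before answering. The sender has to keep the
/// request open for as long as `process` runs.
fn receive_sync<F>(activity: &str, process: F) -> u16
where
    F: Fn(&str) -> Result<(), String>,
{
    match process(activity) {
        Ok(()) => 200,
        Err(_) => 400,
    }
}
```

In the synchronous case the response time is bounded only by how long `process` takes, which is why the sender's timeout matters for such receivers.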

Nutomic commented 1 week ago

Then we also need to increase the timeout for incoming activities. Though that change should be made in a later release for better backwards compatibility.

https://github.com/LemmyNet/lemmy/blob/main/crates/apub/src/http/mod.rs#L34

phiresky commented 1 week ago

I don't really understand what this is supposed to improve. If you really can't handle a single activity in 15 seconds (not sure what the current timeout is), then you probably need to fix your software or hardware, since a normal user won't be able to perform any actions either. 125 seconds in particular is a huge amount of time.

I would be interested in more concrete examples of a timeout actually causing a problem and an increased timeout actually fixing that problem.

Occasional timeouts are fine and that's what we have the exponential resend back-off for.

I don't have any particularly useful stories either, but in general large timeouts have a disadvantage: they are a time frame in which there's zero feedback. And you can't really apply back-off to a timeout to adapt to conditions, so it's a blind change in any case.
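
For readers unfamiliar with the exponential resend back-off mentioned above, a minimal sketch follows. The formula (base delay doubled per failed attempt, capped at a maximum) and the name `backoff_delay` are assumptions for illustration, not Lemmy's actual schedule.

```rust
use std::time::Duration;

/// Hypothetical back-off schedule: `base * 2^retries`, capped at `max`.
/// Overflow in either the power or the multiplication falls back to `max`.
fn backoff_delay(retries: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.checked_pow(retries).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}
```

With a 10 second base and a one hour cap, the delays under this assumed schedule would be 10s, 20s, 40s, 80s and so on until the cap is reached.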

Nothing4You commented 1 week ago

The current timeout is 10 seconds.

You usually won't need much time to process activities that only reference known objects. But if you have to resolve objects, e.g. when receiving announced votes and resolving their actors, or when resolving a comment from a nested comment chain where you may be missing earlier replies, this can take a good amount of time.

While the time until feedback is indeed longer, if you abort sending the activity before the remote software decides to fail the request, you'll just keep retrying the same activity. It might never succeed and stay stuck in a retry loop, whereas waiting long enough would let it fail properly in a way that isn't retryable.
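
The retry-loop argument can be sketched as a small model of what the sender observes. Everything here is hypothetical: the `Outcome` enum, the `observe` function, and the policy of treating 4xx responses as non-retryable and everything else as retryable are assumptions for illustration, not Lemmy's actual classification.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Outcome {
    Delivered, // receiver accepted the activity
    Retry,     // sender will send the same activity again
    Drop,      // sender gives up on this activity permanently
}

/// What the sender sees, given its own timeout, how long the receiver needs
/// before answering, and the status the receiver would eventually return.
/// Assumed policy: timeouts and non-2xx/4xx statuses are retryable, 4xx is not.
fn observe(sender_timeout: Duration, receiver_time: Duration, eventual_status: u16) -> Outcome {
    if receiver_time > sender_timeout {
        // The sender aborts first and never learns the eventual status.
        return Outcome::Retry;
    }
    match eventual_status {
        200..=299 => Outcome::Delivered,
        400..=499 => Outcome::Drop,
        _ => Outcome::Retry,
    }
}
```

Under these assumptions, a receiver that needs 45 seconds to reject an activity looks like a retryable timeout to a sender with a 10 second limit on every attempt, while a sender with a 125 second limit sees the rejection once and moves on.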

phiresky commented 1 week ago

I agree 10s is kinda short and I'd be fine with setting it to 30s due to the resolving, but I'd still be interested in specific examples if we really want to do more than that. If there's a case like you mention where some software does indeed go into an endless loop, then we should already be seeing this happen somewhere with the current method of sending a single activity at a time, since that outgoing federation queue would stay on the same activity forever.

I also remember we had some other issues where overly high timeouts created potential for intentional DoSing, though afaik that was more on the receiving side in this code.

Nothing4You commented 1 week ago

Hubzilla uses a 60 second timeout by default for incoming activity processing; I'm not sure about others. Just above 120 seconds is to err on the safe(r) side, as there are no timeouts defined in the spec. We can't expect specific limits on the receiving side, so this is just guesswork about what is likely to be most compatible. We also can't tell whether a request timed out because specific information in the activity results in extremely slow processing, or because the instance is just having a hard time keeping up or is currently unavailable.