LemmyNet / lemmy

🐀 A link aggregator and forum for the fediverse
https://join-lemmy.org
GNU Affero General Public License v3.0
13.12k stars 863 forks

Outgoing federation problem #4288

Closed: Nutomic closed this issue 7 months ago

Nutomic commented 8 months ago

Requirements

Summary

There seems to be a problem with federating outgoing activities in 0.19. For example, the post "Is federation not working on lemmy.ml ?" in !lemmy_support is (ironically) not making it to any other instances.

Also, the mod log for /c/memes hasn't federated to lemmy.ml in five days.

The federation queue shows all of these instances as up to date, and the server logs don't show anything relevant.

Edit: Also discussed here

Version

0.19.0

Lemmy Instance URL

lemmy.ml

phiresky commented 8 months ago

That's odd. Since the federation queue state says they are up to date, that (I think) means one of these things:

  1. The queue didn't think that the event should be sent to the given instance, so it skipped it.
  2. The receiving server responded with a 2XX or 4XX status code to the /inbox request, so it was saved as successful (see the sketch after this list).
  3. There's some bug where it skips ahead in the queue for some reason (I don't know how that could happen; the only time it does this is when the instance is first upgraded to 0.19).
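
To make point 2 concrete, here is a minimal sketch in Rust of that success rule; the function name is hypothetical and this is not the actual activitypub_federation code, which may differ in detail:

// Sketch only: 2xx means accepted and 4xx means permanently rejected (retrying
// would not help), so both advance the queue; other failures are retried.
fn counts_as_delivered(status: u16) -> bool {
    matches!(status, 200..=299 | 400..=499)
}

fn main() {
    // 200 and 400 both count as delivered; 503 would be retried with backoff.
    for status in [200_u16, 400, 503] {
        println!("{status}: delivered = {}", counts_as_delivered(status));
    }
}

Under that rule, a receiving instance that consistently answers /inbox with a 4xx still shows as up to date in the queue even though nothing actually arrives.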

Could you look in the sent_activity table and send the data for one event we know didn't arrive? You should be able to find it by doing select * from sent_activity where data::text like '%Is federation not working on lemmy.ml ?%'. That entry should have send_community_followers_of set to the community id.

Nutomic commented 8 months ago

That query doesn't return anything. It seems that the activity wasn't submitted to the queue at all. That would also explain why it's missing from all other instances.

phiresky commented 8 months ago

I've updated the query; it was a bit broken. Try again? If it's not in there, then that's a different issue, but @sunaurus reports having the issue for activities that are in the table.

phiresky commented 8 months ago

Can you increase the log level of the federation process to RUST_LOG="warn,lemmy_federate=info" and see if it logs any 4xx failures?

Nutomic commented 8 months ago

Right, I also forgot the ; for that query. Anyway, it still doesn't return anything.

4xx errors for activity sending are logged here. I changed the log level to debug for that crate and grepped for "was rejected". There are countless messages, but nothing that seems relevant.

lemmy-federate_1  | 2023-12-18T14:14:45.170945Z DEBUG activitypub_federation::activity_sending: Activity https://lemmy.ml/activities/announce/like/6fd47179-d88e-45f1-a53a-828b45d274a8 to https://lemmy.richardkramer.de/inbox was rejected, aborting: <!DOCTYPE html>
lemmy-federate_1  | 2023-12-18T14:14:45.174278Z DEBUG activitypub_federation::activity_sending: Activity https://lemmy.ml/activities/announce/like/2ac85733-a54c-4f8a-bb1d-661efce55c5e to https://re.tei.li/inbox was rejected, aborting: {"error":"unknown","message":"Request error: error sending request for url (https://lemmy.ml/c/memes): error trying to connect: error:16000069:STORE routines:ossl_store_get0_loader_int:unregistered scheme:../crypto/store/store_register.c:237:scheme=file, error:80000002:system library:file_open:reason(2):../providers/implementations/storemgmt/file_store.c:267:calling stat(/usr/lib/ssl/certs), error:16000069:STORE routines:ossl_store_get0_loader_int:unregistered scheme:../crypto/store/store_register.c:237:scheme=file, error:80000002:system library:file_open:reason(2):../providers/implementations/storemgmt/file_store.c:267:calling stat(/usr/lib/ssl/certs), error:16000069:STORE routines:ossl_store_get0_loader_int:unregistered scheme:../crypto/store/store_register.c:237:scheme=file, error:80000002:system library:file_open:reason(2):../providers/implementations/storemgmt/file_store.c:267:calling stat(/usr/lib/ssl/certs), error:16000069:STORE routines:ossl_store_get0_loader_int:unregistered scheme:../crypto/store/store_register.c:237:scheme=file, error:80000002:system library:file_open:reason(2):../providers/implementations/storemgmt/file_store.c:267:callingstat(/usr/lib/ssl/certs), error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:../ssl/statem/statem_clnt.c:1889: (unable to get local issuer certificate)"}
lemmy-federate_1  | 2023-12-18T14:14:45.176978Z DEBUG activitypub_federation::activity_sending: Activity https://lemmy.ml/activities/announce/like/c9609ed9-f07f-4b96-81a4-f778693cabfc to https://readit.buzz/f/inbox was rejected, aborting:
lemmy-federate_1  | 2023-12-18T14:14:45.181506Z DEBUG activitypub_federation::activity_sending: Activity https://lemmy.ml/activities/announce/like/c9609ed9-f07f-4b96-81a4-f778693cabfc to https://2k7.net/inbox was rejected, aborting:
lemmy-federate_1  | 2023-12-18T14:14:45.186670Z DEBUG activitypub_federation::activity_sending: Activity https://lemmy.ml/activities/announce/like/185a1a66-fd2d-45d4-9cea-78eeca3c37df to https://lemmyloves.art/inbox was rejected, aborting: <html>
lemmy-federate_1  | 2023-12-18T14:14:45.187378Z DEBUG activitypub_federation::activity_sending: Activity https://lemmy.ml/activities/announce/like/c9609ed9-f07f-4b96-81a4-f778693cabfc to https://d680-143-44-140-23.ngrok-free.app/inbox was rejected, aborting: <!DOCTYPE html>

And here are the logs for lemmy.world, which look normal.

lemmy-federate_1  | 2023-12-18T12:36:40.760572Z  INFO lemmy_federate::worker: lemmy.world: retrying ActivityId(17953756) attempt 1 with delay 20.00s. (Request error: error sending request for url (https://lemmy.world/inbox): operation timed out)
lemmy-federate_1  | 2023-12-18T12:36:50.321236Z  INFO lemmy_federate: lemmy.world: Warning. 6 behind, 1 consecutive fails, current retry delay 20.00s
lemmy-federate_1  | 2023-12-18T12:37:50.318491Z  INFO lemmy_federate: lemmy.world: Warning. 26 behind, 1 consecutive fails, current retry delay 20.00s
lemmy-federate_1  | 2023-12-18T12:42:55.625274Z  INFO lemmy_federate::worker: lemmy.world: retrying ActivityId(17953896) attempt 1 with delay 20.00s. (Request error: error sending request for url (https://lemmy.world/inbox): operation timed out)
lemmy-federate_1  | 2023-12-18T12:43:50.319353Z  INFO lemmy_federate: lemmy.world: Warning. 41 behind, 1 consecutive fails, current retry delay 20.00s
lemmy-federate_1  | 2023-12-18T12:49:09.148026Z  INFO lemmy_federate::worker: lemmy.world: retrying ActivityId(17954003) attempt 1 with delay 20.00s. (Request error: error sending request for url (https://lemmy.world/inbox): operation timed out)
lemmy-federate_1  | 2023-12-18T12:49:50.321273Z  INFO lemmy_federate: lemmy.world: Warning. 23 behind, 1 consecutive fails, current retry delay 20.00s
lemmy-federate_1  | 2023-12-18T13:10:14.724487Z  INFO lemmy_federate::worker: lemmy.world: retrying ActivityId(17954466) attempt 1 with delay 20.00s. (Request error: error sending request for url (https://lemmy.world/inbox): operation timed out)
lemmy-federate_1  | 2023-12-18T13:10:50.320893Z  INFO lemmy_federate: lemmy.world: Warning. 25 behind, 1 consecutive fails, current retry delay 20.00s
lemmy-federate_1  | 2023-12-18T13:45:20.573342Z  INFO lemmy_federate::worker: lemmy.world: retrying ActivityId(17955167) attempt 1 with delay 20.00s. (Request error: error sending request for url (https://lemmy.world/inbox): operation timed out)
lemmy-federate_1  | 2023-12-18T13:45:50.320426Z  INFO lemmy_federate: lemmy.world: Warning. 12 behind, 1 consecutive fails, current retry delay 20.00s
phiresky commented 8 months ago

I'm still wary of them not being in sent_activity; maybe the above query is still wrong. If it really isn't in that table, then there are two very different issues: it not being inserted would point to something near send_activity_in_community, while it being skipped during sending would point to something in the lemmy_federate crate.

Nutomic commented 8 months ago

I think it's really a problem with the DB inserts (or an error being thrown at some earlier point). Just now I performed some federated actions, like following remote communities and voting on remote posts, yet none of them are reflected in the sent_activity table:

select published,ap_id from sent_activity where actor_apub_id = 'https://lemmy.ml/u/nutomic' order by published desc;
           published           |                                  ap_id
-------------------------------+--------------------------------------------------------------------------
2023-12-18 09:00:37.742173+00 | https://lemmy.ml/activities/accept/01777630-b450-4a72-bd7c-c6478ab59301
2023-12-15 09:27:00.765491+00 | https://lemmy.ml/activities/accept/1567666d-32aa-4734-9c36-8ecb1f27d38f
2023-12-14 22:27:50.511456+00 | https://lemmy.ml/activities/like/bb0c74eb-0b17-45f4-9543-b5076511f23b
2023-12-14 22:26:10.024914+00 | https://lemmy.ml/activities/like/c11c7e48-3f41-427e-a603-95cc739b9d44
phiresky commented 8 months ago

Huh, that's interesting then, and should be "easy" to debug. Maybe you can set the log level of the main process to info to see if this log happens for those activities:

https://github.com/LemmyNet/lemmy/blob/dcb89f52d6c7a0dac6d0cd769ea00c31da7d8dcd/crates/apub/src/activities/mod.rs#L206

Maybe the issue then would be somewhere in the match_outgoing_activities channel handling ( https://github.com/LemmyNet/lemmy/blob/main/crates/apub/src/activities/mod.rs ), because we also changed that code.

phiresky commented 8 months ago

There's definitely an issue there: https://github.com/LemmyNet/lemmy/blob/dcb89f52d6c7a0dac6d0cd769ea00c31da7d8dcd/crates/apub/src/activities/mod.rs#L228-L233

If there's an error anywhere in there, then that loop permanently ends, and the error isn't even logged (until server shutdown), because the task is tokio::spawned and only joined at the end of the process.
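
To make this concrete, here is a minimal self-contained sketch (hypothetical names, assuming the tokio crate; not the actual Lemmy code) of a channel loop that is tokio::spawned at startup, dies on the first error, and only surfaces that error when the handle is joined at shutdown:

use tokio::sync::mpsc;

// Hypothetical stand-in for inserting an outgoing activity into sent_activity;
// a pool timeout or serialization error would surface as an Err here.
async fn save_outgoing_activity(ap_id: &str) -> Result<(), String> {
    if ap_id.contains("bad") {
        return Err(format!("pool timeout while saving {ap_id}"));
    }
    println!("saved {ap_id}");
    Ok(())
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel::<String>();

    // Spawned once and only joined at shutdown: if the body returns an error,
    // the loop ends, nothing is logged, and every later activity is silently
    // dropped instead of being written to the queue.
    let handle = tokio::spawn(async move {
        while let Some(ap_id) = rx.recv().await {
            save_outgoing_activity(&ap_id).await?; // `?` exits the whole loop
        }
        Ok::<(), String>(())
    });

    for ap_id in [
        "https://example.com/activities/like/1",
        "https://example.com/activities/bad/2",  // fails, ends the loop
        "https://example.com/activities/like/3", // never saved
    ] {
        let _ = tx.send(ap_id.to_string());
    }
    drop(tx);

    // The error is only observed here, at process shutdown.
    println!("outgoing activity loop ended with: {:?}", handle.await.unwrap());
}

This matches the symptom reported above: activities created after the first failure never show up in sent_activity, and no error appears in the logs while the server is running.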

phiresky commented 8 months ago

PR to fix that: https://github.com/LemmyNet/lemmy/pull/4295 (not sure if it fixes this issue though)

Nutomic commented 8 months ago

Version 0.19.1-rc.1 is now available, which should hopefully fix this issue.

phiresky commented 8 months ago

Let me know if you see any log messages with "error while saving outgoing activity". Also, sunaurus reported broken federation for some events that were present in the sent_activity table, which can't be something fixed by this.

hanubeki commented 8 months ago

I'm not sure if it is the same issue, but I don't receive any posts on my Mastodon server fedibird.com (a fork based on Mastodon v3.4.1) after the server was updated to 0.19.0 and 0.19.1-rc.1. My friend reports the same issue with his Mastodon v4.1.8 server as well.

Edit: I don't own that Lemmy instance, so I can't provide logs directly.

etymotic commented 8 months ago

I believe there is still a federation bug.

The response from walden seen here (https://sub.wetshaving.social/comment/967575) does not appear on the other instance (https://discuss.tchncs.de/comment/5855046). Here's another link, this time to the original thread: https://lemmy.ml/comment/6803892

I gave it some time in case it was a queue backlog, but the comment was posted over 30 minutes ago, so it should have gone through by now.

phiresky commented 8 months ago

That instance does not seem to currently have working federation due to something else; see https://phiresky.github.io/lemmy-federation-state/site?domain=sub.wetshaving.social . Its federation process appears not to have been running since yesterday (for whatever reason, of a different shape than this issue).

SrivatsanSenthilkumar commented 8 months ago

feddit.de is also running into outgoing federation problems, even though it has been updated to 0.19.1.

https://feddit.de/comment/5822060. This comment cannot be viewed from other instances. I have tried lemmy.one, lemmy.world, and lemmy.dbzer0.

From the Lemmy federation state site, it seems that feddit.de sees almost all other instances as dead: https://phiresky.github.io/lemmy-federation-state/site?domain=feddit.de

bjoern-tantau commented 8 months ago

My comments from my personal instance also only started federating after I restarted the Docker container. I think this bug is still present. You should keep an eye on it.

etymotic commented 8 months ago

That instance does not seem to currently have working federation due to something else

I restarted that instance, and now all of the other instances that I linked to have received the post. I'll keep an eye on https://phiresky.github.io/lemmy-federation-state/site?domain=sub.wetshaving.social and see how things go. Thanks.

karasugawasu commented 8 months ago

lm.korako.me also has similar problems. I have looked at the Lemmy federation state site based on your comments, but it seems that most instances show up as lagging: https://phiresky.github.io/lemmy-federation-state/site?domain=lm.korako.me

wereii commented 8 months ago

My personal instance (0.19.1) also showed no up-to-date instances (and most as lagging): https://phiresky.github.io/lemmy-federation-state/site?domain=biglemmowski.win . After restarting, it took around 30 minutes before the lagging and up-to-date numbers switched.
From what I could check, the last_successful_published_time in /api/v3/federated_instances was indeed late by at least ten minutes, but it wasn't far behind, mostly only 5-10 minutes on top of that.

Could this just be a caching issue? I do have Cloudflare fronting the website.

Edit: Now it actually shows them as lagging again - by ~30 minutes.

In logs I see:

lemmy-lemmy-1  | 2023-12-22T11:38:57.290481Z  INFO lemmy_federate: Federation state as of 2023-12-22T11:38:57+00:00:
lemmy-lemmy-1  | 2023-12-22T11:38:57.290549Z  INFO lemmy_federate: 1872 others up to date. 0 instances behind.
lemmy-lemmy-1  | 2023-12-22T11:38:57.759274Z  INFO lemmy_federate: Federating to 1872/2599 instances (727 dead, 0 disallowed)
etymotic commented 8 months ago

I kept an eye on it today using https://phiresky.github.io/lemmy-federation-state/site?domain=sub.wetshaving.social and federation seemed to work for about 8 hours after a restart before failing. I just restarted it again and changed the logging to RUST_LOG="warn,lemmy_federate=info" to get more information that way.

karasugawasu commented 8 months ago

Outgoing federation is still not working on my server. Also, users and communities remain in a pending state when I follow them from Mastodon.

I will look at the logs later.

(Sorry, I don't speak English, so I'm using machine translation)

phiresky commented 8 months ago

If most instances show as dead, then that's #4039. If the instances show as lagging, then it's a new bug where the queue creation loop somehow stops.

etymotic commented 8 months ago

I've set a cron job to restart the Lemmy container every 6 hours, and so far so good. lemmy-ui, PostgreSQL, etc. keep chugging along. Instances show as lagging a lot of the time, but it always goes back to "up to date" eventually. https://phiresky.github.io/lemmy-federation-state/site?domain=sub.wetshaving.social

SrivatsanSenthilkumar commented 8 months ago

If most instances show as dead, then that's #4039. If the instances show as lagging, then it's a new bug where the queue creation loop somehow stops.

Thanks, I will follow that and make sure the admins of feddit.de do the same.

karasugawasu commented 8 months ago

I changed it to RUST_LOG="warn,lemmy_federate=info" and looked at the logs, but could not find any useful information.

Federation is still not working on lm.korako.me. This does not change even immediately after rebooting.

lemmy-lemmy-1  | 2023-12-26T01:45:51.000316Z  INFO lemmy_federate: Federation state as of 2023-12-26T01:45:51+00:00:
lemmy-lemmy-1  | 2023-12-26T01:45:51.000424Z  INFO lemmy_federate: 1359 others up to date. 0 instances behind.
lemmy-lemmy-1  | 2023-12-26T01:45:51.320892Z  INFO lemmy_federate: Federating to 1376/1832 instances (456 dead, 0 disallowed)

Is there anywhere else I should look?

As additional information, I use Cloudflare Tunnel; is this relevant?

phiresky commented 8 months ago

Is there anywhere else I should look?

What do those messages ("Federation state as of XXX: xxx others up to date. XX instances behind") look like over time?

What do they look like after federation stops working?

wereii commented 8 months ago

For me it looks like this: https://termbin.com/omf8

Right now https://phiresky.github.io/lemmy-federation-state/site?domain=biglemmowski.win shows all 1865 instances as lagging behind (even with dev tools open and "Disable cache" ticked in Firefox), but the last log entries look like this:

lemmy-lemmy-1  | 2023-12-26T13:15:46.851551Z  INFO lemmy_federate: Federating to 1873/2610 instances (737 dead, 0 disallowed)
lemmy-lemmy-1  | 2023-12-26T13:15:57.290648Z  INFO lemmy_federate: Federation state as of 2023-12-26T13:15:57+00:00:
lemmy-lemmy-1  | 2023-12-26T13:15:57.290717Z  INFO lemmy_federate: 1873 others up to date. 14 instances behind.
phiresky commented 8 months ago

@wereii thanks for the logs. Do the logs really end at 13:15Z? That's weird, then, because the last successful send seems to be at 2023-12-26T15:43:22.237593Z for most instances, which is 1.5 h after that last log entry, and of course that is also already 30 minutes ago.

phiresky commented 8 months ago

I might have found the issue; #4330 should fix it. The underlying cause would still be pool timeout errors, though, which are caused by a too-small pool or insufficient hardware. Under normal operation the pool should never really time out.
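
As an illustration of the shape of such a fix (hedged: a sketch under assumed names, not a summary of what #4330 actually changes), the idea is to treat a failed iteration as a logged, retryable event instead of letting the error terminate the loop, so a single pool timeout no longer stalls federation until the next restart.

use std::time::Duration;

// Hypothetical stand-in for one iteration of the federation loop (for example,
// reading newly created activities from the database); a pool timeout shows up
// here as an Err.
async fn run_one_iteration(counter: u64) -> Result<(), String> {
    if counter == 2 {
        return Err("connection pool timeout".into());
    }
    println!("iteration {counter} ok");
    Ok(())
}

#[tokio::main]
async fn main() {
    for counter in 0..5_u64 {
        // Log and back off instead of propagating the error out of the loop.
        if let Err(e) = run_one_iteration(counter).await {
            eprintln!("federation loop iteration failed, will retry: {e}");
            tokio::time::sleep(Duration::from_millis(100)).await;
        }
    }
}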

wereii commented 8 months ago

Regarding the first comment: here is docker compose logs --tail=1000 lemmy | grep 'lemmy_federate', taken at Tue 2023-12-26 18:26:32 UTC.

lemmy-lemmy-1  | 2023-12-26T18:21:49.365376Z  INFO lemmy_federate: Federating to 1874/2611 instances (737 dead, 0 disallowed)
lemmy-lemmy-1  | 2023-12-26T18:21:57.290219Z  INFO lemmy_federate: Federation state as of 2023-12-26T18:21:57+00:00:
lemmy-lemmy-1  | 2023-12-26T18:21:57.290413Z  INFO lemmy_federate: 0 others up to date. 1888 instances behind.
lemmy-lemmy-1  | 2023-12-26T18:22:49.374487Z  INFO lemmy_federate: Federating to 1874/2611 instances (737 dead, 0 disallowed)
lemmy-lemmy-1  | 2023-12-26T18:22:57.289925Z  INFO lemmy_federate: Federation state as of 2023-12-26T18:22:57+00:00:
lemmy-lemmy-1  | 2023-12-26T18:22:57.290128Z  INFO lemmy_federate: 1003 others up to date. 885 instances behind.
lemmy-lemmy-1  | 2023-12-26T18:23:49.383090Z  INFO lemmy_federate: Federating to 1874/2611 instances (737 dead, 0 disallowed)
lemmy-lemmy-1  | 2023-12-26T18:23:57.289988Z  INFO lemmy_federate: Federation state as of 2023-12-26T18:23:57+00:00:
lemmy-lemmy-1  | 2023-12-26T18:23:57.290041Z  INFO lemmy_federate: 1874 others up to date. 14 instances behind.
lemmy-lemmy-1  | 2023-12-26T18:24:49.391845Z  INFO lemmy_federate: Federating to 1874/2611 instances (737 dead, 0 disallowed)
lemmy-lemmy-1  | 2023-12-26T18:24:57.289839Z  INFO lemmy_federate: Federation state as of 2023-12-26T18:24:57+00:00:
lemmy-lemmy-1  | 2023-12-26T18:24:57.290136Z  INFO lemmy_federate: 0 others up to date. 1888 instances behind.
lemmy-lemmy-1  | 2023-12-26T18:25:49.398711Z  INFO lemmy_federate: Federating to 1874/2611 instances (737 dead, 0 disallowed)
lemmy-lemmy-1  | 2023-12-26T18:25:57.289798Z  INFO lemmy_federate: Federation state as of 2023-12-26T18:25:57+00:00:
lemmy-lemmy-1  | 2023-12-26T18:25:57.289844Z  INFO lemmy_federate: 1874 others up to date. 14 instances behind.

And about the performance bit: the container is not restricted, so it has 64 GB of RAM (~40 free) and all twelve threads of a Ryzen 5 3600 to use. I don't have graphs/history, but the processor mostly idles even with other services running there. So in my case it shouldn't be due to resource starvation.

EDIT2: Interestingly, I didn't notice it before, but now I can actually see lines like the ones below; previously it showed almost all instances as up to date and only a few as behind.

lemmy-lemmy-1  | 2023-12-26T17:52:57.290240Z  INFO lemmy_federate: Federation state as of 2023-12-26T17:52:57+00:00:
lemmy-lemmy-1  | 2023-12-26T17:52:57.290363Z  INFO lemmy_federate: 1111 others up to date. 777 instances behind.
lemmy-lemmy-1  | 2023-12-26T17:53:49.137676Z  INFO lemmy_federate: Federating to 1874/2611 instances (737 dead, 0 disallowed)

This is after 4 days of uptime.

karasugawasu commented 8 months ago

The "Federation state" message is output every minute, but there seems to be nothing wrong with it.

I have been experimenting and have found that only Mastodon is not federating well, while Misskey and Akkoma are federating correctly.

When I follow a Lemmy community or user from Mastodon, the follow remains stuck in the requested state.

(Sorry, I don't speak English, so I'm using DeepL translation)

linux-cultist commented 8 months ago

2023-12-30T19:57:58.773121Z  INFO lemmy_federate: Federation state as of 2023-12-30T19:57:58+00:00:
2023-12-30T19:57:58.773146Z  INFO lemmy_federate: lemmy.idevicehacked.com: Warning. 1505 behind, 9 consecutive fails, current retry delay 5120.00s
2023-12-30T19:57:58.773155Z  INFO lemmy_federate: 0 others up to date. 64 instances behind.

Looks worse for me. If I can provide any debug info, let me know...

etymotic commented 8 months ago

Not a solution, but a workaround that I just shared on Lemmy: https://sub.wetshaving.social/post/487989

Adding the following to the crontab keeps federation chugging along, and by not restarting at midnight it helps avoid this bug:

0 1 * * * docker container restart lemmy-lemmy-1
0 7 * * * docker container restart lemmy-lemmy-1
0 13 * * * docker container restart lemmy-lemmy-1
0 19 * * * docker container restart lemmy-lemmy-1
jivandabeast commented 8 months ago

I was wondering why my comments suddenly stopped being engaged with!! lol

I can confirm that this is still an issue; restarting did help. I'm going to look into doing something like @etymotic's workaround.

linux-cultist commented 8 months ago

Not a solution, but a workaround that I just shared on Lemmy: https://sub.wetshaving.social/post/487989

Adding the following to the crontab keeps federation chugging along, and by not restarting at midnight it helps avoid this bug:

0 1 * * * docker container restart lemmy-lemmy-1
0 7 * * * docker container restart lemmy-lemmy-1
0 13 * * * docker container restart lemmy-lemmy-1
0 19 * * * docker container restart lemmy-lemmy-1

If you want, you can shorten this to:

0 1,7,13,19 * * * docker container restart lemmy-lemmy-1

It looks a bit cleaner, so you don't have to have 4 commands in there. :)

I had that command in my crontab for a while, but sometimes Lemmy would fail to start. It seems to work better to take down the entire stack and sleep in the middle:

docker compose down && sleep 5 && docker compose up -d
linux-cultist commented 8 months ago

lemmy.today fixed its federation problem by simply updating the instances table in its DB:

https://lemmy.today/comment/4405283

Perhaps other instances are having similar issues as well, especially if they have been upgrading from version to version and something went wrong along the way.

Maybe it's not the Lemmy software itself that has a bug, but rather the 0.19 upgrade somehow triggered the federation failure?

Edit: No, it's not fixed, just improved in some ways. Federation seems to stop after a while and not resume until a restart, and even a restart doesn't federate certain posts to certain servers.

sunaurus commented 8 months ago

FYI, on lemm.ee I have been testing 0.19.1 patched with the #4330 changes by @phiresky for the past week and have noticed no further issues with outgoing federation, so I think this issue will be resolved with the next release.

linux-cultist commented 8 months ago

@sunaurus That's really good, and I feel like this should go out in a minor release as soon as possible, since it's a critical bug to fix.

dessalines commented 8 months ago

We'll try to get one out shortly.

linux-cultist commented 7 months ago

@dessalines @Nutomic I think this issue can be closed now. Federation seems to work normally again on 0.19.2 and above. Thank you!