LemmyNet / lemmy

🐀 A link aggregator and forum for the fediverse
https://join-lemmy.org
GNU Affero General Public License v3.0

federation between instances showing many comments not replicating (2023-06-13 example) #3101

Closed · RocketDerp closed this issue 1 year ago

RocketDerp commented 1 year ago

The normal bug report form does not serve this kind of issue. So far, the problem can be observed and reported, but not identified at the code or log level.

Even after lemmy.ml upgraded its hardware yesterday, this post serves as an example of how comments are not making it to other Lemmy instances. It is the same post, with a different id on each instance:

https://lemmy.ml/post/1239920 has 19 comments (community home)
https://lemmy.pro/post/1354 has 1 comment
https://sh.itjust.works/post/74015 has 9 comments
https://lemmy.world/post/103943 has 9 comments
https://programming.dev/post/29631 has 13 comments
https://beehaw.org/post/536574 has 7 comments
https://lemmy.haigner.me/post/8 has 6 comments (posting originated with user there)
https://lemmy.wtf/post/1021 has 10 comments

Nutomic commented 1 year ago

I answered this in https://github.com/LemmyNet/lemmy/issues/3062#issuecomment-1591968991

Hellhound1 commented 1 year ago

@Nutomic, could you take a look at this then, please? I believe it's the same problem, but I can't find anything in the logs to suggest why this is happening.

https://lemmy.zip/post/15952 (our community)
https://lemmy.world/post/72446 (their community)

To add to that, if I look at the user's profile from our community, there is only one comment: https://lemmy.zip/u/robotsheepboy@lemmy.world

If I look from their instance, there are lots of comments, including others in that community: https://lemmy.world/u/robotsheepboy

Is this the same issue? There is no user ban, and no community ban either.

RocketDerp commented 1 year ago

Issue #3133 was closed as a duplicate of this problem, but it does offer some more hand-linked examples of how widespread this issue is.

I think instance server operators also need to look at how many users have subscribe/join requests stuck in the 'pending' state against other federated instances. This is one place in the UI where they can see that something isn't right with data making it between instance servers. This PostgreSQL query will show you how many pending subscribes/joins are in your database: SELECT * FROM community_follower WHERE pending='t'; -- EDIT: I have since opened an issue on this specific problem: #3203
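To break that down per remote host, here is a rough sketch that assumes the community_follower and community tables from the 0.17/0.18 schema (column names may differ on other versions, so check your own database first):

```sql
-- Sketch, assuming the 0.17/0.18 schema: count follows still stuck in the
-- 'pending' state, grouped by the host of the community's actor_id.
SELECT split_part(c.actor_id, '/', 3) AS remote_host,
       count(*)                       AS pending_follows
FROM community_follower cf
JOIN community c ON c.id = cf.community_id
WHERE cf.pending = 't'
GROUP BY 1
ORDER BY 2 DESC;
```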

I am personally struggling to communicate on these topics, but I am trying to help. I did start to play with a NodeJS webapp that lets instance administrators run SQL directly against Lemmy: https://github.com/RocketDerp/lemmy_helper

Loriborn commented 1 year ago

I'm experiencing a similar issue on my instance but I'm not sure if it's a federation issue or an issue with my configuration. My setup is via Docker, and while I can federate, and see federated instances/communities, I cannot see votes or comments at all.

tabletop.place

Berulacks commented 1 year ago

I'm experiencing a similar issue on my instance but I'm not sure if it's a federation issue or an issue with my configuration. My setup is via Docker, and while I can federate, and see federated instances/communities, I cannot see votes or comments at all.

tabletop.place

I'm having the same issue, trying to figure out what I did wrong during set-up.

Edit: My fault, I was using Postgresql 11 instead of 15 - was getting errors trying to add comments to the DB. Woopsie! After upgrading Postgres and my DB, all's good.

RocketDerp commented 1 year ago

I can federate, and see federated instances/communities, I cannot see votes or comments at all.

At least one user on your site must subscribe to/join a community for its content to start being shared. Did you do that?
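A quick way to confirm this on the database side (again assuming the 0.17/0.18 community_follower table) is to check whether any follow rows exist at all:

```sql
-- Sketch: does this instance have any community follows, and how many of
-- them are still stuck in the 'pending' state?
SELECT count(*)                              AS total_follows,
       count(*) FILTER (WHERE pending = 't') AS still_pending
FROM community_follower;
```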

RocketDerp commented 1 year ago

Another example of comments failing to replicate between instances. This is a 4-day-old post where comment activity has settled down (although there were comments within the past day):

https://lemmy.ml/post/1244528 has 13 comments
https://beehaw.org/post/548107 has 6 comments
https://lemmy.world/post/112682 has 11 comments
https://lemmy.cloudhub.social/post/14150 has 8 comments
https://feddit.de/post/828172 has 4 comments
https://lemmy.one/post/104026 has 9 comments

Out of 6 servers, no two show the same count.

DomiStyle commented 1 year ago

This is by far the biggest issue I'm having with Lemmy. Content being missing from federated instances is sort of a deal breaker. I posted these examples 3 days ago but the situation has still not improved:

https://lemmy.ml/post/1250165

Instance Comments Votes
lemmy.ml 13 95
beehaw.org 6 22
lemmy.world 11 64
sh.itjust.works 10 45

https://beehaw.org/post/548636

Instance Comments Votes
beehaw.org 70 59
lemmy.ml 63 20
lemmy.world 66 160
sh.itjust.works 62 127

This seems to happen mostly with lemmy.ml and beehaw.org, so it is probably related to server load. lemmy.ml is pretty much dead, with just 1/4 of comments coming through; in smaller communities, sometimes none.

With lemmy.ml it has gotten so bad that even trying to subscribe to a new community is stuck at "Subscribe pending" while subscriptions from lemmy.world are done in under a second. beehaw.org needs a few seconds to come through but it works.

From issue #3062, @Nutomic:

@DomiStyle I'm not sure what the reason is; it's certainly worth investigating. A possibility would be instance blocks or user bans. It could also be networking problems, or a software bug. There is also https://github.com/LemmyNet/lemmy/issues/2142, which means that activities will get lost during restart.

I don't think it's related to instance blocks or user bans; my instance is not blocked on any of the listed instances. It's also not related to server restarts, since it's been like this for days. And since it happens on all instances, it's probably not network related either.

Is there a way to get more diagnostic info from Lemmy? Like database queries per second, average query response time, running jobs/tasks, incoming/outgoing federation activities, number of errors per hour/day/week?

It seems like we're poking in the dark right now.
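In the meantime, some of this can be approximated on the PostgreSQL side with the standard statistics views. This is plain Postgres, nothing Lemmy-specific, so it should work on any instance:

```sql
-- Sketch: rough view of what the backends are doing right now, plus the
-- database-wide transaction, deadlock, and cache counters.
SELECT state, wait_event_type, count(*)
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state, wait_event_type
ORDER BY count(*) DESC;

SELECT xact_commit, xact_rollback, deadlocks, blks_hit, blks_read
FROM pg_stat_database
WHERE datname = current_database();
```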

RocketDerp commented 1 year ago

I think the PostgreSQL table structure and record locking are causing silent failures that server operators are not seeing. SQL timeouts and errors are not bubbling up into end-user messages or operator attention.

Once you get a significant number of comments in the database, I see signs of contention. The SQL query select * from pg_locks; on my Oracle Cloud instance (24 GB of RAM, running only Lemmy) shows periods with many locks while I have only one interactive user on my site... federation activity alone is keeping the database very busy.
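For anyone who wants to go a step beyond a raw select * from pg_locks;, here is a sketch (plain Postgres 9.6 or later, nothing Lemmy-specific) that pairs each backend waiting on a lock with the query blocking it:

```sql
-- Sketch: for each backend waiting on a lock, show the query that blocks it.
-- pg_blocking_pids() is available in Postgres 9.6 and later.
SELECT waiting.pid    AS waiting_pid,
       waiting.query  AS waiting_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity AS waiting
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(waiting.pid))
WHERE waiting.wait_event_type = 'Lock';
```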

I imagine I'm not the only instance operator interested in building some back-fill protocol to identify and fill in the missing comments and posts, even if it is strictly lemmy-to-lemmy and does not use the ActivityPub protocol. The general attitude over the past 30 days has been that the key to scaling Lemmy is to move off busy instances and distribute more, but now federation activity alone is failing, with errors that aren't getting bubbled up to operators.

RocketDerp commented 1 year ago

Today there are fresh reports of this ongoing problem:

https://sh.itjust.works/post/257269
https://lemmy.ml/post/1380899
https://lemmy.ml/post/1405339
https://lemmy.ml/post/1404365

RocketDerp commented 1 year ago

lemmy.ml <> lemmy.world post sharing failure

There are two posts from the last 24 hours that are not showing up; sort by "New" to see the latest posts and compare side by side:

https://lemmy.ml/c/mods@lemmy.world/data_type/Post/sort/New/page/1
https://lemmy.world/c/mods/data_type/Post/sort/New/page/1 (Owner of community)

EDIT: 45 minutes later, they are both showing up. One of the posts was a day old; it took an entire day to copy over.

RocketDerp commented 1 year ago

Likely related to this issue: my server log is showing incoming activity failing with 'Header is Expired' HTTP errors. Lemmy's federation logic in 0.17.4 is very aggressive in using a short validity window, and it's entirely possible that clock differences between servers and/or retry timing are causing failures. See issue https://github.com/LemmyNet/activitypub-federation-rust/issues/46

The time window has been increased from 5 minutes to 24 hours in the pending 0.18 release, so maybe this problem will improve.

RocketDerp commented 1 year ago

GOOD NEWS: lemmy.ml just did an upgrade on their backend to 0.18 rc6, and my remote instance can now instantly subscribe to communities!

DomiStyle commented 1 year ago

Looks very promising so far, posts and comments from lemmy.ml are flowing in again.

RocketDerp commented 1 year ago

OK, for the past hour I've been subscribing from my remote instance to lemmy.ml communities I had never subscribed to before (at a rate of one community every 2 or 3 minutes), and I just now got my first stuck 'pending'. SPECULATION: the restart of the lemmy.ml server may have helped the problem for the first couple of hours after the upgrade, but it seems to be coming back.

RocketDerp commented 1 year ago

Fresh reports of missing comments and posts: https://lemmy.world/post/406956

I am reaching a point where I feel like the major site operators are not publicly acknowledging, on Lemmy or GitHub, the scaling problems that the Lemmy platform is having. Issue #3203 is happening again today for remote instances trying to reach lemmy.ml - and that is after the upgrade to 0.18.

Why am I seeing daily "nginx 500" errors (which clear on a quick browser refresh) on Beehaw, Lemmy.ml, and Lemmy.world, while none of these major operators have open issues posting their application error logs and nginx logs for us to get our eyes on? The whole community is in a scaling crisis.

Making a fool of myself: https://lemmy.ml/post/1453121

RocketDerp commented 1 year ago

REPEAT from over 30 hours ago: I am reaching a point where I feel like the major site operators are not publicly acknowledging, on Lemmy or GitHub, the scaling problems that the Lemmy platform is having.

The 0.18 RELEASE notes should have said that there are major problems with data not making it between instances. The front page of this GitHub repo says that Lemmy is "high performance", but in reality it is not scaling, and the person who carried out yesterday's DDoS, causing a multi-hour outage, shared the details of just how trivial it is to bring down 0.18.0.

jheidecker commented 1 year ago

REPEAT from over 30 hours ago: I am reaching a point where I feel like the major site operators are not publicly acknowledging, on Lemmy or GitHub, the scaling problems that the Lemmy platform is having.

The 0.18 RELEASE notes should have said that there are major problems with data not making it between instances. The front page of this GitHub repo says that Lemmy is "high performance", but in reality it is not scaling, and the person who carried out yesterday's DDoS, causing a multi-hour outage, shared the details of just how trivial it is to bring down 0.18.0.

Uh. Seriously. The elephant in the room right now. Pretty weird to think these admins are already building their little empires for fake internet points. I don't think the developers need to fix scalability problems if someone's instance running on a Raspberry Pi blows up metaphorically and then blows up literally. They should change the messaging around "high performance", though.

Nutomic commented 1 year ago

https://github.com/LemmyNet/activitypub-federation-rust/pull/52 will help with this.

artindk commented 1 year ago

Another example. My comment on https://lemmy.world/post/530448 is not visible on https://rblind.com/post/2240607.

RocketDerp commented 1 year ago

OK, so I have some code that crawls a community's posts and compares two servers for missing comments. It looks bad today. Both of these servers are on version 0.18.0 and were upgraded several days ago.

missing 0 unequal 0 11 on https://lemmy.ml/ vs. 11 on https://sh.itjust.works/
missing 35 unequal 1 48 on https://lemmy.ml/ vs. 14 on https://sh.itjust.works/
missing 4 unequal 0 9 on https://lemmy.ml/ vs. 5 on https://sh.itjust.works/
missing 6 unequal 0 9 on https://lemmy.ml/ vs. 3 on https://sh.itjust.works/
missing 1 unequal 0 1 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 6 unequal 0 12 on https://lemmy.ml/ vs. 6 on https://sh.itjust.works/
missing 3 unequal 0 8 on https://lemmy.ml/ vs. 5 on https://sh.itjust.works/
missing 3 unequal 0 6 on https://lemmy.ml/ vs. 4 on https://sh.itjust.works/
missing 22 unequal 0 42 on https://lemmy.ml/ vs. 20 on https://sh.itjust.works/
missing 5 unequal 0 15 on https://lemmy.ml/ vs. 10 on https://sh.itjust.works/
missing 8 unequal 2 17 on https://lemmy.ml/ vs. 9 on https://sh.itjust.works/
missing 3 unequal 0 3 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 0 unequal 0 10 on https://lemmy.ml/ vs. 10 on https://sh.itjust.works/
missing 11 unequal 0 24 on https://lemmy.ml/ vs. 13 on https://sh.itjust.works/
missing 1 unequal 0 2 on https://lemmy.ml/ vs. 1 on https://sh.itjust.works/
missing 13 unequal 0 37 on https://lemmy.ml/ vs. 24 on https://sh.itjust.works/
missing 3 unequal 0 7 on https://lemmy.ml/ vs. 4 on https://sh.itjust.works/
missing 0 unequal 0 10 on https://lemmy.ml/ vs. 10 on https://sh.itjust.works/
missing 60 unequal 2 186 on https://lemmy.ml/ vs. 126 on https://sh.itjust.works/
missing 10 unequal 2 51 on https://lemmy.ml/ vs. 41 on https://sh.itjust.works/
missing 16 unequal 0 51 on https://lemmy.ml/ vs. 36 on https://sh.itjust.works/
missing 31 unequal 3 128 on https://lemmy.ml/ vs. 97 on https://sh.itjust.works/
missing 0 unequal 0 4 on https://lemmy.ml/ vs. 4 on https://sh.itjust.works/
missing 2 unequal 0 5 on https://lemmy.ml/ vs. 3 on https://sh.itjust.works/
missing 15 unequal 1 67 on https://lemmy.ml/ vs. 52 on https://sh.itjust.works/
missing 4 unequal 0 53 on https://lemmy.ml/ vs. 49 on https://sh.itjust.works/
missing 0 unequal 0 5 on https://lemmy.ml/ vs. 5 on https://sh.itjust.works/
missing 0 unequal 0 0 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 1 unequal 0 19 on https://lemmy.ml/ vs. 18 on https://sh.itjust.works/
missing 0 unequal 0 2 on https://lemmy.ml/ vs. 2 on https://sh.itjust.works/
missing 0 unequal 0 22 on https://lemmy.ml/ vs. 22 on https://sh.itjust.works/
missing 0 unequal 0 16 on https://lemmy.ml/ vs. 18 on https://sh.itjust.works/
missing 0 unequal 0 7 on https://lemmy.ml/ vs. 7 on https://sh.itjust.works/
missing 3 unequal 0 27 on https://lemmy.ml/ vs. 24 on https://sh.itjust.works/
missing 2 unequal 0 32 on https://lemmy.ml/ vs. 30 on https://sh.itjust.works/
missing 3 unequal 0 21 on https://lemmy.ml/ vs. 18 on https://sh.itjust.works/
missing 3 unequal 1 16 on https://lemmy.ml/ vs. 13 on https://sh.itjust.works/
missing 3 unequal 1 47 on https://lemmy.ml/ vs. 44 on https://sh.itjust.works/
missing 1 unequal 0 24 on https://lemmy.ml/ vs. 23 on https://sh.itjust.works/

The comment counts are based on actually loading the comments, not on the count shown at the top of the post.

RocketDerp commented 1 year ago

missing 0 unequal 0 0 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 1 unequal 0 1 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 2 unequal 0 2 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 24 unequal 0 25 on https://lemmy.ml/ vs. 3 on https://sh.itjust.works/
missing 0 unequal 0 0 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 10 unequal 0 14 on https://lemmy.ml/ vs. 4 on https://sh.itjust.works/
missing 3 unequal 0 4 on https://lemmy.ml/ vs. 1 on https://sh.itjust.works/
missing 4 unequal 0 4 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 6 unequal 0 7 on https://lemmy.ml/ vs. 1 on https://sh.itjust.works/
missing 1 unequal 0 1 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 3 unequal 0 3 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 0 unequal 0 0 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 0 unequal 0 0 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 9 unequal 0 11 on https://lemmy.ml/ vs. 2 on https://sh.itjust.works/
missing 0 unequal 0 0 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 0 unequal 0 0 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 1 unequal 0 1 on https://lemmy.ml/ vs. 1 on https://sh.itjust.works/
missing 1 unequal 0 1 on https://lemmy.ml/ vs. 1 on https://sh.itjust.works/
missing 1 unequal 0 0 on https://lemmy.ml/ vs. 1 on https://sh.itjust.works/
missing 1 unequal 0 3 on https://lemmy.ml/ vs. 2 on https://sh.itjust.works/
missing 5 unequal 0 8 on https://lemmy.ml/ vs. 5 on https://sh.itjust.works/
missing 3 unequal 0 10 on https://lemmy.ml/ vs. 7 on https://sh.itjust.works/
missing 6 unequal 0 7 on https://lemmy.ml/ vs. 1 on https://sh.itjust.works/
missing 5 unequal 1 9 on https://lemmy.ml/ vs. 4 on https://sh.itjust.works/
missing 2 unequal 0 11 on https://lemmy.ml/ vs. 9 on https://sh.itjust.works/
missing 0 unequal 0 0 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 8 unequal 1 16 on https://lemmy.ml/ vs. 8 on https://sh.itjust.works/
missing 1 unequal 0 3 on https://lemmy.ml/ vs. 2 on https://sh.itjust.works/
missing 2 unequal 0 6 on https://lemmy.ml/ vs. 4 on https://sh.itjust.works/
missing 1 unequal 0 1 on https://lemmy.ml/ vs. 0 on https://sh.itjust.works/
missing 3 unequal 0 4 on https://lemmy.ml/ vs. 1 on https://sh.itjust.works/
missing 0 unequal 0 2 on https://lemmy.ml/ vs. 2 on https://sh.itjust.works/
missing 5 unequal 0 10 on https://lemmy.ml/ vs. 5 on https://sh.itjust.works/

Nutomic commented 1 year ago

The federation fix mentioned above is still not merged (https://github.com/LemmyNet/lemmy/pull/3379). It will be included in one of the next RCs, so you should hold off a bit on further testing. Anyway, pasting different comment counts is not helpful at all.

RocketDerp commented 1 year ago

These problems have to do with the PostgreSQL backend and timeouts, and also with the federation HTTP design and timeouts, because servers swarm each other with concurrent federation activity. It is not just #3379; the lack of database caching in the lemmy_server application is one of the fundamental causes.

sunaurus commented 1 year ago

I am seeing much improved federation on 0.18.1. After lemmy.world upgraded to 0.18.1 today, a large number of lemmy.world posts and comments are now visible on the lemm.ee front page - it's a huge improvement compared to when lemmy.world was on 0.17.4.

RocketDerp commented 1 year ago

a large number of lemmy.world posts and comments are now visible on the lemm.ee front page

I'm seeing more post/comment delivery from lemmy.world too. Great news! I'm seeing over 10 delivered comments per minute in the most recent 5-minute period: https://lemmyadmin.BulletinTree.com/query/comments_ap_id_host_prev?output=table&timeperiod=5
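For anyone without that webapp, roughly the same number can be pulled straight from the database. This is a sketch that assumes the comment table's ap_id and published columns from the 0.18 schema:

```sql
-- Sketch: comments received in the last 5 minutes, grouped by the host
-- portion of each comment's ap_id (i.e. the instance it originated from).
SELECT split_part(ap_id, '/', 3) AS origin_host,
       count(*)                  AS comments
FROM comment
WHERE published > now() - interval '5 minutes'
GROUP BY 1
ORDER BY 2 DESC;
```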

@sunaurus : lemmy.world upgraded to 0.18.1 today,

https://lemmy.world/post/920294 : "we created extra lemmy containers to spread the load. (And extra lemmy-ui containers). And used nginx to load balance between them."

Having to fire up several lemmy_server services against a single PostgreSQL database on the same underlying hardware implies that the Rust process isn't scaling, or is running into resource limits that aren't being logged in a clear manner.

sunaurus commented 1 year ago

lemmy.ml is not on 0.18.1 yet @anonCantCode

RocketDerp commented 1 year ago

Things are far better with the performance fixes installed on Lemmy.world and Lemmy.ml - I'm inclined to close this issue, since PostgreSQL is no longer constantly overloaded on any of the servers that have updated, and comments are replicating far better.

Dakkaron commented 1 year ago

The improved federation due to performance fixes is really good and important.

But is there any kind of retry mechanism in case syncing fails?

With growing numbers of users, Lemmy is bound to run into performance issues again.

It would be good to have some kind of eventual-consistency mechanism here.

sunaurus commented 1 year ago

There is a retry mechanism, but no guaranteed eventual consistency.

RocketDerp commented 1 year ago

@sunaurus - are the retries based on the timing of the original individual outgoing event, such as a comment? If so, is there some kind of consistency check before sending a comment to ensure the other peer has already received the post? I am wondering about a race condition where a post is in a retry sleep state while fresh comments on it try to deliver immediately. Thank you.

james2432 commented 1 year ago

https://lemmy.sdf.org/c/lotro@lemmy.ca - 0 posts
https://lemmy.ca/c/lotro - 7 posts

RocketDerp commented 1 year ago

https://lemmy.sdf.org/c/lotro@lemmy.ca - 0 posts
https://lemmy.ca/c/lotro - 7 posts

This isn't a replication problem; the posts are there if you sort by "New": https://lemmy.sdf.org/c/lotro@lemmy.ca?dataType=Post&page=1&sort=New

They have not updated their server to 0.18.1 final release, which fixes a problem with "Hot" and "Active" sorting.

airjer commented 1 year ago

Same issue here with my own instance on the latest release. Posts from over 24 hours ago still aren’t showing up on the alternate instance.

RocketDerp commented 1 year ago

I'm closing this issue, given that major performance improvements since back-end version 0.17.4 have largely fixed the "many" in the title. If anything, there is a slew of issues that need testing once 0.18.3 is released, to make sure moderator actions (removing a post, for example) are federated to all instances correctly. Federation is no longer regularly failing due to server overload the way it was throughout June 2023. A fresh issue can be opened if 0.18.3 proves to have problems in the field.

alesito85 commented 1 year ago

Any tips on what to do if this is happening to new instances?