The Filter Duplicate Boosts from local TL feature is filtering non-duplicate boosts as well

mannazsci commented 9 months ago

Steps to reproduce the problem

I boosted this post: https://scholar.social/@researchfairy/111840719521713968
It appeared on the local TL and then disappeared on refreshing even though there are no other boosts of the same post.
The boost appears on my profile though.
Similar case with this post: https://mastodon.social/@koen_hufkens/111840494469557835 that El had boosted.
I saw the same thing happen earlier with a post from mastodon engg account. We tested this with @sneakers-the-rat and it appeared on the local TL only after Jonny boosted it, I unboosted the post and reboosted. ...

Expected behaviour

Only duplicate boosts should be filtered

Actual behaviour

Non-duplicate boosts are being filtered

Detailed description

What's happening is that for some instances, even the first boost is being filtered out.

Relevant feature: https://wiki.neuromatch.social/Filter_Duplicate_Boosts I'm no query expert but something in the query is filtering non-duplicate boosts for certain posts (not all) from the local TL.

Mastodon instance

neuromatch.social

Mastodon version

v4.3.0-alpha.0+glitch

Technical details

If this is happening on your own Mastodon server, please fill out those:

Ruby version: ruby 3.0.6p216 (2023-03-30 revision 23a532679b) [x86_64-linux]
Node.js version: v16.20.2

sneakers-the-rat commented 9 months ago

Going to do some direct DB queries to debug this rn

sneakers-the-rat commented 9 months ago

PGHero gives us this explanation of a query:

Limit  (cost=0.86..120023.81 rows=20 width=16)
  ->  Nested Loop  (cost=0.86..115462078.42 rows=19240 width=16)
        ->  Index Scan Backward using statuses_pkey on statuses  (cost=0.43..115418337.80 rows=19293 width=24)
              Index Cond: (id < '111841145161125732'::bigint)
              Filter: ((local OR (uri IS NULL)) AND (deleted_at IS NULL) AND ((NOT reply) OR (in_reply_to_account_id = account_id)) AND (visibility = 0) AND ((reblog_of_id IS NULL) OR (id = (SubPlan 1))))
              SubPlan 1
                ->  Aggregate  (cost=7.15..7.16 rows=1 width=8)
                      ->  Index Scan using index_statuses_on_reblog_of_id_and_account_id on statuses s2  (cost=0.43..7.13 rows=8 width=8)
                            Index Cond: (reblog_of_id = statuses.reblog_of_id)
        ->  Index Scan using accounts_pkey on accounts  (cost=0.42..2.27 rows=1 width=8)
              Index Cond: (id = statuses.account_id)
              Filter: ((suspended_at IS NULL) AND (silenced_at IS NULL))
JIT:
  Functions: 20
  Options: Inlining false, Optimization false, Expressions true, Deforming true

So there is something going on here with not having a specific index for reblog_of_id where it's intersecting with the index_statuses_on_reblog_of_id_and_account_id index. So I think we do need that index after all

sneakers-the-rat commented 9 months ago

When I add the index, the EXPLAIN now reads:

Limit  (cost=20.40..20.40 rows=1 width=16)
  ->  Sort  (cost=20.40..20.40 rows=1 width=16)
        Sort Key: statuses.id DESC
        ->  Nested Loop  (cost=0.14..20.39 rows=1 width=16)
              Join Filter: (statuses.account_id = accounts.id)
              ->  Index Scan using index_statuses_20190820 on statuses  (cost=0.14..9.47 rows=1 width=24)
                    Index Cond: ((id < '111841145161125732'::bigint) AND (visibility = 0))
                    Filter: ((local OR (uri IS NULL)) AND ((NOT reply) OR (in_reply_to_account_id = account_id)) AND ((reblog_of_id IS NULL) OR (id = (SubPlan 1))))
                    SubPlan 1
                      ->  Aggregate  (cost=1.30..1.31 rows=1 width=8)
                            ->  Seq Scan on statuses s2  (cost=0.00..1.30 rows=1 width=8)
                                  Filter: (reblog_of_id = statuses.reblog_of_id)
              ->  Seq Scan on accounts  (cost=0.00..10.90 rows=1 width=8)
                    Filter: ((suspended_at IS NULL) AND (silenced_at IS NULL))

and from my reading it will do a Seq Scan when doing an index scan is too costly to start up (there are few records), but for some reason my data faking spec test is no longer persisting to the db, but i suspect this is the problem. I haven't yet figured out a satisfying explanation why that query would return false because theoretically you should be able to use the existing [reblog_of_id, account_id] index in a similar way to just having a [reblog_of_id] index, but maybe since the id column isn't in it, then it short circuits somewhere?

I can't replicate the bug on my dev instance, but since the bug is ongoing and is affecting multiple ppl, it might be worth deploying an update that just adds the index and see if the problem persists? sort of a janky way of testing in prod, but i don't have a ton of better ideas.

mannazsci commented 9 months ago

Thanks for looking into this! If adding the index turns out to be too costly, could we roll it back? If so, then I think it may be worth trying. If not, then I think this is a minor bug that doesn't need immediate fixing as the boost certainly appears on the person's profile and the home feed as well (if enabled - which is the default).

sneakers-the-rat commented 9 months ago

Yes we definitely could roll it back - we would just want to be mindful to keep our changes linear. Adding and removing an index is more undoable because we arent changing the shape of the data, but if we were like adding/removing a table or column or whatevs we would have to be a lot more careful if ppl were relying on it. I want to read more about it before trying it but from what I understand it should be possible to try. I still want to try some direct db queries to try debugging and I may do that now that im home

sneakers-the-rat commented 9 months ago

OH OBVIOUSLY IT IS BECAUSE PEOPLE THAT AREN'T ON OUR SERVER BOOSTING A POST ARE ALSO IN OUR DATABASE

SELECT "statuses"."id",
       "statuses"."updated_at",
       "statuses"."reblog_of_id",
       "statuses"."text",
       "accounts"."username"
FROM "statuses"
INNER JOIN "accounts" ON "accounts"."id" = "statuses"."account_id"
WHERE statuses.reblog_of_id = 111842966894960735 
ORDER BY "statuses"."id";

and the most recent boost is someone on a different instance. EZPZ

mannazsci commented 9 months ago

lol, amazing work figuring this out!!

sneakers-the-rat commented 9 months ago

closed by https://github.com/NeuromatchAcademy/mastodon/pull/38

NeuromatchAcademy / mastodon