LemmyNet / lemmy

🐀 A link aggregator and forum for the fediverse
https://join-lemmy.org
GNU Affero General Public License v3.0

Question: Collaboration on better comment ranking algorithms? #4924

Open fdietze opened 1 month ago

fdietze commented 1 month ago

Question

Hey everyone,

We're a small open research group currently working on new comment ranking algorithms for discussion trees. Our goal is to make identifying and debunking misinformation in discussions more effective. Technically, we analyze voting patterns using Bayesian statistics.

Is there interest in the lemmy community to collaborate on that goal?

Everything we do is open-source. An earlier project we worked on was a new ranking metric for the Hacker News frontpage: https://github.com/social-protocols/news#readme

Happy to answer any questions.

dessalines commented 1 month ago

Sure, seems interesting. I'm all for bettering or adding additional sorts.

fdietze commented 1 month ago

Great! The biggest challenge we're currently facing is evaluation with real-life scenarios. We already have simulations which confirm that our algorithms do what we expect them to do, but that's no guarantee that they'll work well in the wild.

I see two options to evaluate right now: analyzing exported voting data from an existing instance offline, or implementing the algorithm in this code base and testing it live.

The exact data the algorithm needs is a vote stream which contains: a (pseudonymized) voter id, the post or comment being voted on, the vote score, and the vote timestamp.

Any idea what's the most practical way forward?

dullbananas commented 1 month ago

If it doesn't give the highest rank to recent stuff only, then it should be added as a new sort type, maybe called "best" or "fair"

fdietze commented 1 month ago

> If it doesn't give the highest rank to recent stuff only, then it should be added as a new sort type, maybe called "best" or "fair"

Right. Our algorithm doesn't sort by date. The most descriptive name we've come up with so far is "convincing".

A bit more in depth: it empirically measures "convincingness" in voting patterns and bubbles up the most convincing comments for every parent. We define a convincing comment as one that measurably changes the voting behavior on its parent. The idea is to focus more attention on convincing comments and their convincing replies, recursively. Since misinformation and debunking information are usually both convincing, it should (in theory) help to debunk misinformation faster.

fdietze commented 1 month ago

I looked a bit into the lemmy database schema. I think the easiest first step is to analyze some existing voting data from comment trees. That's much easier than implementing the full algorithm in this code base. I prepared some queries which return exactly the data we need for a first analysis:

-- all posts
select id, name, url, body, published
from post;

-- all comments with their parent_id
select
    id,
    post_id,
    content,
    published,
    case
        when nlevel(path) = 2 then null
        else ltree2text(subpath(path, nlevel(path) - 2, 1))
    end as parent_id
from comment;

-- anonymized post likes, CHANGE THE SALT!
-- salt must be the same in all queries.
select sha224(('random salt 123' || person_id)::bytea) as person_id, post_id, score, published
from post_like;

-- anonymized comment likes, CHANGE THE SALT!
-- salt must be the same in all queries.
select sha224(('random salt 123' || person_id)::bytea) as person_id, comment_id, score, published
from comment_like;

Would it be ok to run those on a popular instance and publish the data (or hand it over in private)? I would then analyse it with our group and share insights here.

MV-GH commented 1 month ago

Technically all this data is publicly available through the REST API of any instance. So you can retrieve it that way too, but I would suggest asking permission from an admin.

fdietze commented 1 month ago

> Technically all this data is publicly available through the REST API of any instance. So you can retrieve it that way too, but I would suggest asking permission from an admin.

I just double checked the API docs again. As far as I understand, the API only offers vote aggregates, not individual votes per person. Am I overlooking something?

Nothing4You commented 1 month ago

The API does indeed only return individual votes to admins or community moderators, but you can see the number of upvotes and the number of downvotes. Note that typically (some exceptions apply) all posts and comments automatically receive an upvote by their creator.
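
To illustrate the creator-upvote caveat when working from public aggregates only, a small hypothetical helper (the `creator_upvoted` flag is an assumption, since some instances disable the automatic self-vote):

```python
def adjusted_counts(upvotes: int, downvotes: int,
                    creator_upvoted: bool = True) -> tuple[int, int]:
    # Strip the creator's automatic self-upvote from the public aggregate,
    # so the counts reflect only votes by other users.
    others_up = max(upvotes - 1, 0) if creator_upvoted else upvotes
    return (others_up, downvotes)
```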

Does it matter which user a vote is from for the result or is that just due to the way it's implemented without affecting the outcome? If it doesn't affect the outcome it could probably just be substituted with a random id for testing.

It should also be noted that the SQL queries above do not provide anonymization. They provide pseudonymization, and this is rather easily reversible given the association with post ids and comment ids. For the majority of posts and comments with just a single upvote this vote will be by the creator, so this is a decent way to de-pseudonymize many users that have posted or commented before.

There are also some other Fediverse applications out there that publicly display votes, which could further be used to almost (remote instances typically don't have 100% coverage of local content) fully de-pseudonymize the entire list of votes.

Considering that Lemmy does not make individual votes public, such a dataset, if provided, should likely not be publicly posted.

fdietze commented 1 month ago

> Note that typically (some exceptions apply) all posts and comments automatically receive an upvote by their creator.

Very important point! That allows users to be deanonymized.

> Does it matter which user a vote is from for the result or is that just due to the way it's implemented without affecting the outcome? If it doesn't affect the outcome it could probably just be substituted with a random id for testing.

Individual votes matter for the calculation. Imagine a comment A and a replying comment B. We statistically measure if users who upvoted B voted differently on A. That can't be calculated from the counts alone.

But in fact, the author's vote doesn't matter here. It can just be left out of the data.
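
A rough sketch of that measurement (a simplified illustration with made-up helper names, not our actual model; the real version uses Bayesian estimation rather than plain means):

```python
from collections import defaultdict

def informed_vote_shift(votes, comment_a, comment_b):
    """Compare votes on A between users who upvoted the reply B and everyone else.

    votes: iterable of (person_id, comment_id, score) rows with score in {-1, 1}.
    Returns (mean score on A among B-upvoters, mean score on A among the rest).
    """
    by_comment = defaultdict(dict)
    for person, comment, score in votes:
        by_comment[comment][person] = score
    b_upvoters = {p for p, s in by_comment[comment_b].items() if s == 1}
    informed, uninformed = [], []
    for person, score in by_comment[comment_a].items():
        (informed if person in b_upvoters else uninformed).append(score)
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(informed), mean(uninformed)
```

A large gap between the two means would indicate that B is "convincing" with respect to A.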

The calculations are per discussion tree, so we should also include the post_id in the hash.
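
Including the post_id in the hash could look like this (a Python sketch mirroring the Postgres `sha224` call from the queries):

```python
import hashlib

def tree_pseudonym(salt: str, post_id: int, person_id: int) -> str:
    """Per-discussion-tree pseudonym: the same person gets a different
    identifier in each post's tree, so their votes can't be linked
    across discussions."""
    return hashlib.sha224(f"{salt}{post_id}{person_id}".encode()).hexdigest()
```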

With those changes applied to the queries, would there still be room for deanonymization?

Nothing4You commented 1 month ago

-- get all posts
select post.id, post.name, post.url, post.body, post.published
from post
join community on post.community_id = community.id
where post.deleted = false
and post.removed = false
and community.removed = false
and community.deleted = false
;

-- get all comments
select
    comment.id,
    comment.post_id,
    comment.content,
    comment.published,
    case
        when nlevel(comment.path) = 2 then null
        else ltree2text(subpath(comment.path, nlevel(comment.path) - 2, 1))
    end as parent_id
from comment
join post on post.id = comment.post_id
join community on post.community_id = community.id
where comment.deleted = false
and comment.removed = false
and post.deleted = false
and post.removed = false
and community.removed = false
and community.deleted = false
;

-- pseudonymized post likes, CHANGE THE SALT!
-- salt must be the same in all queries.
select
    sha224(('random salt 123' || person_id)::bytea) as person_id,
    post_like.post_id,
    post_like.score,
    post_like.published
from post_like
join post on post.id = post_like.post_id
join community on post.community_id = community.id
where post_like.person_id != post.creator_id
and post_like.score != 0
and post.deleted = false
and post.removed = false
and community.removed = false
and community.deleted = false
;

-- pseudonymized comment likes, CHANGE THE SALT!
-- salt must be the same in all queries.
select
    sha224(('random salt 123' || person_id)::bytea) as person_id,
    comment_like.comment_id,
    comment_like.score,
    comment_like.published
from comment_like
join comment on comment.id = comment_like.comment_id
join post on post.id = comment.post_id
join community on post.community_id = community.id
where comment_like.person_id != comment.creator_id
and comment_like.score != 0
and comment.deleted = false
and comment.removed = false
and post.deleted = false
and post.removed = false
and community.removed = false
and community.deleted = false
;

these would probably be better queries then. the queries above can produce holes if parent comments have been deleted or removed.

in the end, realistically, vote data (at least in non-local-only communities) should be considered more or less public, even though it's not always easily accessible to everyone without some additional work. as i mentioned before, some other fediverse software makes votes publicly visible to everyone. in addition to that, even if that wasn't the case, all you'd need to do would be running a fediverse instance on your own to passively listen for other instances to send this information to you over time.

if you can find content that only has 2 upvotes (1 if you exclude the creator), and that content is also visible on another fediverse instance with public votes, you can see which person this is, and you can use this dataset to match all other votes that this person did. I don't know whether any software currently provides an overview of all votes by a user.

there isn't really a clear line here imo, so instance admins will have to consider what or if they can share this data, but with a bit of time investment all this information can already be automatically and mostly passively collected (for new content). it would probably still reduce the risk of abuse for e.g. harassment if this was not publicly published as a full dataset for everyone to just download.

fdietze commented 1 month ago

Thank you for revising the queries! They look much better now.

Any recommendations for which instance admins I should approach?

Nothing4You commented 1 month ago

@dessalines could provide some information about lemmy.ml, which is probably the oldest instance around, although not necessarily the most complete one. due to bugs, limited federation, age or lack of community subscriptions there won't be any single instance that has all data.

lemmy.world as the largest instance is probably worth reaching out to as well. hexbear.net seems to have the most posts and comments, and it also exists for a lot longer than most others. sh.itjust.works and lemm.ee are likely also good options with instances that have a lot of data.

https://lemmyverse.net/?order=posts and https://lemmyverse.net/?order=comments might be useful.

you can ignore lemmit.online, that's a reddit repost instance. there are a few more that may be mostly or only mirroring reddit content as well.

also keep in mind that post/comment/user ids are local to an instance, so you can't merge data from multiple instances this way. if data should be mergeable it'd require using the ap_ids of posts/comments and the actor_id of users, plus sharing the salt across those instances.
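
Merging exports keyed on the ActivityPub ids could then be as simple as (a sketch, assuming all instances exported with the same salt):

```python
def merge_vote_streams(*streams):
    """Merge vote exports from multiple instances.

    Each stream is an iterable of (actor_hash, ap_id, score, published) rows.
    Since ap_ids are globally unique and the salt is shared, the same vote
    seen by several instances collapses to a single entry.
    """
    merged = {}
    for stream in streams:
        for actor_hash, ap_id, score, published in stream:
            merged[(actor_hash, ap_id)] = (score, published)
    return merged
```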

fdietze commented 1 month ago

Do I understand correctly that the ap_id/actor_id can be used (instead of the id) as a global id, so that datasets from multiple instances can be trivially merged? (assuming the same salt)

With that understanding, I revised the queries another time:

@Nothing4You if you approve, I'll start approaching instance admins. Thank you for your help!

-- get all posts
select ap_id, post.name, post.url, post.body, post.published
from post
join community on post.community_id = community.id
where post.deleted = false
and post.removed = false
and community.removed = false
and community.deleted = false
;

-- get all comments
select
    comment.ap_id,
    post.ap_id as post_ap_id,
    comment.content,
    comment.published,
    parent_comment.ap_id as parent_ap_id
from comment
join post on post.id = comment.post_id
join community on post.community_id = community.id
left join comment as parent_comment on parent_comment.id = (
    case
        when nlevel(comment.path) = 2 then null
        else ltree2text(subpath(comment.path, nlevel(comment.path) - 2, 1))
    end)::int
where comment.deleted = false
and comment.removed = false
and post.deleted = false
and post.removed = false
and community.removed = false
and community.deleted = false
;

-- pseudonymized post likes, CHANGE THE SALT!
-- salt must be the same in all queries.
select
    sha224(('random salt 123' || person.actor_id)::bytea) as actor_id,
    post.ap_id as post_ap_id,
    post_like.score,
    post_like.published
from post_like
join post on post.id = post_like.post_id
join community on post.community_id = community.id
join person on person.id = post_like.person_id
where post_like.person_id != post.creator_id
and post_like.score != 0
and post.deleted = false
and post.removed = false
and community.removed = false
and community.deleted = false
;

-- pseudonymized comment likes, CHANGE THE SALT!
-- salt must be the same in all queries.
select
    sha224(('random salt 123' || person.actor_id)::bytea) as actor_id,
    comment.ap_id as comment_ap_id,
    comment_like.score,
    comment_like.published
from comment_like
join comment on comment.id = comment_like.comment_id
join post on post.id = comment.post_id
join community on post.community_id = community.id
join person on person.id = comment_like.person_id
where comment_like.person_id != comment.creator_id
and comment_like.score != 0
and comment.deleted = false
and comment.removed = false
and post.deleted = false
and post.removed = false
and community.removed = false
and community.deleted = false
;

On a local instance, it produces these results:

+--------------------------+---------+--------+--------+-------------------------------+
| ap_id                    | name    | url    | body   | published                     |
|--------------------------+---------+--------+--------+-------------------------------|
| https://localhost/post/1 | my post | <null> | <null> | 2024-08-03 12:30:16.64602+00  |
| https://localhost/post/2 | howdy   | <null> | <null> | 2024-08-03 12:42:57.863328+00 |
+--------------------------+---------+--------+--------+-------------------------------+
SELECT 2
+-----------------------------+--------------------------+----------------+-------------------------------+-----------------------------+
| ap_id                       | post_ap_id               | content        | published                     | parent_ap_id                |
|-----------------------------+--------------------------+----------------+-------------------------------+-----------------------------|
| https://localhost/comment/2 | https://localhost/post/1 | deeper comment | 2024-08-03 12:30:34.598235+00 | https://localhost/comment/1 |
| https://localhost/comment/3 | https://localhost/post/1 | even deeper    | 2024-08-03 12:33:17.941832+00 | https://localhost/comment/2 |
| https://localhost/comment/5 | https://localhost/post/2 | haha           | 2024-08-03 12:46:34.498662+00 | https://localhost/comment/4 |
+-----------------------------+--------------------------+----------------+-------------------------------+-----------------------------+
SELECT 3
+------------------------------------------------------------+--------------------------+-------+-------------------------------+
| actor_id                                                   | post_ap_id               | score | published                     |
|------------------------------------------------------------+--------------------------+-------+-------------------------------|
| \x0ffa5c0c182f901fec5a5114635b9673ac72aaccaf4e2d307d60c12e | https://localhost/post/1 | -1    | 2024-08-07 11:32:43.342445+00 |
| \x0ffa5c0c182f901fec5a5114635b9673ac72aaccaf4e2d307d60c12e | https://localhost/post/2 | 1     | 2024-08-07 11:32:41.581544+00 |
+------------------------------------------------------------+--------------------------+-------+-------------------------------+
SELECT 2
+------------------------------------------------------------+-----------------------------+-------+-------------------------------+
| actor_id                                                   | comment_ap_id               | score | published                     |
|------------------------------------------------------------+-----------------------------+-------+-------------------------------|
| \x0ffa5c0c182f901fec5a5114635b9673ac72aaccaf4e2d307d60c12e | https://localhost/comment/2 | 1     | 2024-08-07 11:37:18.5379+00   |
| \x0ffa5c0c182f901fec5a5114635b9673ac72aaccaf4e2d307d60c12e | https://localhost/comment/2 | -1    | 2024-08-07 11:37:19.740881+00 |
+------------------------------------------------------------+-----------------------------+-------+-------------------------------+
SELECT 2

Nothing4You commented 1 month ago

> Do I understand correctly that the ap_id/actor_id can be used (instead of the id) as a global id, so that datasets from multiple instances can be trivially merged? (assuming the same salt)

this is correct. those ids will typically be globally unique identifiers. there can be exceptions if an instance is torn down and recreated without adjusting the auto increment counters, which can lead to reuse of identifiers. similarly, a user account could have been purged at some point and its name registered again afterwards. the combination of the ap_id/actor_id and the published field of the same object should be good enough to detect accidental false positives, assuming nobody intentionally reused ids and backdated them. i don't have any idea how common such cases of reuse are, but i've seen them occasionally in the past when people were resetting their instance following some issues they were seeing before.
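
That (ap_id, published) check could be sketched as (a hypothetical helper):

```python
def detect_id_reuse(rows):
    """Flag ap_ids that appear with different published timestamps,
    which suggests the identifier was reused, e.g. after an instance reset."""
    first_seen = {}
    reused = set()
    for ap_id, published in rows:
        if ap_id in first_seen and first_seen[ap_id] != published:
            reused.add(ap_id)
        first_seen.setdefault(ap_id, published)
    return reused
```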