HermanMartinus / bearblog

Free, no-nonsense, super fast blogging.
MIT License
2.63k stars 75 forks

Discovery feed algorithm needs work #147

Closed: noahtf13 closed this issue 2 years ago

noahtf13 commented 2 years ago

See https://miro.medium.com/max/800/0*21Ezm5SbYie_a3oD.png

Even just taking the log of the votes as the numerator of the current algorithm you list would help. Posts that blow up on the discover feed tend to "stick" for a week or more, but maybe that's the intent?

tsklxiv commented 2 years ago

I agree with this. Posts with tons of upvotes tend to stick for way too long on Trending, and they make the Trending tab bland and uninteresting.

HermanMartinus commented 2 years ago

I've been giving this a lot of thought recently. It's a pretty interesting problem with the following constraints:

  1. It has to run as a database query. Parsing the number of posts + their upvotes is way too taxing to do on the server, so the algorithm is constrained to what can be done as a DB query.
  2. It needs to show content which is "trending", which I interpret to mean "getting a decent amount of attention right now".
  3. Needs to prioritise newer content
  4. Needs to find a balance between new content and good content (judged by votes)

The algorithm lets posts stick to the top like this because the difference in upvote counts is so large that it completely overwhelms the gravity attribute. Gravity is great at keeping content fresh, assuming a decent distribution of upvotes. Unfortunately, as noted, if something blows up on HN it has orders of magnitude more upvotes than the other content, and that overwhelms the gravity.
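To illustrate the sticking effect, here is a minimal sketch. It assumes an HN-style formula (upvotes divided by an age penalty raised to a gravity exponent); the exact query used on the site is not shown in this thread, so the function name, the `+ 2` offset, and the gravity value of 1.1 are all assumptions for illustration.

```python
def hn_style_score(upvotes: int, age_hours: float, gravity: float = 1.1) -> float:
    """Assumed HN-style ranking: raw upvotes divided by an age penalty.

    The "+ 2" offset and the default gravity are illustrative guesses,
    not values confirmed by the thread.
    """
    return upvotes / (age_hours + 2) ** gravity


# A fresh post with a typical number of upvotes.
fresh = hn_style_score(upvotes=10, age_hours=2)

# A week-old post that blew up and has 100x the upvotes.
viral = hn_style_score(upvotes=1000, age_hours=7 * 24)

# With raw upvotes in the numerator, the week-old outlier still ranks higher.
print(f"fresh={fresh:.3f} viral={viral:.3f} viral_on_top={viral > fresh}")
```

Under these assumed numbers the week-old post still outranks the fresh one, which matches the behaviour described above.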

Here's one proposed solution: Have the Trending tab show only posts from the past 28 days arranged by number of upvotes, then randomly intersperse (say, every 3-6 posts) something that only has 1-5 upvotes. This checks the following boxes:

Let me know your thoughts and suggestions.

HermanMartinus commented 2 years ago

@noahtf13 I like the reddit-like algorithm, but there are currently no downvotes, and we unfortunately can't use "views" as downvotes as it would require some chunky servers to parse the "upvote-rate". Food for thought though.

HermanMartinus commented 2 years ago

I've given this some more thought, and I think I have a solution:

Modify the existing algorithm (which works really well assuming there aren't huge outliers) so that the number of upvotes has a logarithmic decay. Essentially, the first 10 upvotes have roughly the same value as the next 100, which have roughly the same value as the next 1000.

This means that something with 1000 upvotes isn't too much of an outlier and will behave correctly in the existing algorithm.
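A rough sketch of what the logarithmic damping does, again assuming an HN-style score with gravity as the base algorithm. The function names, the `+ 2` offset, the gravity value, and the `max(upvotes, 1)` guard are hypothetical and only here to make the example runnable.

```python
import math


def damped_votes(upvotes: int) -> float:
    """Log-damped upvotes; the max() guard (an assumption) avoids log10(0)."""
    return math.log10(max(upvotes, 1))


def damped_score(upvotes: int, age_hours: float, gravity: float = 1.1) -> float:
    """Assumed HN-style score with log-damped upvotes as the numerator."""
    return damped_votes(upvotes) / (age_hours + 2) ** gravity


# Each 10x jump in upvotes adds the same amount to the numerator.
for n in (10, 100, 1000):
    print(n, damped_votes(n))

# A fresh post with 10 upvotes versus a week-old post with 1000.
fresh = damped_score(upvotes=10, age_hours=2)
viral = damped_score(upvotes=1000, age_hours=7 * 24)
print(f"fresh_on_top={fresh > viral}")
```

With the damped numerator, the 1000-upvote post counts only three times as much as the 10-upvote post, so in this sketch the age penalty is enough to let the fresh post rank first.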

HermanMartinus commented 2 years ago

@noahtf13 so you were right: just logging the upvotes is a good way to handle this. It just took me a while (and a good chunk of unusable code) to agree 😅. I've pushed an update to test on production a bit before it goes live. You can see it here: https://bearblog.dev/discover/?test=true&gravity=1.1 (you can also adjust the gravity in the URL parameter, which I plan to make a feature for people who want to play around with it).

noahtf13 commented 2 years ago

Really appreciate the thought you put into this and other parts of the site, it shows!

HermanMartinus commented 2 years ago

I've just released a new algorithm for the discovery feed. I decided to switch it up and use a Reddit-like "time from Jan 1st 2020 to published date" instead of the HN-like "time since published". This allows me to compute a score for each post when it is upvoted, as opposed to computing a score for all articles on each discover page load. This has a similar effect while being more computationally friendly.

Score = log10(U) + S / (D * 86400)

Where,
U = Upvotes (toasts) of a post
S = Seconds since Jan 1st, 2020
D = Days modifier (currently at 7)

The D value specifies that content D days old needs to have 10 times as many upvotes as something published now in order to outrank it.
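As a worked check of that property, here is a small sketch of the score. It assumes the constant is the number of seconds in a day (86,400) and that D multiplies it in the denominator, which is what the "D days old, 10 times the upvotes" description implies. The function name, the UTC epoch, and the `max(upvotes, 1)` guard are assumptions, not the site's actual code.

```python
import math
from datetime import datetime, timedelta, timezone

# Assumed reference point: Jan 1st, 2020, taken here as UTC.
EPOCH = datetime(2020, 1, 1, tzinfo=timezone.utc)
SECONDS_PER_DAY = 86_400


def discover_score(upvotes: int, published: datetime, days_modifier: int = 7) -> float:
    """Score = log10(U) + S / (D * 86400), computed from the publish date."""
    seconds = (published - EPOCH).total_seconds()
    # max() is an added guard so a post with zero upvotes does not raise.
    return math.log10(max(upvotes, 1)) + seconds / (days_modifier * SECONDS_PER_DAY)


newer = datetime(2022, 6, 8, tzinfo=timezone.utc)
older = newer - timedelta(days=7)  # exactly D days earlier

# A post D days older needs 10x the upvotes to draw level with the newer one.
tie_gap = discover_score(100, older) - discover_score(10, newer)
print(f"gap at 10x upvotes: {tie_gap:.12f}")
```

Because the score depends only on the upvote count and the publish date, it only needs recomputing when a post receives an upvote, which is the computational saving described above.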

I'm going to do a longer writeup on all the stuff I've learnt about ranking algorithms over the past week, so subscribe to my blog if you're interested in reading it.

Give it a spin and let me know what you think. If you still prefer the old algorithm it's available at https://bearblog.dev/discover?old=true

noahtf13 commented 2 years ago

Amazing! And already read through RSS! 👍