SteamDatabase / steamdb.info-issues

🚱 Issue tracker for the SteamDB website
https://steamdb.info
The Unlicense

Change in SteamDB rating calculation #793

Open hubertsng opened 1 year ago

hubertsng commented 1 year ago

Feature Description

I've used SteamDB for quite a while now; for the most part it's the best source of info for everything. Because of the wealth of trust I have in this site, I kind of blindly trusted the SteamDB rating calculation until I read how it was calculated.

I don't understand how it is a good way to create a more accurate score than Steam's. I understand Steam's raw value may have issues, most noticeably with low-review-count games, but you can filter games out by review count. Either way, I understand the purpose of the change and the logic for why a lower confidence limit for a binomial distribution would be utilized.

I'm going to use some basic statistical jargon without explaining it, so be warned. There are two big issues I have with the current scoring method. Before I get into them, one small issue I feel is important: the SteamDB rating is presented as directly comparable to the Steam raw rating. One may, as I did, see the lower rating on SteamDB and assume that it is a better approximation of the true score rather than a lower bound.

The big issues are:

  1. Switching away from Wilson's scoring method to something that is not statistically backed and was chosen just for its easy comprehension. Pardon my lack of formatting skills, but to break it down: the new formula is basically the Wilson (or, simpler, the Wald) approach, but with a change to how the population size impacts the standard deviation. The current formula's deduction is

     (Review Score - 0.5) * 2^(-log10(Total Reviews + 1))

     versus the Wald standard error

     sqrt(Review Score * (1 - Review Score) / Total Reviews)

     Cleaning this up to get to the core components, toss out the sqrt. The "+ 1" on Total Reviews is there so that a single review does not make the log term 0 and create problems. The division by Total Reviews in the Wald plays the same role as the 2^(-log10(Total Reviews + 1)) factor in the current formula, and the numerators are quite similar: Review Score * (1 - Review Score) versus (Review Score - 0.5). Multiply the Wald numerator out and restore the sqrt and you get sqrt(Review Score - Review Score^2). There are ways to approximate square roots of non-perfect squares, but regardless, the two end up similar: with Review Score always below 1.0, the subtracted Review Score^2 serves the same idea as the 0.5. Basic rearranging aside, the big change is the addition of a null hypothesis that games average 0.50, a change from the Wald, which just draws a confidence interval around the single point estimate. To properly compare two games you would then need a separate statistical test between them, print whichever is statistically significantly higher (or, if they are the same, default to the one with more reviews). But that is a lot of work, gets messy, and I have no idea how to even do that for 1000 games, so that method is not practical.

     If the Wald SE were to be used, there are three ways to improve it (see the sketch after this list for the formulas in code):

     - Convert the current formula to the Wilson SE but apply different transformations to the data. You can transform it to have a different penalty factor (denominator) and numerator while keeping the basis similar. You could also use a Poisson distribution approximation for larger review scores, which may work better.
     - Change the bounds of the confidence interval. The Wilson interval at p = 0.05 can be adjusted higher to give higher scores to games with many reviews, while low-review games with highly variable confidence intervals will still sit below the less variable ones.
     - Normalize the scores to 0.5 (or whatever number). Instead of subtracting a value, create a new probability score on the scale [0.5, 1] and multiply it by 2. That lowers the SE for high-review games and raises it for low-review games; with the value Review Score - 0.5, you are moving high-review games closer to 0.5, where the Wilson standard deviation is naturally at its highest.

  2. I would prefer to ditch the whole confidence-interval-based SteamDB rating entirely. As I said above, it is a misrepresentation of what the score actually means and makes it seem far more comparable to Steam's raw rating than anything else. It also brings in complications with deciding the proper penalizing factor and numerator, leading to a statistically questionable formula for calculating the SE.

     Move away from the lower bound of a confidence interval and instead think of ranking by the p-value of a comparison against the null (currently 0.5). This makes the scoring statistically reasonable and easy to sort. Hide the p-values, as they serve no purpose to the end user, but the list will show games that are more likely, on average, to be better than those below them.

     I want the null to change from 0.5 to either a user-inputted value, 0.7, or 0.8. User input would, I think, be tremendously better. A null of 0.5 is mainly useful for people who want to know whether a game is better than a 50%-rated game, and no one cares about that; those sorting the list would find it most useful to compare against the baseline of a game they would actually play, say 0.7 (mostly positive), 0.8 (very positive), or 0.95 (overwhelmingly positive). I care about my time more than money, so I would be happy with a baseline of 90-95%, while someone happy to dig for games or try things out can input 70%. The input would also change which games someone sees: niche, indie, etc. games should show up higher on the list depending on the user input, which I think is a fantastic side effect.

     This change could also raise the score of some games depending on the null value you input. A game with far more reviews sitting a tad below the threshold could rank higher than one above the threshold but with a far smaller review count; on average you would actually prefer the one with more reviews, as the distribution shows its score is more likely to be in the [0.8, 1.0] range than the game with fewer reviews. This is fairly similar to the current formula, where the SE lowers the raw rating of high-review games less than that of low-review games, but I think it is a better and far more elegant implementation.
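To make the pieces above concrete, here is a rough JavaScript sketch of the three quantities involved: the current SteamDB formula, the Wilson lower bound, and a one-sided z-statistic against a configurable null. It borrows the `game.positiveVotes` / `game.votes` field names from the custom-formula snippets later in this thread; the rest is illustrative, not a finished proposal.

```js
// Illustrative only; assumes game.votes > 0.
const p = game.positiveVotes / game.votes; // raw review score
const n = game.votes;

// Current SteamDB formula: pull p toward 0.5, less so as reviews accumulate.
const steamdb = p - (p - 0.5) * Math.pow(2, -Math.log10(n + 1));

// Wilson score lower bound (z = 1.96 for a 95% two-sided interval).
const z = 1.96;
const wilson = (p + (z * z) / (2 * n)
  - z * Math.sqrt((p * (1 - p) + (z * z) / (4 * n)) / n))
  / (1 + (z * z) / n);

// One-sided z-statistic against a user-chosen null p0: rank by this instead of
// a lower bound. Larger values mean stronger evidence the true score exceeds p0.
const p0 = 0.8; // hypothetical user-selected baseline ("very positive")
return (p - p0) / Math.sqrt((p0 * (1 - p0)) / n);
```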

The alternatives I proposed of just adjusting the current formula, rather than changing entirely to a p-value-based scoring system, are not thoroughly thought out and may not actually work.

I may just be tapping into my obsessively particular personality here, since if it works, you may not want to change it at all. I think the interpretation of the scores is questionable for users of the site who do not look at how the score is calculated. The deviation from something utilized in the statistical field to an unvalidated formula is something that bothers me as well. The p-value method I am proposing would, I feel, lead to more accurate and helpful results. I can run a test with a random sample of games that the members of the SteamDB staff can provide so I can test my hypothesis (otherwise I have to learn how to tap into an API and do web scraping again in R, barf).

woctezuma commented 1 year ago

Thanks for your post!

I'm just chiming in to remind xPaw that my latest suggestion was to use the Bayesian average. The formula is simple, the prior can be computed from the dataset (giving a "baseline" of about 0.7, if I remember correctly), and the results would be similar to the current ones. Apart from Wikipedia, here is a nicely-written blog post about it.

In the past, I wrote some code to apply this formula to the Steam and Epic Games stores (the latter is no longer possible, as ratingCount is hidden).

Feel free to share your thoughts about this other possible change.

hubertsng commented 1 year ago

> Thanks for your post!
>
> I'm just chiming in to remind xPaw that my latest suggestion was to use the Bayesian average. The formula is simple, the prior can be computed from the dataset (giving a "baseline" of about 0.7, if I remember correctly), and the results would be similar to the current ones. Apart from Wikipedia, here is a nicely-written blog post about it.
>
> In the past, I wrote some code to apply this formula to the Steam and Epic Games stores (the latter is no longer possible, as ratingCount is hidden).
>
> Feel free to share your thoughts about this other possible change.

I just finished writing the post, only to revisit it later and think to myself: what the hell was I typing? I don't have any experience with the Bayesian average, since I work in the clinical field, where it's mostly testing superiority, inferiority, and plain statistical significance. That background weakens my ability to know exactly what I am proposing, as the number of analyses I run comparing sample data against an assumed null is sparse. Protocols for comparing sample results and modeling are something I am far more familiar with. Something I do use in the professional field, and which leads me to another suggestion, is bootstrapping. It would keep the review score comparable to the Steam raw score and increase interpretability. I haven't thought that one through, so I can't speak for its efficacy; I'm just tapping into my knowledge of statistical methods that are actually accepted (rough sketch below).
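For what it's worth, a minimal sketch of that bootstrap in JavaScript (to match the language used later in this thread): resample a game's votes with replacement and take percentile bounds on the resampled score. With only positive/total counts this reduces to simulating a binomial, so it is purely illustrative.

```js
// Hypothetical percentile bootstrap of a game's review score from its vote counts.
function bootstrapInterval(positiveVotes, votes, iterations = 10000, alpha = 0.05) {
  const p = positiveVotes / votes;
  const scores = [];
  for (let i = 0; i < iterations; i++) {
    // Resample `votes` reviews with replacement; each is positive with probability p.
    let positives = 0;
    for (let j = 0; j < votes; j++) {
      if (Math.random() < p) positives++;
    }
    scores.push(positives / votes);
  }
  scores.sort((a, b) => a - b);
  return [
    scores[Math.floor((alpha / 2) * iterations)],         // lower percentile (e.g. 2.5th)
    scores[Math.floor((1 - alpha / 2) * iterations) - 1], // upper percentile (e.g. 97.5th)
  ];
}
```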

The fact that you are suggesting something Bayesian makes me want to use it, since Bayesian is great haha. It may cause low-review games to be bumped up far too high, though, since a game with a 100% review score from 50 reviews could potentially rank higher than high-review games that are overwhelmingly positive or near the threshold.

I personally don't want to claim anything is an improvement until a test run is made with a random sample provided by SteamDB. I'm really hoping that they care enough about this to want to change it, or they may go the route of "if it ain't broke, don't fix it", as the current system is an improvement over Steam's raw score and the previous Wilson lower bound.

woctezuma commented 1 year ago

I appreciate your input, and will make sure to read the update in your first post (and see what I can understand).

> if it ain't broke, don't fix it, as the current system is an improvement over Steam's raw score

I think that would be their point of view, unless there are concrete examples of the superiority of the new method over the current method (Torn's formula). Plus, there is a risk of a backlash from some devs or players if some of their favorite games lose ranks.

hubertsng commented 1 year ago

> Plus, there is a risk of a backlash from some devs or players if some of their favorite games lose ranks.

Well, that's on them for making a bad game :). Not actually, though; personally, my opinion is: who cares, statistical accuracy is what's important. If your game drops, that's just because it didn't deserve the spot in the first place. A bit harsh, but it's the truth if the best statistical approach is employed.

> I think that would be their point of view, unless there are concrete examples of the superiority of the new method over the current method (Torn's formula).

That's why I'm hoping I will be given a random sample to perform a test run on and compare against the current formula. The results could then be sent to the SteamDB team, and though it's very subjective, I hope enough people would look at them and see which one they like. Single-blinded, of course, right guys? And not just my hypothesis; I can run a Bayesian average and any other formula changes that people suggest. The 50% baseline is by far the easiest thing to adjust, I would assume, since having that as the null is not representative AT ALL of either how people rate games or the purpose of the ranking calculation. 70% is probably considered average rather than 50%, based on the American school system treating around 70% as the cutoff for failing. You can see this when asking people to grade games: I rarely ever see anyone call an average game a 5/10. It's kind of inherently driven into our brains that 7/10 is average, even though it's piss poor. You are shortening a range of 10 numbers, of which 5 are ones people actually care about, down to effectively 3. You can add 0.5's, but you're still decreasing accuracy for no good reason. With 1-5 star scores, you do see people treating 3 stars as average more than they do on a 1-10 scale.

woctezuma commented 1 year ago

> I can run a test with a random sample of games that the members of the SteamDB staff can provide

For info, you could probably get all of the necessary data by printing the whole table at /stats/gameratings/, with ?all.

  1. Make sure to sign in and click this button first.
  2. Change the number of games.

I suggest using a browser other than Firefox, as I think the table makes that browser very slow.

TornOne commented 8 months ago

One of the prerequisites for using one of the many statistical methods that guess the "real" distribution or some confidence interval based on a sample is that the sample is indicative of the thing you're trying to sample - that it's not biased. Therefore, it's important to take a step back and understand what the thing that we're trying to sample is, and what the sample is.

I believe the question we are trying to answer is "If a random Steam user played this game, what is the chance that they would like it?" (Because if you've already played it, you probably don't need anyone to tell you whether you'd like it or not.)
However, the sample we have is only of people who have A) bought and played the game, and B) decided they want to leave a review.

While I don't know how B would influence the ratings (Are people who decide to write reviews usually more positive or negative of the game than the average person?), it is quite obvious that people in group A are almost always more positive about the game than the overall populace. If you think you will not like the game, you will probably not buy it, and you will not be able to review it. Therefore, the sample of people who post reviews are much more positive about the game than the general populace, and probably more so the less popular a game is.

This is a heavy case of self-selection bias. Perhaps there are methods to somehow estimate the amount of bias and remove or lessen it, making for a more statistically accurate formula rather than whatever random stuff I came up with, but I am not aware of any such methods that work in this scenario.
I would be really interested if you find a better solution, but I don't believe simply taking the usual established statistical formulas that assume an unbiased sample works for us here, no matter how much we tweak the parameters.

If you want an environment to experiment with different formulas, I've a website where you can do that. Just find the "Advanced" tab from the sidebar, add a new formula, edit it to what you want, and switch to it. I'd love to know if you find something that works better on all games, not just the top / bottom / middle of the rankings.

hubertsng commented 8 months ago

> One of the prerequisites for using one of the many statistical methods that guess the "real" distribution or some confidence interval based on a sample is that the sample is indicative of the thing you're trying to sample - that it's not biased. Therefore, it's important to take a step back and understand what the thing that we're trying to sample is, and what the sample is.
>
> I believe the question we are trying to answer is "If a random Steam user played this game, what is the chance that they would like it?" (Because if you've already played it, you probably don't need anyone to tell you whether you'd like it or not.) However, the sample we have is only of people who have A) bought and played the game, and B) decided they want to leave a review.
>
> While I don't know how B would influence the ratings (Are people who decide to write reviews usually more positive or negative of the game than the average person?), it is quite obvious that people in group A are almost always more positive about the game than the overall populace. If you think you will not like the game, you will probably not buy it, and you will not be able to review it. Therefore, the sample of people who post reviews are much more positive about the game than the general populace, and probably more so the less popular a game is.
>
> This is a heavy case of self-selection bias. Perhaps there are methods to somehow estimate the amount of bias and remove or lessen it, making for a more statistically accurate formula rather than whatever random stuff I came up with, but I am not aware of any such methods that work in this scenario. I would be really interested if you find a better solution, but I don't believe simply taking the usual established statistical formulas that assume an unbiased sample works for us here, no matter how much we tweak the parameters.
>
> If you want an environment to experiment with different formulas, I've a website where you can do that. Just find the "Advanced" tab from the sidebar, add a new formula, edit it to what you want, and switch to it. I'd love to know if you find something that works better on all games, not just the top / bottom / middle of the rankings.

I messed around on the site a little bit, but I can't say I got very far. I'm not 100% sure what programming language it is, or how best to use if, log, and other statements in it. It's hard to get around the games with only positive ratings and low review counts without those statements. (Something like the sketch below is what I was aiming for.)
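For reference, assuming the formulas are plain JavaScript (as TornOne confirms below), a hypothetical guard for that case might look like this; the threshold and fallback values are made up:

```js
// Hypothetical custom formula: damp perfect scores from games with very few reviews.
const p = game.positiveVotes / game.votes;
if (game.votes < 50 && p === 1) {
  // Too few reviews to trust a perfect score; fall back to the damped value
  // the current SteamDB formula would give a 100%-rated game.
  return 1 - 0.5 * Math.pow(2, -Math.log10(game.votes + 1));
}
return p;
```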

I'm a little lost on what you are referring to with A and B. My opinion on Wald is that I don't see a reason to have moved away from the established statistical formulas. I'm not sure where the new formula comes from, but you can't just create a new formula because the formula you were using had its assumptions violated. Generally, for statistical papers, that's just something you write up as a shortcoming of the statistical methods: that the assumption was violated. Of course, in those papers you try to get around it, but I don't see a way to do so here.

For statistical research that has used sampling, the main way to deal with biased results on the statistical side (there's also the study design side, but there's no way for us to alter that) is to perform inverse probability weighting. That would use census data and factor in non-response, both things we are unable to do. After weighting the data set, you would then apply the normal statistical formulas. That's how I'm familiar with it being done; a toy example follows.
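As an illustration of the mechanics only (the strata and selection probabilities below are entirely invented; real ones would need the census-style and non-response data we don't have):

```js
// Toy inverse probability weighting over hypothetical reviewer strata.
const strata = [
  { positive: 900, total: 1000, selectionProb: 0.8 }, // e.g. fans who usually review
  { positive: 40,  total: 100,  selectionProb: 0.1 }, // e.g. buyers who rarely review
];

let weightedPositive = 0;
let weightedTotal = 0;
for (const s of strata) {
  const w = 1 / s.selectionProb; // underrepresented strata get larger weights
  weightedPositive += w * s.positive;
  weightedTotal += w * s.total;
}
const adjustedScore = weightedPositive / weightedTotal; // bias-adjusted review score
```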

I do think having a log of the total votes is an elegant way to push down more niche games, but I'm still more of a fan of using normal statistical methods and making slight adjustments to them rather than creating a new formula. I am personally a fan of using a penalization factor for niche games: something that lessens up to some review count x and then disappears. In math terms, the limit of that penalization factor should be 0 when approaching x from both the left and the right, and it should be 0 for anything greater than x. Or graphically: a continuous curve that converges to 0 at review count x, whichever makes more sense. A sketch of one such penalty is below.
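A hedged sketch of one such penalty, with made-up numbers: it decays continuously to 0 at the review-count threshold x and stays 0 beyond it.

```js
// Hypothetical penalization factor: continuous, 0 at n = x, and 0 for all n > x.
const x = 500; // illustrative review-count threshold
function penalty(n) {
  if (n >= x) return 0;
  return 0.1 * (1 - n / x); // linear decay from 0.1 at n = 0 down to 0 at n = x
}
const p = game.positiveVotes / game.votes;
return Math.max(0, p - penalty(game.votes));
```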

The issue I'd say would come up with such a small discussion is that it would just bring out our own personal biases. You could say a game should have a true review score of 98% and gravitate toward a formula that does that, while I say 96% and gravitate toward a formula that does that. I think a better way to go about this isn't messing around on the site, but rather creating a statistical plan, covering our bases before running anything, and grabbing a large number of voices for input on the end results.

TornOne commented 8 months ago

> I messed around on the site a little bit, but I can't say I got very far. I'm not 100% sure what programming language it is

It's just plain JavaScript.

> I'm a little lost on what you are referring to with A and B.

For each game, we want a representative sample from the group that is all Steam users. (We can call this group T.)
We don't have that.
We have group A, which is a subset of T, and only has people who thought they would like the game. (This is where most of the bias comes from.)
We also have group B, which is our sample, and a subset of A. This only has people who also wanted to leave a review. (This also introduces a bias, but I'd assume it's much less severe.)

> The issue I'd say would come up with such a small discussion is that it would just bring out our own personal biases.

I agree that there's no absolute truth to be found here - we don't have the data for it. I just think that at the end of the day, math is just a tool, and if the goal is to make something people like, and established math creates a result people don't like, then we don't have a reason to use it.
I definitely don't think I made the best possible solution, but that's the extent of the effort I cared to put into it. Regardless, I'm not in control of any decisions anyways.

woctezuma commented 8 months ago

> If you want an environment to experiment with different formulas, I've a website where you can do that. Just find the "Advanced" tab from the sidebar, add a new formula, edit it to what you want, and switch to it. I'd love to know if you find something that works better on all games, not just the top / bottom / middle of the rankings.

Nice, I can just plug some values and get the Bayesian Average!

Bayesian Average

```js
// Prior m blended with the observed score p; the prior's weight alpha shrinks as votes grow.
// With C = 1 and m = 0.5, alpha equals 2^(-log10(votes + 1)), the damping
// factor in the current SteamDB formula.
const C = 1;
const m = 0.50;
const p = game.positiveVotes / game.votes;
const alpha = (C + game.votes) ** Math.log10(m);
return alpha * m + (1 - alpha) * p;
```

[ranking screenshot]

```js
// Standard Bayesian average: acts like C pseudo-votes at the prior mean m.
const C = 1117;
const m = 0.756;
return (C * m + game.positiveVotes) / (C + game.votes);
```

[ranking screenshot]

```js
// Same formula with a much smaller pseudo-vote count and a higher prior.
const C = 17;
const m = 0.822;
return (C * m + game.positiveVotes) / (C + game.votes);
```

[ranking screenshot]

woctezuma commented 7 months ago

I have a minor suggestion for Torn's website: make the custom scoring formulas clickable, as the presets are.


Indeed, if I want to compare SteamDB's formula to Bayesian Average (to confirm that I prefer the latter without any doubt), then I have to use the drop-down menu. It is a minor nitpick: I don't have to switch, I can be confident by now after so many checks. 😸

woctezuma commented 7 months ago

Also, you got the release date of Hearts of Iron IV wrong.

[screenshots: Torn's website vs. the Steam store]

TornOne commented 7 months ago

I'm not sure if the issue tracker for SteamDB is the best place to discuss my site. You can message me on Discord (@tornone) if you want.

But since I'm already replying: if you look at the SteamDB history for HoI4, you can see they changed the release date back and forth 2 days ago. My site must have updated during that window. I don't have as deep a knowledge of Steam as xPaw does, and I'm limited in how often I can update the data on my site. It will fix itself in a day or two, I presume.

As for the formulas being clickable - Unlike presets, the state of which is observable elsewhere, I need a place that shows what the current formula is, hence the dropdown. If I keep the dropdown, but also make the custom additions clickable, then there's now a functional difference in how custom and non-custom formulae can be applied. If I remove the dropdown and add the non-custom ones as clickable buttons and also highlight the currently active one, I'm now consuming more room. There's a tradeoff to be made here, but I'll consider changing it. I have additions planned for the site in the indeterminate future anyways.