bellroy / predictionbook-legacy

PredictionBook is a site to record and track your predictions.
http://predictionbook.com
MIT License

User statistics: proper scoring rule #32

Open gwern opened 12 years ago

gwern commented 12 years ago

I would like user pages to include a more precise estimate of a user's quality - using a proper scoring rule. The simplest is the log scoring rule, which is very easy to implement. Here is a version in Haskell, excerpted from my Nootropics essay where I am judging my Adderall predictions:

you 'earn' the logarithm of the probability if you were right, and the logarithm of the complement (1 - p) if you were wrong; he who racks up the fewest negative points wins. We feed in a list and get back a number:

logScore ps = sum $ map (\(result,p) -> if result then log p else log (1-p)) ps
logScore [(True,0.95),(False,0.30),(True,0.85),(True,0.75),(False,0.50),(False,0.25),(False,0.60),(True,0.70),(True,0.65),(True,0.60),(False,0.30),(True,0.50),(True,0.90),(True,0.40)]
~> -6.125

In this case, a blind guesser would guess 50% every time (roughly half the days were Adderall and roughly half were not) so the question is, did the 50% guesser beat me?

(14 * log 0.5)
~> -9.704
(14 * log 0.5) > logScore [(True,0.95),(False,0.30),(True,0.85),(True,0.75),(False,0.50),(False,0.25),(False,0.60),(True,0.70),(True,0.65),(True,0.60),(False,0.30),(True,0.50),(True,0.90),(True,0.40)]
~> False

So I had a palpable edge over the random guesser, although the sample size is not fantastic.


The best way would be to divide the log score of the equivalent number of 50% guesses - since every prediction on PB.com is a binary prediction - by the user's actual log score. If you scored, say, -15 and the random guesser scored -20, then -20/-15 ≈ 1.33. Higher is better; if later the random guesser has -25 and you have -17, you did even better, since now you earn -25/-17 ≈ 1.47.

Scaling this up to PB users seems easy: for the n judged predictions, sum the log of each one's final probability (log p if it came true, log (1 - p) if not), and then divide n * log 0.5 by that sum.
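As a rough sketch of what that could look like in Haskell (userScore is just an illustrative name, not anything PB defines, and it assumes the same (outcome, final probability) pairs as the example above):

-- Log score of a list of (outcome, final probability) pairs, as above.
logScore :: [(Bool, Double)] -> Double
logScore = sum . map (\(result, p) -> if result then log p else log (1 - p))

-- Proposed per-user statistic: the 50% guesser's score divided by the user's score.
-- Both sums are negative, so anything above 1 means the user beat the blind guesser.
userScore :: [(Bool, Double)] -> Double
userScore judged = (n * log 0.5) / logScore judged
  where n = fromIntegral (length judged)

On the example list above this gives roughly -9.704 / -6.125 ≈ 1.58.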

Display-wise, this is easy: just report the number. In a user page like http://predictionbook.com/users/gwern one can probably just tack on an additional column to the row: like

| Score |
| 1.47 |
matthewfallshaw commented 12 years ago

I would also like this.

Scaling this up to PB users seems easy: for the n judged predictions, sum the log of each one's final probability (log p if it came true, log (1 - p) if not), and then divide n * log 0.5 by that sum.

Is summing the logs of every probability better than summing just the last probability? If you estimate an event 5 days out, then 4 days out, etc., each of those estimates should pay out, not just the last one.
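For example, a sketch of what paying out every estimate might look like, assuming each prediction carries its outcome plus all the probabilities assigned to it over time (not necessarily how PB stores them):

-- Score every recorded estimate, not just the final one.
allEstimatesScore :: [(Bool, [Double])] -> Double
allEstimatesScore preds =
  sum [ log (if result then p else 1 - p) | (result, ps) <- preds, p <- ps ]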

I may be displaying my ignorance here… I wonder whether we can come up with a more interesting comparison value. I'm probably hunting over well-trodden ground, so feel free to point me in the direction of the relevant prior art, but if I understand the math:

  1. the log score on its own doesn't tell you much - you can only ever lose points, so the dominant strategy is not to play;
  2. comparing to the random guesser as you've suggested is hopefully a fairly easy bar to get over - you only have to have real information to win;
  3. comparing to the average pbook score might be more interesting - if across all pbook users the average prediction scores a -0.6, divide n × -0.6 by your sum, but this punishes you for being a perfectly calibrated predictor with low information;
  4. we could compare you to a perfectly calibrated predictor with imperfect information - would a prediction of 70% from a perfectly calibrated predictor score an average of 0.7 × ln(0.7) + 0.3 × ln(0.3) = -0.61? (see the sketch below) … here comparing to perfection is tough, but at least your average score should be a constant measure of your calibration (right?);
  5. we could compare you to the average pbook user's score against 4.

Are any of these interesting enough that you can turn them into a good idea?
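A rough sketch of the benchmark in (4), assuming it is just the expected log score of a perfectly calibrated predictor quoting the same probabilities (the names here are only illustrative):

-- Expected log score of a perfectly calibrated predictor who states probability p:
-- the event happens with probability p (scoring log p), otherwise it scores log (1 - p).
-- (0% and 100% predictions would need special handling.)
calibratedExpected :: Double -> Double
calibratedExpected p = p * log p + (1 - p) * log (1 - p)

-- Benchmark for a set of judged predictions: the average-case score a perfectly
-- calibrated predictor stating the same probabilities would earn.
calibratedBenchmark :: [(Bool, Double)] -> Double
calibratedBenchmark = sum . map (calibratedExpected . snd)

calibratedExpected 0.7 comes out to about -0.61, matching the figure in (4).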

drpowell commented 11 years ago

I find it more natural to use log score with base-2 logs so scores can be interpreted as "bits". And a scoring function with 1+log_2(p) for correct, and 1+log_2(1-p) for incorrect.

This is equivalent to subtracting the score for the "blind guesser" in the scoring function. It also has the nice behaviour that higher is better and +ve means doing better than "random". (Interestingly, the score here is also the number of bits better than a "null-compressor" that you could encode the outcomes in.)
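A sketch of that scoring function, using the same (outcome, probability) pairs as gwern's example (bitsScore is just an illustrative name):

-- Score in bits: 1 + log2 p for a correct prediction, 1 + log2 (1 - p) otherwise.
-- Summing this equals the base-2 log score minus the blind 50% guesser's score,
-- so a positive total means you are beating random.
bitsScore :: [(Bool, Double)] -> Double
bitsScore = sum . map score
  where
    score (result, p) = 1 + logBase 2 (if result then p else 1 - p)

On gwern's example list this comes out to about +5.2 bits.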

We have also used Matt's idea (4) above as an ancillary measure we called boldness. The interpretation being that a user with a positive boldness is, on average, tipping with higher probabilities than they should for an optimal score.

We have been using this for tipping football results for quite some time.

vyu commented 11 years ago

I like the idea.

I find it more natural to use log score with base-2 logs so scores can be interpreted as "bits". And a scoring function with 1+log_2(p) for correct, and 1+log_2(1-p) for incorrect.

I agree that this proposed scoring function with binary logs seems more natural.

In addition to what has been proposed, I think it may be interesting to compare user performance to that of a generalized guesser that computes a probability from some information available to it, with the user's score defined relative to that guesser's probability.

A particular class of guessers that I have in mind is one which combines the probability estimates given by other users, i.e. some probability aggregation function applied to the vector of other users' estimates. A quick search yields a highly cited review on combining probability distributions by Clemen and Winkler (1999), with section 2.2.1 being relevant (they give more details in their 1990 paper). There is no universally superior function; the simplest is a weighted average, which they call the Bernoulli model.

By restricting the inputs to the probabilities given by other users before a specific user has made his estimate, the aggregation becomes a guesser that takes into account only information available to that user as he makes his prediction.
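As a rough sketch, assuming a plain weighted average as the aggregation function and a bits-style comparison against the aggregate (the names and the exact comparison here are only illustrative, not a settled proposal):

-- A weighted average of other users' earlier estimates: the simplest aggregation
-- rule; the weights could reflect past performance, or just all be 1.
aggregateGuess :: [(Double, Double)] -> Double   -- (weight, probability) pairs
aggregateGuess ws = sum [w * p | (w, p) <- ws] / sum [w | (w, _) <- ws]

-- Score one judged prediction in bits relative to the aggregate guesser:
-- positive means the user put more probability on what actually happened
-- than the aggregate of earlier estimates did.
relativeScore :: Bool -> Double -> Double -> Double
relativeScore outcome userP aggP
  | outcome   = logBase 2 userP - logBase 2 aggP
  | otherwise = logBase 2 (1 - userP) - logBase 2 (1 - aggP)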

Another idea is to scale the scoring function based on some measure of the difficulty of the prediction. Again, the obvious data to use are the predictions by other users. I haven't yet looked at the literature, so I don't know what sort of function would be reasonable. (Intuitively, correctly anticipating an event that others have predicted with probabilities {0.1, 0.2, 0.15} is more impressive than one with {0.8, 0.9, 0.85}. But, hmm... this is sort of taken into account by what I proposed above.)

gwern commented 11 years ago

Google+ discussion: https://plus.google.com/u/0/103530621949492999968/posts/AVk4tGYibVP

Is summing the logs of every probability better than summing just the last probability? If you estimate an event 5 days out, then 4 days out, etc., each of those estimates should pay out, not just the last one.

I don't really know. Isn't PB currently using only the last prediction?

the log score on its own doesn't tell you much - you can only ever lose points, so the dominant strategy is not to play;

Well, one could say that of calibration too: with every prediction that isn't perfectly calibrated, you are losing calibration points. (If you blow one 0/100% prediction, no number of predictions will restore your original perfect calibration.)

comparing to the random guesser as you've suggested is hopefully a fairly easy bar to get over - you only have to have real information to win;

Sure. But again, this is also true of calibration - all you have to do is not be over or underconfident, and you need some information in order to be calibrated for any decile other than 50%!

This is why I included the random guesser: to provide a nicely increasing number people can feel happy or sad about, and one which gives some sort of comparability across users.

we could compare you to a perfectly calibrated predictor with imperfect information - would a prediction of 70% from a perfectly calibrated predictor score an average of 0.7 × ln(0.7) + 0.3 × ln(0.3) = -0.61? … here comparing to perfection is tough, but at least your average score should be a constant measure of your calibration (right?);

I don't know how that would work... you mean, take every prediction of yours by decile and compare it against a random predictor with the base rate of that decile? Not sure that's legit.

drpowell: interesting, I didn't know that was what it was equivalent to. Does using base-2 with those scores have a name or is it just generally understood by stats/information-theory folks that that is what one is supposed to be doing?

'Boldness' sounds kind of useful, but it's a more advanced metric than anything PB currently has, so I think it'd be better to start with something more immediate. (Ditto for vyu's suggestion.)

matthewfallshaw commented 11 years ago

Hmm… good stuff here: http://www.csse.monash.edu.au/~footy/about.shtml