Closed FrancescoCasalegno closed 2 years ago
DummyClassifier(strategy="stratified")
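The sklearn model DummyClassifier(strategy="stratified") makes predictions by drawing (random!) samples from the empirical class distribution of the training labels: it predicts label k with probability $n_k / N$, where $n_k$ is the number of training samples with label k and $N$ is the total number of training samples.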
To test whether the chance-agreement formula is correct, we can draw many samples from this dummy classifier and check that their mean accuracy matches the formula used in our implementation, up to the random error of the Monte Carlo approximation.
from __future__ import annotations

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

from morphoclass.metrics import chance_agreement

N_ELEMENTS = 10
N_MONTECARLO_SAMPLES = 10_000

# Random ground-truth labels in {0, 1, 2}; the features are irrelevant for a dummy model.
np.random.seed(42)
y = np.random.randint(low=0, high=3, size=N_ELEMENTS)
x = np.zeros_like(y).reshape((-1, 1))

model = DummyClassifier(strategy="stratified")
model.fit(x, y)

# Each call to predict() draws a fresh random labelling, so we average over many draws.
accs = [
    accuracy_score(y_true=y, y_pred=model.predict(x))
    for _ in range(N_MONTECARLO_SAMPLES)
]
acc_montecarlo = np.mean(accs)
acc_theory = chance_agreement(y)

print(f"Expected mean chance agreement: {acc_theory:.3f}")
print(f"Observed mean chance agreement: {acc_montecarlo:.3f}")
Expected mean chance agreement: 0.460
Observed mean chance agreement: 0.462
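As a rough illustration of what is being compared here, the sketch below computes the sum of squared class frequencies directly and checks it against a NumPy-only simulation of independent stratified guessing. The helper names `chance_agreement_sketch` and `simulate_stratified_accuracy` and the example label array are assumptions made for this illustration; this is not the actual `morphoclass.metrics.chance_agreement` implementation.

```python
import numpy as np


def chance_agreement_sketch(y_true):
    """Hypothetical helper: sum over classes of (n_k / N) ** 2."""
    _, counts = np.unique(np.asarray(y_true), return_counts=True)
    p = counts / counts.sum()  # empirical class frequencies n_k / N
    return float(np.sum(p**2))


def simulate_stratified_accuracy(y_true, n_samples=20_000, seed=0):
    """Hypothetical helper: mean accuracy of independent random guesses
    drawn from the empirical class frequencies of y_true."""
    y_true = np.asarray(y_true)
    labels, counts = np.unique(y_true, return_counts=True)
    p = counts / counts.sum()
    rng = np.random.default_rng(seed)
    # One row per simulated prediction vector, one column per sample.
    y_pred = rng.choice(labels, size=(n_samples, y_true.size), p=p)
    return float(np.mean(y_pred == y_true))


# Illustrative labels (an assumption) with class counts 3, 1 and 6.
y = np.array([2, 0, 2, 2, 0, 0, 2, 1, 2, 2])
expected = chance_agreement_sketch(y)
observed = simulate_stratified_accuracy(y)
print(f"formula:   {expected:.3f}")
print(f"simulated: {observed:.3f}")
```

For class counts 3, 1 and 6 out of 10, the formula gives 0.3² + 0.1² + 0.6² = 0.46; the simulated value should agree up to Monte Carlo noise.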
Background
The following idea was inspired by the term p_e ("probability of chance agreement") in the definition of Cohen's kappa.
Formula for "chance agreement"
For a (multi-class) classification problem, consider the vector of ground-truth labels y_true. If we assume that the dataset accurately represents the proportions of each label, then the probability that any given sample has label k (for k in 1...K) is:

$$P(\mathtt{y\_true}[i] = k) = \frac{n_k}{N}$$

where n_k is the number of samples in y_true with label equal to k, and N is the total number of samples in y_true.

Based on this observation, consider a model that predicts y_pred by assigning to each sample, independently of the other samples, a random label drawn according to the observed occurrence probabilities. This means that the predicted label of the i-th sample, y_pred[i], is given by

$$P(\mathtt{y\_pred}[i] = k) = \frac{n_k}{N}$$

Then the probability of the event y_true[i] == y_pred[i] is computed using the Law of Total Probability as

$$P(\mathtt{y\_true}[i] = \mathtt{y\_pred}[i]) = \sum_{k=1}^{K} P(\mathtt{y\_pred}[i] = k \mid \mathtt{y\_true}[i] = k) \, P(\mathtt{y\_true}[i] = k) = \sum_{k=1}^{K} \left(\frac{n_k}{N}\right)^2$$

where the last step uses the fact that y_pred[i] is drawn independently of y_true[i].
Actions
make_performance_table()