BlueBrain / morphoclass

Neuronal morphology preparation and classification using Machine Learning.
https://morphoclass.readthedocs.io
Apache License 2.0

Compute "chance agreement" baseline #49

Closed FrancescoCasalegno closed 2 years ago

FrancescoCasalegno commented 2 years ago

Background

The following idea was inspired by the term p_e ("probability of chance agreement") in Cohen's Kappa definition.
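For context, Cohen's Kappa normalizes the observed agreement p_o by the chance agreement p_e, i.e. kappa = (p_o - p_e) / (1 - p_e). A minimal illustration on toy labels (the two "raters" below are hypothetical data, not from morphoclass):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical annotations from two "raters".
y1 = np.array([0, 0, 1, 1, 2, 2])
y2 = np.array([0, 0, 1, 2, 2, 2])

# Observed agreement p_o: fraction of samples where the raters agree.
p_o = np.mean(y1 == y2)

# Chance agreement p_e: sum over labels of the product of each
# rater's marginal label probabilities.
K = 3
p1 = np.bincount(y1, minlength=K) / len(y1)
p2 = np.bincount(y2, minlength=K) / len(y2)
p_e = np.sum(p1 * p2)

kappa = (p_o - p_e) / (1 - p_e)
print(np.isclose(kappa, cohen_kappa_score(y1, y2)))  # True
```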

Formula for "chance agreement"

For a (multi-class) classification problem, let's consider the vector of ground-truth labels y_true. If we assume that the dataset accurately represents the proportions of each label, then the probability that any given sample has label k (for k in 1...K) is

p_k = n_k / N

where n_k is the number of samples in y_true with label equal to k, and N is the total number of samples in y_true.
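As a quick illustration on toy labels (assumed here, not from morphoclass), the probabilities p_k can be computed with numpy:

```python
import numpy as np

# Hypothetical ground-truth labels with K = 3 classes.
y_true = np.array([0, 0, 0, 1, 1, 2])

# p_k = n_k / N for each observed label k.
labels, counts = np.unique(y_true, return_counts=True)
p = counts / len(y_true)
print(p)  # probabilities sum to 1
```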

Based on this observation, let's consider a model that predicts y_pred by assigning to each sample, independently of the other samples, a random label drawn according to the observed occurrence probabilities. This means that the predicted label of the i-th sample, y_pred[i], is distributed as

y_pred[i] ~ Categorical(p_1, ..., p_K)

Then, the probability of the event y_true[i] == y_pred[i] is computed using the Law of Total Probability as

P(y_true[i] == y_pred[i]) = sum_{k=1..K} P(y_true[i] == k) * P(y_pred[i] == k) = sum_{k=1..K} p_k^2

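A minimal sketch of this computation (the actual morphoclass.metrics.chance_agreement implementation may differ):

```python
import numpy as np

def chance_agreement_sketch(y_true):
    """Return sum_k p_k**2, the expected accuracy of a stratified random guesser."""
    _, counts = np.unique(np.asarray(y_true), return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p ** 2))

# Perfectly balanced labels over K classes give 1/K.
print(chance_agreement_sketch([0, 1, 2, 0, 1, 2]))  # ~ 1/3
```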

FrancescoCasalegno commented 2 years ago

Test approach using DummyClassifier(strategy="stratified")

The sklearn model DummyClassifier(strategy="stratified") gives predictions by drawing (random!) samples from the empirical label distribution, i.e.

y_pred[i] ~ Categorical(p_1, ..., p_K), with p_k = n_k / N
To test that our formula above is correct, we can draw many predictions from this dummy classifier and check that the average accuracy matches (up to random error due to the Monte Carlo approximation) the value given by the formula used in our implementation.

Script

from __future__ import annotations

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

from morphoclass.metrics import chance_agreement

N_ELEMENTS = 10
N_MONTECARLO_SAMPLES = 10_000

np.random.seed(42)

# Random ground-truth labels from 3 classes; the features are irrelevant
# for a dummy classifier, so use a dummy feature column of zeros.
y = np.random.randint(low=0, high=3, size=N_ELEMENTS)
x = np.zeros_like(y).reshape((-1, 1))

# The "stratified" strategy predicts labels drawn at random from the
# empirical class distribution of y.
model = DummyClassifier(strategy="stratified")
model.fit(x, y)

# Each predict() call is a fresh random draw, so averaging the accuracy
# over many calls gives a Monte Carlo estimate of the chance agreement.
accs = [
    accuracy_score(y_true=y, y_pred=model.predict(x))
    for _ in range(N_MONTECARLO_SAMPLES)
]

acc_montecarlo = np.mean(accs)
acc_theory = chance_agreement(y)

print(f"Expected mean chance agreement: {acc_theory:.3f}")
print(f"Observed mean chance agreement: {acc_montecarlo:.3f}")

Output

Expected mean chance agreement: 0.460
Observed mean chance agreement: 0.462