Compute "chance agreement" baseline

BlueBrain / morphoclass

Neuronal morphology preparation and classification using Machine Learning.

Apache License 2.0

Background

The following idea was inspired by the term p_e ("probability of chance agreement") in Cohen's Kappa definition.

Formula for "chance agreement"

For a (multi-class) classification problem, consider the vector of ground-truth labels y_true. If we assume that the dataset accurately represents the proportions of each label, then the probability that any given sample has label k (for k in 1...K) is:

$\hat{p}_k = \frac{n_k}{N}$

where n_k is the number of samples in y_true with label equal to k, and N is the total number of samples in y_true.
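As a small illustration of this estimate, the label frequencies can be computed with NumPy. This is only a sketch: the example labels and the variable names (`labels`, `counts`, `p_hat`) are made up for illustration and do not come from morphoclass.

```python
import numpy as np

# Hypothetical example labels; any 1-D array of class labels works.
y_true = np.array([0, 0, 1, 2, 2, 2])

# counts[j] is n_k for the j-th distinct label, counts.sum() is N.
labels, counts = np.unique(y_true, return_counts=True)
p_hat = counts / counts.sum()

for label, p in zip(labels, p_hat):
    print(f"label {label}: p_hat = {p:.3f}")
```

The resulting `p_hat` always sums to 1, since the counts add up to N.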

Based on this observation, consider a model that produces y_pred by assigning to each sample, independently of the other samples, a random label drawn according to the observed occurrence probabilities. This means that the predicted label of the i-th sample, y_pred[i], is given by

$y_{pred}[i] \sim \text{Categorical}(\hat{p}_1, \ldots, \hat{p}_K) = \text{Categorical}\left(\frac{n_1}{N}, \ldots, \frac{n_K}{N}\right)$
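A possible way to draw such random predictions directly with NumPy is sketched below. The example labels, the seed and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

y_true = np.array([0, 0, 1, 2, 2, 2])  # hypothetical example labels
labels, counts = np.unique(y_true, return_counts=True)
p_hat = counts / counts.sum()

# Draw one label per sample, independently, with probabilities p_hat.
y_pred = rng.choice(labels, size=y_true.shape[0], p=p_hat)
print(y_pred)
```

Each entry of `y_pred` is drawn without looking at the corresponding entry of `y_true`, which is the independence assumption used in the text.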

Then, the probability of the event y_true[i] == y_pred[i] is computed using the Law of Total Probability as

$p(y_{pred}[i] = y_{true}[i]) = \sum_{k=1}^K \left(\frac{n_k}{N}\right)^2$
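The intermediate step can be spelled out by conditioning on the true label. This assumes, as above, that the true label of a sample follows the same empirical distribution and that the prediction is drawn independently of it:

```latex
\begin{aligned}
p(y_{pred}[i] = y_{true}[i])
  &= \sum_{k=1}^K p(y_{pred}[i] = k \mid y_{true}[i] = k)\, p(y_{true}[i] = k) \\
  &= \sum_{k=1}^K \hat{p}_k \cdot \hat{p}_k
   = \sum_{k=1}^K \left(\frac{n_k}{N}\right)^2
\end{aligned}
```

For example, with label counts (2, 1, 3) and N = 6 this gives (4 + 1 + 9) / 36 = 14/36 ≈ 0.389, and for K perfectly balanced classes it reduces to 1/K.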

Actions

[ ] Implement "chance accuracy" as a metric in make_performance_table()
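A minimal sketch of what such a metric could look like is given below. The name `chance_agreement` matches the import used in the test script of this issue, but the signature, the input validation and the return type are assumptions. How it would be wired into make_performance_table() is not covered here.

```python
import numpy as np


def chance_agreement(y_true):
    """Expected accuracy of a classifier that samples labels at random
    from the empirical label distribution of ``y_true``.

    Sketch only: signature and behavior are assumptions, not the
    morphoclass implementation.
    """
    y_true = np.asarray(y_true)
    if y_true.size == 0:
        raise ValueError("y_true must contain at least one label")
    # n_k for each distinct label; dividing by N gives p_hat_k.
    _, counts = np.unique(y_true, return_counts=True)
    p_hat = counts / counts.sum()
    # Sum of squared label frequencies.
    return float(np.sum(p_hat**2))


print(chance_agreement([0, 0, 1, 2, 2, 2]))  # (2/6)^2 + (1/6)^2 + (3/6)^2
print(chance_agreement(["a", "b"]))  # two balanced classes
```

A dataset with a single class yields 1.0, and K balanced classes yield 1/K, which gives two quick sanity checks for any implementation.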

Test approach using DummyClassifier(strategy="stratified")

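The sklearn model DummyClassifier(strategy="stratified") generates predictions by drawing (random!) samples from the empirical class distribution of the labels it was fitted on, i.e. from $\text{Categorical}\left(\frac{n_1}{N}, \ldots, \frac{n_K}{N}\right)$, the same distribution as in the formula above.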

To test whether the formula above is correct, we can draw many prediction vectors from this dummy classifier and check that their mean accuracy matches (up to the random error of the Monte Carlo approximation) the value given by the formula used in our implementation.

Script

```python
from __future__ import annotations

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

from morphoclass.metrics import chance_agreement

N_ELEMENTS = 10
N_MONTECARLO_SAMPLES = 10_000

np.random.seed(42)
y = np.random.randint(low=0, high=3, size=N_ELEMENTS)
x = np.zeros_like(y).reshape((-1, 1))

model = DummyClassifier(strategy="stratified")
model.fit(x, y)
accs = [
    accuracy_score(y_true=y, y_pred=model.predict(x))
    for _ in range(N_MONTECARLO_SAMPLES)
]

acc_montecarlo = np.mean(accs)
acc_theory = chance_agreement(y)

print(f"Expected mean chance agreement: {acc_theory:.3f}")
print(f"Observed mean chance agreement: {acc_montecarlo:.3f}")
```
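Since the script above depends on `morphoclass.metrics.chance_agreement`, which this issue only proposes, a self-contained variant may be handy for a quick check. The sketch below replaces that import with an inline computation of the formula and adds a tolerance based on the Monte Carlo standard error. The 5-standard-error threshold and the reduced sample count are arbitrary choices, not part of the original proposal.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

N_ELEMENTS = 10
N_MONTECARLO_SAMPLES = 2_000  # smaller than in the script above, to run fast

np.random.seed(42)
y = np.random.randint(low=0, high=3, size=N_ELEMENTS)
x = np.zeros((N_ELEMENTS, 1))  # the features are ignored by DummyClassifier

# Inline stand-in for chance_agreement(y): sum of squared label frequencies.
_, counts = np.unique(y, return_counts=True)
acc_theory = float(np.sum((counts / counts.sum()) ** 2))

# random_state is left at None so that predictions use NumPy's global
# generator seeded above. An integer random_state is deliberately not
# passed, since that appears to re-seed on every predict() call and
# would repeat the same draw.
model = DummyClassifier(strategy="stratified")
model.fit(x, y)
accs = np.array(
    [accuracy_score(y, model.predict(x)) for _ in range(N_MONTECARLO_SAMPLES)]
)

acc_montecarlo = accs.mean()
std_error = accs.std(ddof=1) / np.sqrt(N_MONTECARLO_SAMPLES)

print(f"Expected mean chance agreement: {acc_theory:.3f}")
print(f"Observed mean chance agreement: {acc_montecarlo:.3f} +/- {std_error:.3f}")

# The two values should agree within a few standard errors.
assert abs(acc_montecarlo - acc_theory) < 5 * std_error + 1e-12
```

Reporting the standard error alongside the observed mean makes it easier to judge whether a remaining gap is Monte Carlo noise or a real discrepancy in the formula.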
