`calculate_composite_score()` is a utility function that aggregates multiple evaluation metrics into a single score by weighting each metric according to its importance. This weighted approach facilitates a balanced and comprehensive evaluation of model performance across various criteria, enhancing decision-making in model selection within the `predict_ml()` pipeline.
The function validates that the `scores` and `metric_weights` inputs are dictionaries, checks for their non-emptiness, and confirms that all metrics have corresponding weights:

```python
def calculate_composite_score(scores: dict, metric_weights: dict) -> float:
    # Validate input types and contents.
    if not isinstance(scores, dict) or not isinstance(metric_weights, dict):
        raise TypeError("Both 'scores' and 'metric_weights' must be dictionaries.")
    if not scores or not metric_weights:
        raise ValueError("'scores' and 'metric_weights' cannot be empty.")
    missing_metrics = set(scores.keys()) - set(metric_weights.keys())
    if missing_metrics:
        raise ValueError(f"Missing weights for metrics: {', '.join(missing_metrics)}")

    # Weighted average: multiply each score by its weight, then normalize
    # by the total weight.
    try:
        composite_score = sum(
            score * metric_weights.get(metric, 0)
            for metric, score in scores.items()
        ) / sum(metric_weights.values())
    except Exception as e:
        raise ValueError(f"Error in calculating composite score: {e}")
    return composite_score
```
For example:

```python
scores = {'Accuracy': 0.95, 'Precision': 0.90}
metric_weights = {'Accuracy': 5, 'Precision': 1}
composite_score = calculate_composite_score(scores, metric_weights)
print(f"Composite Score: {composite_score:.2f}")
```
Composite Score Calculation
The composite score approach is used in `model_recommendation_core()`, which is part of the `predict_ml()` pipeline for recommending the best `n` model(s). It aims to synthesize multiple scoring metrics into a single metric that can be used to compare and rank models. This is particularly useful when you have multiple criteria that you consider important for your model's performance, and these criteria might have different scales or directions (i.e., for some metrics, higher is better, while for others, lower is better).
Here’s a breakdown of the calculation (an illustrative sketch follows the list):
1. **Weight Assignment**: Each metric is assigned a weight based on its importance. A higher weight (e.g., 5) is given to prioritized metrics, while a standard weight (e.g., 1) is assigned to others. This allows for emphasis on metrics deemed more critical to the specific problem or domain.
2. **Score Adjustment and Weighted Sum**: Each metric's score is multiplied by its corresponding weight. If a metric benefits from being low (like RMSE), its score can be inverted (e.g., `1/score` or a similar transformation) before applying the weight, to align it with the "higher is better" principle. These weighted scores are then summed to produce a composite score for each model.
3. **Normalization**: The sum of the weighted scores is divided by the sum of the weights. This normalization step ensures that the composite score is not unfairly influenced by the number of metrics or their assigned weights.
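To make these steps concrete, here is a minimal sketch; the metric names, values, and the simple `1/score` inversion are illustrative assumptions, not the pipeline's exact behavior:

```python
scores = {'R2': 0.88, 'RMSE': 2.5}      # raw metric scores (illustrative)
metric_weights = {'R2': 5, 'RMSE': 1}   # step 1: weight assignment
lower_is_better = {'RMSE'}              # metrics needing direction adjustment

# Step 2: adjust direction so every metric is "higher is better",
# then apply the weights and sum.
adjusted = {m: (1 / s if m in lower_is_better else s) for m, s in scores.items()}
weighted_sum = sum(adjusted[m] * metric_weights[m] for m in adjusted)

# Step 3: normalize by the total weight.
composite = weighted_sum / sum(metric_weights.values())
print(f"Composite Score: {composite:.2f}")  # 0.80
```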
Composite Score Calculation Formula
Given a set of metrics $M$, where each metric $m \in M$ has a score $s_m$ and a weight $w_m$, the composite score $C$ for a model can be calculated as:
$$ C = \frac{\sum_{m \in M} w_m \cdot \text{adj}(s_m)}{\sum_{m \in M} w_m} $$
Where $\text{adj}(s_m)$ is the direction-adjusted score: the identity for metrics where higher is better, and an inversion or negation for metrics where lower is better.
This formula allows for a weighted synthesis of multiple performance metrics into a single, normalized score that facilitates direct comparison of models based on a balanced assessment of their performance across the prioritized criteria.
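Plugging the earlier example into the formula (with $\text{adj}$ the identity, since both metrics are higher-is-better):

$$ C = \frac{5 \cdot 0.95 + 1 \cdot 0.90}{5 + 1} = \frac{5.65}{6} \approx 0.94 $$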
Concern: Handling Metrics Where Lower is Better
The concern about metrics where a lower value indicates better performance (like RMSE) is one I have kept in mind. The composite score calculation can accommodate such metrics through inversion or negation, ensuring that all metrics effectively operate in a "higher is better" framework, so that the composite score remains meaningful and consistent.
- **Inversion**: For a metric where lower is better, one approach is to invert the score (e.g., `1 / score`). This transformation means that a lower original score (which is better) results in a higher inverted score, aligning it with the composite score logic.
- **Negation**: Another approach is to use negation, especially if the scoring function directly supports it (e.g., `neg_mean_squared_error`). The negative value ensures that optimization routines aiming to maximize the score are consistent across all metrics.

When integrating such scores into the composite score calculation, the key is to ensure all metrics are on a consistent scale and direction, so that the composite score effectively reflects the model's overall performance according to the prioritized criteria.
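As a concrete instance of negation, scikit-learn exposes negated error scorers; a short sketch (the dataset and model here are arbitrary, for illustration only):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Arbitrary illustrative data and model.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
model = LinearRegression()

# 'neg_mean_squared_error' returns the negated MSE, so "higher is better"
# holds: a smaller error yields a larger (less negative) score.
neg_mse = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=5)
print(f"Mean negated MSE: {neg_mse.mean():.2f}")
```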
This approach allows for a nuanced comparison of models, balancing the trade-offs between different performance metrics in a way that aligns with the specific objectives and preferences for the modelling task at hand.
Review of Metrics' Adherence to 'Higher is Better' Framework
Classification Metrics
Regression Metrics
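While the detailed per-metric review isn't reproduced here, a rough summary of common metric directions (standard conventions; the exact metric set used by `model_recommendation_core()` is an assumption) might look like:

```python
# Direction of common metrics, per standard conventions. This is an
# illustrative summary, not the pipeline's authoritative metric list.
metric_direction = {
    # Classification metrics: all already 'higher is better'.
    'Accuracy': 'higher', 'Precision': 'higher', 'Recall': 'higher',
    'F1': 'higher', 'ROC AUC': 'higher',
    # Regression metrics: error measures need inversion/negation; R2 does not.
    'MSE': 'lower', 'RMSE': 'lower', 'MAE': 'lower', 'R2': 'higher',
}
```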