This PR adds a shap command to the internal CLI to help explain a specific (per-sql) XGBoost prediction.
Usage:
python qualx_main.py shap --help
Example:
python qualx_main.py shap --platform $PLATFORM \
--prediction_output /path/to/prediction/output \
--index 0
# --model $MODEL # optional
# --index should be a numeric zero-based index pointing to a specific line (i.e. sqlID) in the `shap_values.csv` file.
# Each line in this file corresponds to the same line (sqlID) in the `per_sql.csv` file.
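To illustrate how `--index` lines up the two files, here is a minimal sketch of looking up the same row in `shap_values.csv` and `per_sql.csv`. It assumes both files have a header row and that the index counts data rows only; `row_at` is a hypothetical helper, not part of the CLI.

```python
import csv

def row_at(csv_path, index):
    """Return the data row at a zero-based index, skipping the header row."""
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            if i == index:
                return row
    raise IndexError(f"{csv_path} has no data row {index}")
```

Calling `row_at` with the same index on both files would return the shap values and the per-sql record for the same sqlID, under the assumptions above.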
In the command output, the features are listed in order of importance (absolute value of shap_value), similar to a SHAP waterfall plot.
model_rank shows the feature importance rank on the training set.
model_shap_value shows the feature shap_value on the training set.
train_[mean|std|min|max] show the mean, standard deviation, min and max values of the feature in the training set.
train_[25%|50%|75%] show the feature value at the respective percentile in the training set.
feature_value shows the value of the feature used in prediction (for the indexed row/sqlID).
out_of_range indicates if the feature_value used in prediction was outside of the range of values seen in the training set.
Shap base value is the model's average prediction across the entire training set.
Shap values sum is the sum of the shap_value column for this indexed instance.
Shap prediction is the sum of Shap base value and Shap values sum, representing the model's predicted value.
exp(prediction) is the exponential of Shap prediction, which represents the predicted speedup (since the XGBoost model currently predicts log(speedup)).
The predicted speedup (which should match y_pred in per_sql.csv) is applied to the "supported" durations and combined with the "unsupported" durations to produce a final per-sql speedup (speedup_pred in per_sql.csv).
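The relationship between the summary values can be checked by hand. The sketch below uses made-up numbers, not real model output; `base_value`, `shap_values`, and the durations are hypothetical. The last step is only an assumption about how the speedup is applied to the supported duration and is not taken from the tool's code.

```python
import math

# Hypothetical numbers for one sqlID; not real model output.
base_value = 0.50                    # "Shap base value"
shap_values = [0.30, -0.10, 0.05]    # the shap_value column for this row

shap_sum = sum(shap_values)          # "Shap values sum"
prediction = base_value + shap_sum   # "Shap prediction", in log(speedup) space
speedup = math.exp(prediction)       # "exp(prediction)", expected to match y_pred

# Assumed combination: only the supported duration is sped up,
# the unsupported duration is carried over unchanged.
supported_ms, unsupported_ms = 8000.0, 2000.0
estimated_ms = supported_ms / speedup + unsupported_ms
final_speedup = (supported_ms + unsupported_ms) / estimated_ms
```

Under this assumption the final per-sql speedup is always smaller than the model's predicted speedup whenever some of the duration is unsupported.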
Changes
Added features.csv to save the feature values used for prediction.
Moved the current shap_values.csv to feature_importance.csv (which is more descriptive of its purpose).
Used shap_values.csv to save all of the shap values per feature per instance/sqlID during prediction.
Saved a model.metrics file (for each model) during training to store the feature shap values and distribution metrics of the training set.
Renamed the model.json.cfg files to model.cfg to avoid the double-suffix.
Refactored/combined the compute_feature_importance and compute_shapley_values functions.
Updated internal predict CLI to support --qual_output argument.
Added shap command to internal CLI, which joins the prediction shap_values with the training shap_values and distribution metrics.
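The join performed by the shap command can be pictured roughly as follows. This is a simplified sketch, not the actual implementation: the feature names, the dictionary layout, and the `explain` helper are all hypothetical, and the real files carry more columns.

```python
# Hypothetical per-feature inputs for one sqlID.
prediction_shap = {"feature_a": -0.30, "feature_b": 0.10}   # row from shap_values.csv
feature_values = {"feature_a": 120.0, "feature_b": 3.0}     # row from features.csv
train_metrics = {                                           # loaded from model.metrics
    "feature_a": {"model_rank": 2, "train_min": 0.0, "train_max": 100.0},
    "feature_b": {"model_rank": 1, "train_min": 1.0, "train_max": 10.0},
}

def explain(prediction_shap, feature_values, train_metrics):
    """Join prediction-time values with training-set metrics, per feature."""
    rows = []
    for name, shap_value in prediction_shap.items():
        metrics = train_metrics[name]
        value = feature_values[name]
        rows.append({
            "feature": name,
            "shap_value": shap_value,
            "model_rank": metrics["model_rank"],
            "feature_value": value,
            # Flag values outside the range seen during training.
            "out_of_range": value < metrics["train_min"] or value > metrics["train_max"],
        })
    # Order by absolute shap_value, largest first, like a waterfall plot.
    rows.sort(key=lambda r: abs(r["shap_value"]), reverse=True)
    return rows

rows = explain(prediction_shap, feature_values, train_metrics)
```

Sorting by absolute value keeps large negative contributions at the top, which is why `feature_a` comes first here despite its lower training rank.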
Test
The following commands have been tested:
External Usage:
Internal Usage: