True positives: Training examples that belong to this class and are correctly labelled as this class by the classifier. For example, if the true label is 1, and the predicted label is 1, the example would be a true positive.
False positives: Training examples that do not belong to this class and are incorrectly labelled as this class by the classifier. For example, if the true label is 0, and the predicted label is 1, the example would be a false positive.
True negatives: Training examples that do not belong to this class and are correctly labelled as not in this class by the classifier. For example, if the true label is 0, and the predicted label is 0, the example would be a true negative.
False negatives: Training examples that belong to this class and are incorrectly labelled as not in this class by the classifier. For example, if the true label is 1, and the predicted label is 0, the example would be a false negative.
|              | Predicted 1     | Predicted 0     |
|--------------|-----------------|-----------------|
| True value 1 | True positives  | False negatives |
| True value 0 | False positives | True negatives  |
There are several other metrics for comparing models that combine these counts into a single number:
Accuracy: is defined as (true positives + true negatives)/total examples, or the proportion of examples that were classified correctly. This is an appropriate metric to use for a dataset where the positive and negative classes are equally balanced, but it can be misleading if the dataset is imbalanced.
Precision: is defined as true positives/(true positives + false positives), or the proportion of examples predicted to be in the positive class that were classified correctly. So if a classifier has high precision, most of the examples it predicts as belonging to the positive class will indeed belong to the positive class.
Recall: is defined as true positives/(true positives + false negatives), or the proportion of examples where the ground truth is positive that the classifier correctly identified. So if a classifier has high recall, it will correctly identify most of the examples that are truly in the positive class.
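As a quick illustration, these definitions translate directly into code; the counts below are hypothetical, as if read off a confusion matrix:

```python
# Hypothetical counts from a confusion matrix.
tp, fp, tn, fn = 90, 10, 880, 20

accuracy = (tp + tn) / (tp + fp + tn + fn)  # correctly classified / all examples
precision = tp / (tp + fp)                  # how many predicted positives are truly positive
recall = tp / (tp + fn)                     # how many true positives the classifier found

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```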
Regression Metrics: In a regression problem, the model predicts some numerical value for each training example, and this is compared with the actual value. Common regression metrics we can use in TFMA include:
Mean absolute error (MAE)
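For example, MAE is simply the average absolute difference between predictions and ground truth. A minimal sketch with hypothetical values:

```python
import numpy as np

y_true = np.array([3.0, 1.5, 2.0])  # hypothetical ground-truth values
y_pred = np.array([2.5, 1.0, 2.5])  # hypothetical model predictions

mae = np.mean(np.abs(y_true - y_pred))  # mean absolute error
print(mae)  # 0.5
```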
We can install TFMA with:
$ pip install tensorflow-model-analysis
It takes a saved model and an evaluation dataset as input.
First, the SavedModel must be converted to an EvalSharedModel:
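```python
import tensorflow as tf
import tensorflow_model_analysis as tfma

eval_shared_model = tfma.default_eval_shared_model(
    eval_saved_model_path=_MODEL_DIR,
    tags=[tf.saved_model.SERVING]
)
```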
Next, we provide an EvalConfig. In this step, we tell TFMA what our label is, provide any specifications for slicing the model by one of the features, and stipulate all the metrics we want TFMA to calculate and display:
```python
eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key='consumer_disputed')],
    slicing_specs=[tfma.SlicingSpec()],
    metrics_specs=[
        tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(class_name="BinaryAccuracy"),
            tfma.MetricConfig(class_name="ExampleCount"),
            tfma.MetricConfig(class_name="FalsePositives"),
            tfma.MetricConfig(class_name="TruePositives"),
            tfma.MetricConfig(class_name="FalseNegatives"),
            tfma.MetricConfig(class_name="TrueNegatives")
        ])
    ]
)
```
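With the shared model and the EvalConfig in place, we can run the analysis. The following call is a sketch that mirrors the second-model example later in this section; `_EVAL_DATA_FILE` and `_EVAL_RESULT_LOCATION` are the paths to the evaluation dataset and the output location:

```python
eval_result = tfma.run_model_analysis(
    eval_shared_model=eval_shared_model,
    eval_config=eval_config,
    data_location=_EVAL_DATA_FILE,
    output_path=_EVAL_RESULT_LOCATION,
    file_format='tfrecords'
)

# In a notebook, the resulting metrics can be rendered interactively:
tfma.view.render_slicing_metrics(eval_result)
```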
TFMA works as previously described in a Google Colab notebook, but a few extra steps are required to view the visualizations in a standalone Jupyter Notebook. Install and enable the TFMA notebook extension with (the exact commands may vary slightly with your TFMA version):
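$ jupyter nbextension enable --py widgetsnbextension
$ jupyter nbextension install --py --symlink tensorflow_model_analysis
$ jupyter nbextension enable --py tensorflow_model_analysis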
We can also use TFMA to compare our metrics across multiple models. For example, these may be the same model trained on different datasets, or two models with different hyperparameters trained on the same dataset.
For the models we compare, we first need to generate an eval_result similar to the preceding code examples. We need to ensure we specify an output_path location where the evaluation results are saved. We use the same EvalConfig for both models so that we can calculate the same metrics:
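```python
# Analysis for the second model; the first model's eval_result is generated the
# same way, with its own EvalSharedModel and its own output_path.
eval_result_2 = tfma.run_model_analysis(
    eval_shared_model=eval_shared_model_2,
    eval_config=eval_config,
    data_location=_EVAL_DATA_FILE,
    output_path=_EVAL_RESULT_LOCATION_2,
    file_format='tfrecords'
)
```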
Then, we load them using the following code:
```python
eval_results_from_disk = tfma.load_eval_results(
    [_EVAL_RESULT_LOCATION, _EVAL_RESULT_LOCATION_2]
)
```
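The loaded results can then be compared visually in a notebook. A minimal sketch using TFMA's time series viewer (the default slicing renders metrics for the overall dataset):

```python
# Render the metrics for both models side by side.
tfma.view.render_time_series(eval_results_from_disk)
```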
In the next sections, we will describe how to use three projects for evaluating fairness in TensorFlow: TFMA, Fairness Indicators, and the What-If Tool.
Slicing Model Predictions in TFMA
The first step in evaluating your machine learning model for fairness is slicing your model's predictions by the groups you are interested in, for example gender, race, or country. These slices can be generated by TFMA or the Fairness Indicators tools.
To slice data in TFMA, a slicing column must be provided as a SlicingSpec; a spec with no specified arguments returns the entire dataset as a single slice.
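For example, to keep the overall results and also slice on a feature (the `product` feature here is purely illustrative), the slicing specs might look like this:

```python
slicing_specs = [
    tfma.SlicingSpec(),                          # no arguments: the whole dataset as one slice
    tfma.SlicingSpec(feature_keys=['product'])   # one slice per value of the 'product' feature
]
```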
Next, we use TFMA to evaluate the model and ask it to calculate metrics for a set of decision thresholds we supply. This is supplied to TFMA in the metrics_specs argument of the EvalConfig, along with any other metrics we wish to calculate:
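```python
# A sketch of such a spec: the FairnessIndicators metric receives the decision
# thresholds via its config string (the threshold values here are illustrative).
metrics_specs = [
    tfma.MetricsSpec(metrics=[
        tfma.MetricConfig(
            class_name='FairnessIndicators',
            config='{"thresholds": [0.25, 0.5, 0.75]}')
    ])
]
```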
Going Deeper with the What-If Tool
We can install the WIT with:
$ pip install witwidget
Next, we create a TFRecordDataset to load the data file. We sample 1,000 training examples and convert them to a list of TFExamples. The visualizations in the What-If Tool work well with this number of training examples, but they get harder to understand with a larger sample:
```python
eval_data = tf.data.TFRecordDataset(_EVAL_DATA_FILE, compression_type="GZIP")
subset = eval_data.take(1000)
eval_examples = [tf.train.Example.FromString(d.numpy()) for d in subset]
```
Next, we load the model and define a prediction function that takes in the list of TFExamples and returns the model's predictions:
```python
model = tf.saved_model.load(export_dir=_MODEL_DIR)
predict_fn = model.signatures['serving_default']

def predict(test_examples):
    test_examples = tf.constant(
        [example.SerializeToString() for example in test_examples])
    preds = predict_fn(examples=test_examples)
    return preds['outputs'].numpy()
```
Then we configure the WIT using:
```python
from witwidget.notebook.visualization import WitConfigBuilder

config_builder = WitConfigBuilder(eval_examples).set_custom_predict_fn(predict)
```
And we can view it in a notebook using:
```python
from witwidget.notebook.visualization import WitWidget

WitWidget(config_builder)
```
Partial dependence plots (PDPs) show the change in prediction results (the inference score) for different valid values of a feature. In this example, there is no change in the inference score across the company feature, showing that the predictions for this data point don't depend on the value of that feature. But for the company_response feature, the inference score does change, which shows that the model's prediction has some dependence on the value of that feature.