Model Analysis and Validation

Description

Classification Metrics
- True positives: Training examples that belong to this class and are correctly labelled as this class by the classifier. For example, if the true label is 1, and the predicted label is 1, the example would be a true positive
- False positives: Training examples that do not belong to this class and are incorrectly labelled as this class by the classifier. For example, if the true label is 0, and the predicted label is 1, the example would be a false positive.
- True negatives: Training examples that do not belong to this class and are correctly labelled as not in this class by the classifier. For example, if the true label is 0, and the predicted label is 0, the example would be a true negative.
- False negatives: Training examples that belong to this class and are incorrectly labelled as not in this class by the classifier. For example, if the true label is 1, and the predicted label is 0, the example would be a false negative.

	Predicted 1	Predicted 0
True value 1	True positives	False negatives
True value 0	False positives	True negatives

There are several other metrics for comparing models that combine these counts into a single number:
- Accuracy: is defined as (true postives + true negatives)/total examples, or the proportion of examples that were classifier correctly. This is an appropriate metric to use for a dataset where the positive and negative classes are equally balanced, but it can be misleading if the dataset is imbalanced.
- Precision: is defined as `true positives/(true negatives + false positives), or the proportion of examples predicted to be in the positive class that were classified correctly. So if a classifier has high precision, most of the examples it predicts as belonging to the positive class will indeed belong to the positive class.
- Recall: is defined as true positives/(true positives + false negatives), or the proportion of examples where the ground truth is positive that the classifier correctly identified. So if a classifier has high recall, it will correctly identify most of the examples that are truly in the positive class.
Regression Metrics: In a regression problem, the model predicts some numerical value for each training example, and this is compared with the actual value. Common regression metrics we can use in TFMA include:
- Mean absolute error (MAE)
  #### Actions
Install
```
$ pip install tensorflow-model-analysis
```
It takes a saved model and an evaluation dataset as input.

First, the SavedModel must be converted to an EvalSharedModel:


_MODEL_DIR = trainer.outputs.model.get()[0].uri + '/Format-Serving'
# _EVAL_DATA_FILE = example_gen.outputs.examples.get()[0].uri + '/Split_eval/data_tfrecord-00000-of-00001.gz'
_EVAL_DATA_FILE = '/content/tfx/CsvExampleGen/examples/1/Split-eval/data_tfrecord-00000-of-00001.gz'
_EVAL_RESULT_LOCATION = './eval_result_1000_steps'

import tensorflow_model_analysis as tfma import tensorflow as tf

eval_shared_model = tfma.default_eval_shared_model( eval_saved_model_path=_MODEL_DIR, tags=[tf.saved_model.SERVING] )

- Next, we provide an EvalConfig. In this step, we tell TFMA what our label is, provide any specifications for slicing the model by one of the features, and stipulate all the metrics we want TFMA to calculate and display:
```python
eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key='consumer_disputed')],
    slicing_specs=[tfma.SlicingSpec()],
    metrics_specs=[
      tfma.MetricsSpec(metrics=[
        tfma.MetricConfig(class_name="BinaryAccuracy"),
        tfma.MetricConfig(class_name="ExampleCount"),
        tfma.MetricConfig(class_name="FalsePositives"),
        tfma.MetricConfig(class_name="TruePositives"),
        tfma.MetricConfig(class_name="FalseNegatives"),
        tfma.MetricConfig(class_name="TrueNegatives")
      ])
    ]
)

Then, run the model analysis step:

eval_result = tfma.run_model_analysis(
eval_shared_model=eval_shared_model,
eval_config=eval_config,
data_location=_EVAL_DATA_FILE,
output_path=_EVAL_RESULT_LOCATION,
file_format='tfrecords'
)

And view the results in a Jupyter Notebook:

tfma.view.render_slicing_metrics(eval_result)

Analyzing TFLite Models

We can also analyze TFLite models in TFMA. In this case, the model type must be passed to the ModelSpec:

eval_config = tfma.EvalConfig(
model_specs=[tmfa.ModelSpec(label_key='my_label',
                          model_type=tfma.TF_LITE)],
...
)

Then, run the model analysis step:

eval_result = tfma.run_model_analysis(
eval_shared_model=eval_shared_model,
eval_config=eval_config,
data_location=_EVAL_DATA_FILE,
output_path=_EVAL_RESULT_LOCATION,
file_format='tfrecords')

And view the results in a Jupyter Notebook:

tfma.view.render_slicing_metrics(eval_result)

Using TFMA in a Jupyter Notebook
TFMA works as previously described in a Google Colab notebook. But a few extra steps are required to view the visualizations in a standalone Jupyter Notebook. Install and enable the TFMA notebook extension with:
```
$ jupyter nbextension enable --py widgetsnbextension
$ jupyter nbextension install --py \
--symlink tensorflow_model_analysis
$ jupyter nbextension enable --py tensorflow_model_analysis
```

All the metrics

metrics_specs = [
tfma.MetricsSpec(metrics=[
tfma.MetricConfig(class_name='BinaryAccuracy'),
tfma.MetricConfig(class_name='AUC'),
tfma.MetricConfig(class_name='ExampleCount'),
tfma.MetricConfig(class_name='Precision'),
tfma.MetricConfig(class_name='Recall')
])
]

Analyzing Multiple Models in TFMA
We can also use TFMA to compare our metrics across multiple models. For example, these may be the same model trained on different datasets, or two models with different hyperparameters trained on the same dataset.
For the models we compare, we first need to generate an eval_result similar to the preceding code examples. We need to ensure we specify an output_path location to save the models. We use the same EvalConfig for both models so that we can calculate the same metrics:
```
eval_shared_model_2 = tfma.default_eval_shared_model(
eval_saved_model_path=_MODEL_DIR,
tags=[tf.saved_model.SERVING]
)
```

eval_result_2 = tfma.run_model_analysis( eval_shared_model=eval_shared_model_2, eval_config=eval_config, data_location=_EVAL_DATA_FILE, output_path=_EVAL_RESULT_LOCATION_2, file_format='tfrecords' )

- Then, we load them using the following code:
```python
eval_results_from_disk = tfma.load_eval_results(
    [_EVAL_RESULT_LOCATION, _EVAL_RESULT_LOCATION_2]
)

And we can visualize them using:

tfma.view.render_time_series(eval_results_from_disk)

In the next sections, we will describe how to use three projects for evaluating fairness in TensorFlow: TFMA, Fariness Indicators, and the What-If Tool.
Slicing Model Predictions in TFMA
The first step in evaluating your machine learning model for fairness is slicing your model's predictions by the groups you are interested in-for example, gender, race, or country. These slices can be generated by TFMA or the Fairness Indicators tools.

To slice data in TFMA, a slicing column must be provided as a SliceSpec.

slices = [tfma.slicer.SingleSliceSpec(),
      tfma.slicer.SingleSliceSpec(columns=['product'])]
eval_config_viz = tfma.EvalConfig(
model_specs=[tfma.ModelSpec(label_key='consumer_disputed')],
slicing_specs=[tfma.SlicingSpec(), tfma.SlicingSpec(feature_keys=['product'])],
metrics_specs=[
  tfma.MetricsSpec(metrics=[
    tfma.MetricConfig(class_name="BinaryAccuracy"),
    tfma.MetricConfig(class_name="ExampleCount"),
    tfma.MetricConfig(class_name="FalsePositives"),
    tfma.MetricConfig(class_name="TruePositives"),
    tfma.MetricConfig(class_name="FalseNegatives"),
    tfma.MetricConfig(class_name="TrueNegatives")
  ])
]
)

SingleSliceSpec with no specified arguments returns the entire dataset/

Next, run the model analysis step with the slices specified:

eval_result = tfma.run_model_analysis(
eval_shared_model=eval_shared_model,
eval_config=eval_config_viz,
data_location=_EVAL_DATA_FILE,
output_path=_EVAL_RESULT_LOCATION,
file_format='tfrecords',
slice_spec=slices
)

And view the results

eval_result = tfma.run_model_analysis(
eval_shared_model=eval_shared_model,
eval_config=eval_config_viz,
data_location=_EVAL_DATA_FILE,
output_path=_EVAL_RESULT_LOCATION,
file_format='tfrecords',
slice_spec=slices
)

We install the TensorBoard Fairness Indicators plug-in via:
```
$ pip install tensorboard_plugin_fairness_indicators
```

Next, we use TFMA to evaluate the model and ask it to calculate metrics for a set of decision thresholds we supply. This is supplied to TFMA in the metrics_spec argument for the EvalConfig, along with any other metrics we wish to calculate:

eval_config = tfma.EvalConfig(
model_specs=[tfma.ModelSpec(label_key='consumer_disputed')],
slicing_specs=[tfma.SlicingSpec(), tfma.SlicingSpec(feature_keys=['product'])],
metrics_specs=[
  tfma.MetricsSpec(metrics=[
    tfma.MetricConfig(class_name="FairnessIndicators",
                       config='{"thresholds":[0.25, 0.5, 0.75]}')
  ])
]
)

Then run the model analysis step as before via tfma.run_model_analysis.

eval_result_fairness = tfma.run_model_analysis(
eval_shared_model=eval_shared_model,
eval_config=eval_config,
data_location=_EVAL_DATA_FILE,
output_path='./eval_result_fairness',
file_format='tfrecords',
slice_spec=slices
)

Next, write the TFMA evaluation result to a log directory so that it can be picked up by TensorBoard:
```
from tensorboard_plugin_fairness_indicators import summary_v2
```

writer = tf.summary.create_file_writer('./fairness_indicator_logs') with writer.as_default(): summary_v2.FairnessIndicators('./eval_result_fairness', step=1) writer.close()

- And load the result in TensorBoard to a Jupyter Notebook:

%load_ext tensorboard %tensorboard --logdir=./fairness_indicator_logs

- Going Deeper with the What-If tool
- We can install the WIT with:

$ pip install witwidget

- Next, we create a TFRecordDataset to load the data file. We sample 1,000 training examples and convert it to a list of TFExamples. The visualizations in the What-If-Tool work well with this number of training examples, but they get harder to understand with a larger sample:
```python
eval_data = tf.data.TFRecordDataset(_EVAL_DATA_FILE, compression_type="GZIP")
subset = eval_data.take(1000)
eval_examples = [tf.train.Example.FromString(d.numpy()) for d in subset]

Next, we load the model and define a prediction function that takes in the list of TFExamples and returns the model's predictions:
```
model = tf.saved_model.load(export_dir=_MODEL_DIR)
predict_fn = model.signatures['serving_default']
```

def predict(test_examples): test_examples = tf.constant([example.SerializeToString() for example in test_examples]) preds = predict_fn(examples=test_examples) return preds['outputs'].numpy()

- Then we configure the WIT using:
```python
from witwidget.notebook.visualization import WitConfigBuilder
config_builder = WitConfigBuilder(eval_examples).set_custom_predict_fn(predict)

And we can view it in a notebook using:

from witwidget.notebook.visualization import WitWidget
WitWidget(config_builder)

PDPs show the change in prediction results (the inference score) for different valid values of a feature. There is no change in the inference score across the company feature, showing that the predictions for this data point don't depend on the value of this feature. But for the company_response feature, there is a change in the inference score, which shows that the model prediction has some dependence on the value of the feature.
ResolverNode:
Estimate

Tests

chanelcolgate / hydroelectric-project

Model Analysis and Validation #16

Description

Estimate

Tests