Check that the statistics of our new datasets still align with statistics from our previous training datasets
TFDV: The TensorFlow ecosystem offers a tool for data validation, TensorFlow Data Validation (TFDV). It is part of the TFX project. TFDV allows you to perform the kind of analyses we discussed previously.
TFDV accepts two input formats to start the data validation: TensorFlow's TFRecord and CSV files. In common with other TFX components, it distributes the analysis using Apache Beam.
Actions
Installation: When we installed the tfx package, TFDV was already installed as a dependency. If we would like to use TFDV as a standalone package, we can install it with this command:
$ pip install tensorflow-data-validation
Generating Statistics from Your Data
import tensorflow_data_validation as tfdv
stats = tfdv.generate_statistics_from_csv(
    data_location='/content/data/penguins_size.csv',
    delimiter=','
)
We can generate feature statistics from TFRecord files in a very similar way using the following code:
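A minimal sketch of that call, assuming a TFRecord file of serialized `tf.train.Example` records. The helper name `tfrecord_statistics` and the path in the usage comment are hypothetical; the import sits inside the function only so that the snippet can be loaded where TFDV is not installed.

```python
def tfrecord_statistics(tfrecord_path):
    """Compute TFDV statistics for a TFRecord file of tf.train.Example records."""
    # Imported lazily so this helper can be defined without TFDV installed.
    import tensorflow_data_validation as tfdv

    return tfdv.generate_statistics_from_tfrecord(data_location=tfrecord_path)


# Hypothetical usage; replace the path with your own TFRecord file:
# stats = tfrecord_statistics('/content/data/penguins_size.tfrecord')
```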
For numerical features, TFDV computes for every feature:
The overall count of data records
The number of missing data records
The mean and standard deviation of the feature across the data records
The minimum and maximum value of the feature across the data records
The percentage of zero values of the feature across the data records
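To make these quantities concrete, here is a plain-Python sketch that computes them for a single feature. It is an illustration only, not TFDV's implementation: the function name, the sample values, the use of `None` for a missing record, the population standard deviation, and the choice to take the zero percentage over non-missing records are all assumptions.

```python
import math


def numeric_summary(values):
    """Summarize one numerical feature; None marks a missing record."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError('feature has no non-missing values')
    mean = sum(present) / len(present)
    # Population standard deviation (an assumption made for this sketch).
    std_dev = math.sqrt(sum((v - mean) ** 2 for v in present) / len(present))
    return {
        'count': len(values),
        'missing': len(values) - len(present),
        'mean': mean,
        'std_dev': std_dev,
        'min': min(present),
        'max': max(present),
        # Share of zeros among the non-missing records (also an assumption).
        'pct_zeros': 100.0 * sum(1 for v in present if v == 0) / len(present),
    }


# Made-up sample values for illustration.
summary = numeric_summary([0.0, 2.0, None, 4.0, 6.0])
print(summary)
```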
For categorical features, TFDV provides:
The overall count of data records
The percentage of missing data records
The number of unique records
The average string length of all records of a feature
For each category, TFDV determines its sample count and its rank
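The categorical quantities can be sketched in the same way. Again this is plain Python for illustration rather than TFDV's own code; the function name and the sample labels are made up, and `None` is assumed to mark a missing record.

```python
from collections import Counter


def categorical_summary(values):
    """Summarize one categorical (string) feature; None marks a missing record."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError('feature has no non-missing values')
    counts = Counter(present)
    # most_common() orders labels by descending count, so rank 1 is the most frequent.
    ranked = [
        (rank, label, n)
        for rank, (label, n) in enumerate(counts.most_common(), start=1)
    ]
    return {
        'count': len(values),
        'pct_missing': 100.0 * (len(values) - len(present)) / len(values),
        'unique': len(counts),
        'avg_str_len': sum(len(v) for v in present) / len(present),
        'ranked_labels': ranked,
    }


# Made-up sample labels for illustration.
summary = categorical_summary(['Adelie', 'Gentoo', 'Adelie', None, 'Chinstrap'])
print(summary)
```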
Generating Schema from Your Data
schema = tfdv.infer_schema(stats)
tfdv.display_schema(schema)
In this visualization, Presence means whether the feature must be present in 100% of data examples (required) or not (optional). Valency means the number of values required per training example. In the case of categorical features, single would mean each training example must have exactly one category for the feature.
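The two terms can be illustrated with a small plain-Python sketch. It only mirrors the definitions above and is not TFDV's schema inference logic; the function name, the dict-of-lists example layout, and the sample records are assumptions.

```python
def presence_and_valency(examples, feature):
    """Summarize how often `feature` occurs and how many values it carries.

    `examples` is a list of dicts mapping feature names to lists of values;
    a missing key or an empty list counts as "not present".
    """
    lengths = [len(ex[feature]) for ex in examples if ex.get(feature)]
    fraction = len(lengths) / len(examples)
    return {
        'fraction_present': fraction,
        # Required only if every example carries the feature.
        'presence': 'required' if fraction == 1.0 else 'optional',
        'min_values': min(lengths) if lengths else 0,
        'max_values': max(lengths) if lengths else 0,
        # Single valency: every example that has the feature has exactly one value.
        'single': bool(lengths) and set(lengths) == {1},
    }


# Made-up records for illustration.
examples = [
    {'species': ['Adelie'], 'island': ['Torgersen']},
    {'species': ['Gentoo'], 'island': []},
    {'species': ['Adelie']},
]
species = presence_and_valency(examples, 'species')
island = presence_and_valency(examples, 'island')
print(species['presence'], island['presence'])
```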
- Anomalies can be detected using the following code:
```python
anomalies = tfdv.validate_statistics(statistics=val_stats, schema=schema)
```
And we can then display the anomalies with:
```python
tfdv.display_anomalies(anomalies)
```
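Outside a notebook it can be handy to inspect the result programmatically. The sketch below assumes the object returned by `tfdv.validate_statistics` exposes an `anomaly_info` map from feature name to an entry with a `short_description` field; the helper name is hypothetical, and the stand-in object with its made-up entry exists only so that the snippet runs without TFDV.

```python
from types import SimpleNamespace


def summarize_anomalies(anomalies):
    """Map each anomalous feature name to its short description."""
    return {
        name: info.short_description
        for name, info in anomalies.anomaly_info.items()
    }


# Stand-in for a real anomalies result, with a made-up entry for illustration.
fake_anomalies = SimpleNamespace(
    anomaly_info={
        'sex': SimpleNamespace(short_description='Unexpected string values'),
    }
)
report = summarize_anomalies(fake_anomalies)
print(report)
```

An empty dictionary means no anomalies were reported, which makes the helper easy to use as a gate in an automated check.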
Updating the Schema: The preceding anomaly protocol shows us how to detect deviations from the schema that is autogenerated from our dataset. Another use case for TFDV is setting the schema manually according to our domain knowledge of the data. Taking the sub_issue feature discussed previously, if we decide that this feature must be present in at least 90% of our training examples, we can update the schema to reflect this.
First, we need to load the schema from its serialized location:
schema = tfdv.load_schema_text(schema_location)
Then, we update this particular feature so that it is required in 90% of cases:
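A minimal sketch of this update, using `tfdv.get_feature` to look up the feature and then setting `presence.min_fraction`. The helper name `require_presence` is hypothetical, and the import is placed inside the function only so that the snippet can be loaded without TFDV installed.

```python
def require_presence(schema, feature_name, min_fraction):
    """Require `feature_name` to be present in at least `min_fraction` of examples."""
    # Imported lazily so this helper can be defined without TFDV installed.
    import tensorflow_data_validation as tfdv

    feature = tfdv.get_feature(schema, feature_name)
    feature.presence.min_fraction = min_fraction
    return feature


# Usage with the schema loaded in the previous step:
# require_presence(schema, 'sub_issue', 0.9)
# tfdv.write_schema_text(schema, schema_location)  # optionally persist the change
```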
Integrating TFDV into Your Machine Learning Pipeline
TFX provides a pipeline component called StatisticsGen, which accepts the output of the previous ExampleGen components as input and then generates the statistics.
- Generating our schema is just as easy as generating the statistics:
```python
from tfx.components import SchemaGen

schema_gen = SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=True
)
context.run(schema_gen)
```
With the statistics and schema in place, we can now validate our new dataset:
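A minimal sketch of this step, assuming the TFX `ExampleValidator` component and the `statistics_gen` and `schema_gen` components created earlier. The helper name `build_example_validator` is hypothetical, and the import is placed inside the function only so that the snippet can be loaded without TFX installed.

```python
def build_example_validator(statistics_gen, schema_gen):
    """Create an ExampleValidator that checks the statistics against the schema."""
    # Imported lazily so this helper can be defined without TFX installed.
    from tfx.components import ExampleValidator

    return ExampleValidator(
        statistics=statistics_gen.outputs['statistics'],
        schema=schema_gen.outputs['schema'],
    )


# Usage in the interactive context used above:
# example_validator = build_example_validator(statistics_gen, schema_gen)
# context.run(example_validator)
```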
Description
Actions
tfdv.visualize_statistics(
    lhs_statistics=val_stats,
    rhs_statistics=train_stats,
    lhs_name='VAL_DATASET',
    rhs_name='TRAIN_DATASET'
)
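from tfx.components import StatisticsGen

# example_gen and context come from the earlier pipeline steps.
statistics_gen = StatisticsGen(
    examples=example_gen.outputs['examples']
)
context.run(statistics_gen)
context.show(statistics_gen.outputs['statistics'])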
Estimate
Tests