Data Validation - Githubissues

Description

- Check for data anomalies
- Check that the data schema hasn't changed
- Check that the statistics of our new datasets still align with statistics from our previous training datasets
TFDV: The TensorFlow ecosystem offers a tool that can assist with data validation, TensorFlow Data Validation (TFDV). It is part of the TFX project and supports the kinds of checks listed above.
TFDV accepts two input formats to start the data validation: TensorFlow's TFRecord and CSV files. In common with other TFX components, it distributes the analysis using Apache Beam.
Actions
Installation: When we installed the tfx package, TFDV was already installed as a dependency. If we would like to use TFDV as a standalone package, we can install it with this command:
```
$ pip install tensorflow-data-validation
```
Generating Statistics from Your Data
```
import tensorflow_data_validation as tfdv

stats = tfdv.generate_statistics_from_csv(
    data_location='/content/data/penguins_size.csv',
    delimiter=','
)
```
- We can generate feature statistics from TFRecord files in a very similar way using the following code:
```
stats = tfdv.generate_statistics_from_tfrecord(
    data_location='/content/tfx/CsvExampleGen/examples/3/train/data_tfrecord-00000-of-00001.gz'
)
```
- For numerical features, TFDV computes for every feature:
  - The overall count of data records
  - The number of missing data records
  - The mean and standard deviation of the feature across the data records
  - The minimum and maximum value of the feature across the data records
  - The percentage of zero values of the feature across the data records
- For categorical features, TFDV provides:
  - The overall count of data records
  - The percentage of missing data records
  - The number of unique records
  - The average string length of all records of a feature
  - For a category, TFDV determines the sample count for each label and its rank
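
To make the statistics listed above concrete, here is a minimal pure-Python sketch that computes the same kinds of summaries for one numeric and one categorical column. It is not TFDV's implementation: the function names, the sample values, the use of `None` for missing records, and the choice of population standard deviation are all assumptions made for illustration.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Summarize a numeric column; None marks a missing record."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.fmean(present),
        "std_dev": statistics.pstdev(present),  # population std dev (assumption)
        "min": min(present),
        "max": max(present),
        "pct_zeros": 100.0 * sum(v == 0 for v in present) / len(present),
    }


def categorical_summary(values):
    """Summarize a string column; None marks a missing record."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "count": len(values),
        "pct_missing": 100.0 * (len(values) - len(present)) / len(values),
        "unique": len(counts),
        "avg_str_len": sum(len(v) for v in present) / len(present),
        # most_common() sorts labels by frequency, so the list position is the rank.
        "ranked_labels": counts.most_common(),
    }


# Made-up sample values, not taken from the penguins dataset.
body_mass = [3750.0, 3800.0, None, 0.0, 3650.0]
species = ["Adelie", "Gentoo", "Adelie", None, "Chinstrap"]

num = numeric_summary(body_mass)
cat = categorical_summary(species)
print(num)
print(cat)
```

In practice `tfdv.visualize_statistics(stats)` shows these figures for every feature at once; the sketch only illustrates what each number means.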
Generating Schema from Your Data
```
schema = tfdv.infer_schema(stats)
```
```
tfdv.display_schema(schema)
```
- In this visualization, Presence indicates whether the feature must be present in 100% of data examples (required) or not (optional). Valency is the number of values required per training example. For a categorical feature, single means each training example must have exactly one category for the feature.
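
As a rough illustration of Presence and Valency, the following sketch measures both for a feature across a handful of examples. It is a simplified model, not TFDV code: the `presence_and_valency` helper, the dictionary-of-lists example layout, and the sample records are hypothetical.

```python
def presence_and_valency(examples, feature):
    """Return (fraction of examples containing the feature, set of value counts seen)."""
    value_counts = []
    for example in examples:
        values = example.get(feature)
        if values:  # a missing key or an empty list counts as absent
            value_counts.append(len(values))
    fraction = len(value_counts) / len(examples)
    return fraction, set(value_counts)


# Hypothetical examples; each feature maps to a list of values.
examples = [
    {"island": ["Biscoe"], "sex": ["MALE"]},
    {"island": ["Dream"]},
    {"island": ["Torgersen"], "sex": ["FEMALE"]},
    {"island": ["Biscoe"], "sex": ["FEMALE"]},
]

island_fraction, island_valency = presence_and_valency(examples, "island")
sex_fraction, sex_valency = presence_and_valency(examples, "sex")
print(island_fraction, island_valency)
print(sex_fraction, sex_valency)
```

Here `island` appears in every example with exactly one value, which corresponds to required and single, while `sex` is missing from one example and would be optional.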

Comparing Datasets


```
train_stats = tfdv.generate_statistics_from_tfrecord(
    data_location='/content/tfx/CsvExampleGen/examples/3/train/data_tfrecord-00000-of-00001.gz'
)
val_stats = tfdv.generate_statistics_from_tfrecord(
    data_location='/content/tfx/CsvExampleGen/examples/3/test/data_tfrecord-00000-of-00001.gz'
)

tfdv.visualize_statistics(
    lhs_statistics=val_stats,
    rhs_statistics=train_stats,
    lhs_name='VAL_DATASET',
    rhs_name='TRAIN_DATASET'
)
```

- Anomalies can be detected using the following code:
```python
anomalies = tfdv.validate_statistics(statistics=val_stats, schema=schema)
```

And we can then display the anomalies with:
```
tfdv.display_anomalies(anomalies)
```
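
Conceptually, validation compares what is observed in a dataset with what the schema allows. The sketch below shows that idea in plain Python for two kinds of anomaly, a feature that is present too rarely and a categorical value outside the expected domain. It is a simplified stand-in, not how TFDV is implemented: the dictionary-based schema, the `find_anomalies` helper, and the sample records are all hypothetical.

```python
def find_anomalies(schema, examples):
    """Return {feature: description} for every feature that violates the schema."""
    anomalies = {}
    for name, spec in schema.items():
        values = [ex[name] for ex in examples if ex.get(name) is not None]
        fraction = len(values) / len(examples)
        required = spec["min_fraction"]
        if fraction < required:
            anomalies[name] = (
                f"present in {fraction:.0%} of examples, below the required {required:.0%}"
            )
            continue
        # Only features with a declared domain are checked for unexpected values.
        unexpected = sorted(set(values) - spec["domain"]) if "domain" in spec else []
        if unexpected:
            anomalies[name] = f"unexpected values: {unexpected}"
    return anomalies


# Hypothetical schema and validation records.
schema_sketch = {
    "species": {"min_fraction": 1.0, "domain": {"Adelie", "Gentoo", "Chinstrap"}},
    "sex": {"min_fraction": 0.9, "domain": {"MALE", "FEMALE"}},
}
val_examples = [
    {"species": "Adelie", "sex": "MALE"},
    {"species": "Emperor", "sex": "FEMALE"},
    {"species": "Gentoo", "sex": None},
]

anomalies_sketch = find_anomalies(schema_sketch, val_examples)
for feature, description in anomalies_sketch.items():
    print(feature, "->", description)
```

The real `tfdv.validate_statistics` works on precomputed statistics rather than raw records and reports many more anomaly types.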
Updating the Schema: The preceding anomaly check shows how to detect deviations from the schema that is autogenerated from our dataset. Another use case for TFDV is setting the schema manually according to our domain knowledge of the data. Taking the sub_issue feature as an example, if we decide that this feature must be present in at least 90% of our training examples, we can update the schema to reflect this.
First, we need to load the schema from its serialized location:
```
schema = tfdv.load_schema_text(schema_location)
```

Then, we update this particular feature so that it is required in at least 90% of cases:

```
sub_issue_feature = tfdv.get_feature(schema, 'sub_issue')
sub_issue_feature.presence.min_fraction = 0.9
```

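We could also restrict a categorical feature's domain, for example by removing Alaska from the list of US states. Once we are happy with the schema, we write it back to its serialized location: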
```
tfdv.write_schema_text(schema, schema_location)
```
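
The notes above mention removing Alaska from the list of US states but do not show the code for it. A minimal sketch follows, assuming a categorical feature hypothetically named `state` whose domain contains the value `'AK'`; neither name is confirmed by the notes. It relies on `tfdv.get_domain`, which returns the feature's domain so that its `value` list can be edited, and it would be applied before the schema is written out.

```python
def remove_domain_value(schema, feature_name, value):
    """Remove one allowed value from a categorical feature's domain, in place."""
    # Imported lazily so the helper can be defined without TFDV installed.
    import tensorflow_data_validation as tfdv

    domain = tfdv.get_domain(schema, feature_name)
    domain.value.remove(value)  # fails if the value is not in the domain
    return schema


# Hypothetical usage, before calling tfdv.write_schema_text:
# schema = remove_domain_value(schema, 'state', 'AK')
```

After this change, any example carrying the removed value would be reported as an anomaly on the next validation run.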

We then need to revalidate the statistics to view the updated anomalies:

```
updated_anomalies = tfdv.validate_statistics(statistics=val_stats, schema=schema)
tfdv.display_anomalies(updated_anomalies)
```

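Data Skew and Drift: TFDV provides a built-in "skew comparator" that detects large differences between the statistics of two datasets, such as training and serving data. A corresponding "drift comparator" compares a dataset against the statistics of an earlier span of the same data.

L-infinity Norm: Both comparators can use the L-infinity norm, which for a categorical feature is the largest absolute difference between the relative frequencies of any single value in the two datasets. An anomaly is reported when this distance exceeds the threshold set on the feature.

```
# Skew: compare training statistics against serving statistics.
tfdv.get_feature(schema, "\"population\"").skew_comparator.infinity_norm.threshold = 0.01
skew_anomalies = tfdv.validate_statistics(
    statistics=train_stats, schema=schema, serving_statistics=val_stats)

# Drift: compare against statistics from an earlier span of data,
# passed as previous_statistics (val_stats stands in for it here).
tfdv.get_feature(schema, "\"population\"").drift_comparator.infinity_norm.threshold = 0.01
drift_anomalies = tfdv.validate_statistics(
    statistics=train_stats, schema=schema, previous_statistics=val_stats)
```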
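
To show what the L-infinity threshold measures, here is a small pure-Python sketch that computes the distance between the value frequencies of one categorical feature in two datasets. It is an illustration under assumptions, not TFDV's code: the helper names and the sample label counts are made up, and only the 0.01 threshold is taken from the snippet above.

```python
from collections import Counter


def value_frequencies(values):
    """Map each label to its relative frequency in the column."""
    counts = Counter(values)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def l_infinity_distance(values_a, values_b):
    """Largest absolute difference in relative frequency over all labels."""
    freq_a = value_frequencies(values_a)
    freq_b = value_frequencies(values_b)
    labels = set(freq_a) | set(freq_b)
    return max(abs(freq_a.get(label, 0.0) - freq_b.get(label, 0.0)) for label in labels)


# Made-up label distributions for a training set and a serving set.
train = ["Adelie"] * 50 + ["Gentoo"] * 30 + ["Chinstrap"] * 20
serving = ["Adelie"] * 40 + ["Gentoo"] * 35 + ["Chinstrap"] * 25

distance = l_infinity_distance(train, serving)
threshold = 0.01
is_anomalous = distance > threshold
print(f"L-infinity distance: {distance:.2f}, anomalous: {is_anomalous}")
```

The largest shift here is in the share of `Adelie`, from 50% to 40%, so the distance is about 0.1 and a threshold of 0.01 would flag the feature.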

Integrating TFDV into Your Machine Learning Pipeline
TFX provides a pipeline component called StatisticsGen, which accepts the output of the previous ExampleGen components as input and then performs the generation of statistics:
```
from tfx.components import StatisticsGen

# Assumes `context` (an InteractiveContext) and `example_gen` were created earlier.
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
context.run(statistics_gen)
context.show(statistics_gen.outputs['statistics'])
```

- Generating our schema is just as easy as generating the statistics:
```python
from tfx.components import SchemaGen

schema_gen = SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=True
)
context.run(schema_gen)
```

With the statistics and schema in place, we can now validate our new dataset:

```python
from tfx.components import ExampleValidator

example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema']
)
context.run(example_validator)
```

chanelcolgate / hydroelectric-project

Data Validation #13
