chanelcolgate / hydroelectric-project


Data Preprocessing #14

Open chanelcolgate opened 2 years ago

chanelcolgate commented 2 years ago

#### Description

LABEL_KEY = 'consumer_disputed'

Feature name, feature dimensionality

ONE_HOT_FEATURES = { "product": 11, "sub_product": 45, "company_response": 5, "state": 60, "issue": 90 }

Feature name, bucket count

BUCKET_FEATURES = { "zip_code": 10 }

Feature name, value is unused

TEXT_FEATURES = { "consumer_complaint_narrative": None }

- Before we loop over these input feature dictionaries, let's define a few helper functions to transform the data efficiently. It is good practice to rename each feature by appending a suffix to its name (e.g., `_xf`). The suffix makes it easier to tell whether an error originates from an input or an output feature, and it prevents us from accidentally using a nontransformed feature in the actual model:
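```python
import os

from tfx.components import Transform


def transformed_name(key):
    # Mark a feature as transformed by appending the '_xf' suffix.
    return key + '_xf'


# base_dir, example_gen, schema_gen, and context are assumed to be
# defined earlier (e.g., in the notebook linked under Tests).
transform_file = os.path.join(base_dir, 'components/transform.py')
transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=transform_file,
)
context.run(transform)
```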
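To make the "loop over these input feature dictionaries" step concrete, here is a minimal, pure-Python sketch. It only shows how the three dictionaries and `transformed_name` fit together. The helper `plan_transforms` and the kind labels (`"one_hot"`, `"bucket"`, `"text"`) are hypothetical names introduced for illustration and are not part of the project's `transform.py`.

```python
# Feature dictionaries copied from the issue description.
ONE_HOT_FEATURES = {
    "product": 11,
    "sub_product": 45,
    "company_response": 5,
    "state": 60,
    "issue": 90,
}
BUCKET_FEATURES = {"zip_code": 10}
TEXT_FEATURES = {"consumer_complaint_narrative": None}


def transformed_name(key):
    # Same helper as in the issue: append the '_xf' suffix.
    return key + '_xf'


def plan_transforms():
    """Hypothetical helper: map each transformed name to (kind, parameter)."""
    plan = {}
    # One-hot features carry their dimensionality.
    for key, dim in ONE_HOT_FEATURES.items():
        plan[transformed_name(key)] = ("one_hot", dim)
    # Bucket features carry their bucket count.
    for key, bucket_count in BUCKET_FEATURES.items():
        plan[transformed_name(key)] = ("bucket", bucket_count)
    # Text features have no parameter; the dictionary value is unused.
    for key in TEXT_FEATURES:
        plan[transformed_name(key)] = ("text", None)
    return plan


if __name__ == "__main__":
    for name, (kind, param) in plan_transforms().items():
        print(name, kind, param)
```

Every key in the resulting plan ends in `_xf`, so an untransformed input name cannot be looked up by mistake. In a real `preprocessing_fn`, each branch would presumably call a TensorFlow Transform operation (for example `tft.bucketize` for the bucket features) rather than record a tuple.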


#### Estimate
#### Tests
- [Data Preprocessing](https://colab.research.google.com/drive/1U_ELlx6BsFaEKw6zpLaDAydku3Vs_Htw#scrollTo=Q_903APKUDY1)