georgian-io-archive / foreshadow

An automatic machine learning system
https://foreshadow.readthedocs.io
Apache License 2.0

Serialization of DataPreparer pipeline. #132

Closed. cchoquette closed this issue 4 years ago.

cchoquette commented 5 years ago

Description

The DataPreparer should be able to be serialized and deserialized. The base serialization mixin already enables this naively (i.e., by calling the .serialize(deep=True) method on the DataPreparer). The output is currently a nested JSON object with many redundant/duplicated entries (e.g., column_sharer) and not-so-readable raw Python object tags (e.g., py/). The goal is to implement object-oriented serialization with custom filtering for each component, so that each component decides what should and should not be displayed (a rough sketch of this idea follows the list below).

For instance,

  1. NoTransform entries should be hidden; this functionality may want to be backported to the base serializer mixin.
  2. Each step will need a 'column view' whereby you can see which transformer was mapped to each column, at each step in the DataPreparer.
  3. column_sharer will be serialized multiple times, since each substep holds the shared instance of it. It should either be displayed only once or not at all.

One rule of thumb is that if the same instance would be duplicated, it should not be repeated in the serialized object.
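
As a rough sketch of the per-component filtering idea (hypothetical names and logic, not foreshadow's actual mixin): each serializable object decides which of its fields appear in the output by overriding a single hook, and steps that hold the shared column sharer simply hide it.

# Illustrative only: hypothetical mixin showing per-component filtering.
class SerializerMixinSketch:
    _hidden_fields = ()  # each subclass lists the fields it wants to hide

    def serialize(self, deep=True):
        out = {}
        for name, value in vars(self).items():
            if name in self._hidden_fields:
                continue  # e.g. drop the shared column_sharer here
            if deep and isinstance(value, SerializerMixinSketch):
                value = value.serialize(deep=True)
            out[name] = value
        return out


class PreparerStepSketch(SerializerMixinSketch):
    # Hide the shared ColumnSharer so it is not repeated in every step.
    _hidden_fields = ("column_sharer",)

    def __init__(self, column_sharer=None, transformer="NoTransform"):
        self.column_sharer = column_sharer
        self.transformer = transformer


print(PreparerStepSketch().serialize())  # {'transformer': 'NoTransform'}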

What should be included in the output

In a format that can represent the nested hierarchy of the data_preparer, the list of steps, and the transformers in each step, we want to include all the leaf parameters (of the objects just mentioned) that have changed from their default values.
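
One way to compute the "changed from defaults" set (a sketch using plain scikit-learn, not foreshadow's code) is to compare an estimator's current params against a freshly constructed instance of the same class:

from sklearn.preprocessing import RobustScaler

def changed_params(est):
    # Keep only the constructor parameters that differ from the defaults.
    defaults = type(est)().get_params(deep=False)
    current = est.get_params(deep=False)
    return {k: v for k, v in current.items() if v != defaults.get(k)}

print(changed_params(RobustScaler(quantile_range=(10.0, 90.0))))
# {'quantile_range': (10.0, 90.0)}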

Q: Then how does the user know which existing parameters they can potentially tune/override if we only output changed parameters? What if they are all default values? @adithyabsk could you comment on this?

Update based on discussion with @adithyabsk :

Success Criteria

Subtasks

TODO:

Estimate

jzhang-gp commented 5 years ago

Fixed the issue with recursive serialization on PrepareStep, which inherits/implements ConcreteSerializerMixin. The _method argument was introduced into the kwargs during the call in _make_serializable(), which breaks the {concrete}_serialize method (in this case dict_serialize) since it does not accept _method as a keyword argument.

The solution is to pop _method from kwargs and assign it to the method variable if it is present. This way we can pass the serialization method down recursively without breaking downstream code, and it avoids changing the current unit tests.

However, it also means that we have untested code branches and they need to be thoroughly tested before we can claim (de)serialization is working.
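
A minimal sketch of the fix described above (the body is illustrative; only the _make_serializable name and the _method/dict_serialize convention come from this comment, not foreshadow's actual code):

def _make_serializable(obj, method="dict", **kwargs):
    # Pop _method out of kwargs so concrete serializers such as
    # dict_serialize() never receive it as an unexpected keyword argument,
    # while recursive calls still inherit the chosen serialization method.
    method = kwargs.pop("_method", method)
    concrete = getattr(obj, "{}_serialize".format(method), None)
    if concrete is not None:
        return concrete(**kwargs)
    if isinstance(obj, dict):
        return {k: _make_serializable(v, method, **kwargs) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [_make_serializable(v, method, **kwargs) for v in obj]
    return obj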

jzhang-gp commented 5 years ago

Currently, calling .get_params() on the data preparer returns the following, which includes a lot of info it should not, as mentioned in #129. This in turn breaks deserialization.

{'cleaner_kwargs': None,
 'column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
 'data_cleaner': CleanerMapper(_parallel_process=ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('Flatten', Flatten(check_wrapped=True, transformer={'class_name': 'NoTransform'})), ('Cleaner', Cleaner(check_wrapped=False, transformer={'class_nam...False, transformer={'class_name': 'NoTransform'}))]), ['medv'])],
         transformer_weights=None),
       column_sharer=<foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>),
 'data_cleaner___parallel_process': ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('Flatten', Flatten(check_wrapped=True, transformer={'class_name': 'NoTransform'})), ('Cleaner', Cleaner(check_wrapped=False, transformer={'class_name': 'NoTransform'}))]), ['crim']), ('group: 1', DynamicPipeline(memory=None,..., ('Cleaner', Cleaner(check_wrapped=False, transformer={'class_name': 'NoTransform'}))]), ['medv'])],
         transformer_weights=None),
 'data_cleaner__column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
 'engineerer_kwargs': None,
 'feature_engineerer': FeatureEngineererMapper(_parallel_process=ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('FeatureEngineerer', FeatureEngineerer(check_wrapped=True,
         transformer={'class_name': 'NoTransform'}))]), ['crim', 'rm', 'age', 'dis', 'b',...}))]), ['zn', 'indus', 'chas', 'nox', 'rad', 'tax', 'ptratio'])],
         transformer_weights=None),
            column_sharer=<foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>),
 'feature_engineerer___parallel_process': ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('FeatureEngineerer', FeatureEngineerer(check_wrapped=True,
         transformer={'class_name': 'NoTransform'}))]), ['crim', 'rm', 'age', 'dis', 'b', 'lstat', 'medv']), ('group: 1', DynamicPipeline(memory=None,
        steps=[('FeatureEngineerer', FeatureEngineerer(check_wrapped=True,
         transformer={'class_name': 'NoTransform'}))]), ['zn', 'indus', 'chas', 'nox', 'rad', 'tax', 'ptratio'])],
         transformer_weights=None),
 'feature_engineerer__column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
 'feature_preprocessor': Preprocessor(_parallel_process=ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('Imputer', DFTransformer: Imputer), ('Scaler', Scaler(p_val=0.05,
    transformer={'memory': None, 'steps': [('box_cox', DFTransformer: BoxCox), ('r...blePipeline'},
          unique_num_cutoff=30))]), ['ptratio'])],
         transformer_weights=None),
       column_sharer=<foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>),
 'feature_preprocessor___parallel_process': ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('Imputer', DFTransformer: Imputer), ('Scaler', Scaler(p_val=0.05,
    transformer={'memory': None, 'steps': [('box_cox', DFTransformer: BoxCox), ('robust_scaler', DFTransformer: RobustScaler)], 'class_name': 'SerializablePip...tEncoder)], 'class_name': 'SerializablePipeline'},
          unique_num_cutoff=30))]), ['ptratio'])],
         transformer_weights=None),
 'feature_preprocessor__column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
 'feature_reducer': FeatureReducerMapper(_parallel_process=ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('FeatureReducer', FeatureReducer(check_wrapped=False, transformer={'class_name': 'NoTransform'}))]), ['crim', 'rm', 'age', 'dis', 'b', 'lstat', 'med...ptratio_17.6', 'ptratio_18.4', 'ptratio_19.6', 'ptratio_20.2'])],
         transformer_weights=None),
           column_sharer=<foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>),
 'feature_reducer___parallel_process': ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('FeatureReducer', FeatureReducer(check_wrapped=False, transformer={'class_name': 'NoTransform'}))]), ['crim', 'rm', 'age', 'dis', 'b', 'lstat', 'medv']), ('group: 1', DynamicPipeline(memory=None,
        steps=[('FeatureRedu...', 'ptratio_17.4', 'ptratio_13.0', 'ptratio_17.6', 'ptratio_18.4', 'ptratio_19.6', 'ptratio_20.2'])],
         transformer_weights=None),
 'feature_reducer__column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
 'intent': IntentMapper(_parallel_process=ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('IntentResolver', IntentResolver(transformer={'class_name': 'Numeric'}))]), ['crim']), ('group: 1', DynamicPipeline(memory=None,
        steps=[('In...ntResolver(transformer={'class_name': 'Numeric'}))]), ['medv'])],
         transformer_weights=None),
       column_sharer=<foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>),
 'intent___parallel_process': ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('IntentResolver', IntentResolver(transformer={'class_name': 'Numeric'}))]), ['crim']), ('group: 1', DynamicPipeline(memory=None,
        steps=[('IntentResolver', IntentResolver(transformer={'class_name': 'Categoric'}))]), [...      steps=[('IntentResolver', IntentResolver(transformer={'class_name': 'Numeric'}))]), ['medv'])],
         transformer_weights=None),
 'intent__column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
 'intent_kwargs': None,
 'modeler_kwargs': None,
 'preprocessor_kwargs': None,
 'reducer_kwargs': None,
 'y_var': None}
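
For comparison, with a plain scikit-learn Pipeline (not foreshadow), the noise comes from deep=True, which adds every nested <step>__<param> entry; deep=False returns only the estimator's own constructor arguments, assuming the standard get_params contract:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
print(sorted(pipe.get_params(deep=False)))  # only the Pipeline's own __init__ params
print(sorted(pipe.get_params(deep=True)))   # adds nested keys like 'clf__C' and 'scale__with_mean'
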
jzhang-gp commented 5 years ago

Currently working on the design of the serialized output. The starting point is the ColumnSharer: we are going to override the dict_serialize method, use jsonpickle, and manipulate the output there, then try to deserialize it back. This will give me a better understanding of how this process works.
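
A toy round trip showing the kind of output jsonpickle produces out of the box (hypothetical stand-in class; the real work is then post-processing away the py/... bookkeeping):

import jsonpickle

class ColumnSharerLike:
    # Stand-in for ColumnSharer: just a nested dict of per-column metadata.
    def __init__(self):
        self.store = {"domain": {"crim": "NoTransform"}}

encoded = jsonpickle.encode(ColumnSharerLike())
# e.g. '{"py/object": "__main__.ColumnSharerLike", "store": {"domain": {"crim": "NoTransform"}}}'
restored = jsonpickle.decode(encoded)  # works as long as the class is importable
print(restored.store)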

jzhang-gp commented 5 years ago

DataPreparer deserialization does not work due to an issue with unpickling some data; it fails with E IndexError: list index out of range in .pyenv/versions/3.6.8/envs/venv/lib/python3.6/site-packages/jsonpickle/unpickler.py:97: IndexError

jzhang-gp commented 5 years ago

Serialization of a bare-minimum data_preparer with only a data_cleaner, on a dataset with one column, is completed as follows:

{
  "cleaner_kwargs": null,
  "intent_kwargs": null,
  "engineerer_kwargs": null,
  "preprocessor_kwargs": null,
  "reducer_kwargs": null,
  "modeler_kwargs": null,
  "y_var": null,
  "steps": [
    {
      "data_cleaner": {
        "n_jobs": 1,
        "transformer_weights": null,
        "collapse_index": true,
        "_class": "CleanerMapper",
        "_method": "dict",
        "transformation_by_column_group": [
          {
            "crim": {
              "memory": null,
              "steps": [
                {
                  "Flatten": {
                    "check_wrapped": true,
                    "force_reresolve": false,
                    "keep_columns": false,
                    "name": "Flatten",
                    "should_resolve": false,
                    "transformer": {
                      "_class": "NoTransform",
                      "_method": "dict"
                    },
                    "y_var": false,
                    "_class": "Flatten",
                    "_method": "dict"
                  }
                },
                {
                  "Cleaner": {
                    "check_wrapped": false,
                    "force_reresolve": false,
                    "keep_columns": false,
                    "name": "Cleaner",
                    "should_resolve": false,
                    "transformer": {
                      "_class": "NoTransform",
                      "_method": "dict"
                    },
                    "y_var": false,
                    "_class": "Cleaner",
                    "_method": "dict"
                  }
                }
              ],
              "_class": "DynamicPipeline",
              "_method": "dict"
            }
          }
        ]
      }
    }
  ],
  "column_sharer": {
    "store": {
      "domain": {
        "crim": "NoTransform"
      }
    },
    "_class": "ColumnSharer",
    "_method": "dict"
  },
  "_class": "DataPreparer",
  "_method": "dict"
}

The main challenge is to sift through and remove all the verbose output from jsonpickle while keeping all the valuable content.
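
A rough sketch of that sifting step (hypothetical helper, not part of foreshadow): walk the serialized structure and drop bookkeeping keys such as _method, plus any jsonpickle py/... tags, keeping everything else:

NOISY_KEYS = {"_method"}

def prune(obj):
    # Recursively remove noisy keys from dicts while preserving the rest.
    if isinstance(obj, dict):
        return {
            k: prune(v)
            for k, v in obj.items()
            if k not in NOISY_KEYS and not k.startswith("py/")
        }
    if isinstance(obj, list):
        return [prune(v) for v in obj]
    return obj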

jzhang-gp commented 5 years ago

Currently there are three issues:

For item 3: strangely, if I re-enable the feature reducer in the DataPreparer, all of those encoded columns appear again in the serialized ColumnSharer. This definitely needs a deeper dive to understand how it works; turning it on and off and comparing the output helps.

Update: it is because of the self.check_resolve() method of AutoIntentMixin, which FeatureEngineerer, Preprocessor and FeatureReducer all implement. Since the generated columns are created after the preprocessing step, they are only populated into the column_sharer in the feature_reducer, when self.check_resolve() is invoked.

For item 1, the serialization round trip turns tuples into lists (pytest diff, trimmed):

E         'copy': True,
E       - 'quantile_range': (25.0, 75.0),
E       + 'quantile_range': [25.0, 75.0],
adithyabsk commented 5 years ago

> We are not serializing the fitted field, for example, the scaler has a field center_. It's not part of the serialized object and thus nowhere to be found after deserialization. In other words, we are only serializing and deserializing an unfitted pipeline. Is this intentional?

@JingJZ160 Yes, the idea is that there are three types of serialization:

  1. User-facing params, i.e., not the internal state (dict serialize)
  2. Internal state, i.e., saved params such as center_, in pickle:
     a. inline (in the serialized representation)
     b. symbolic (saved to disk with a link to the saved location)
jzhang-gp commented 5 years ago


Thanks for the clarification @adithyabsk

jzhang-gp commented 5 years ago

For item 1, we will need to implement a customized JSON encoder/decoder if we want to handle tuple serialization and deserialization: https://stackoverflow.com/questions/15721363/preserve-python-tuples-with-json

I'll create a new issue for this.
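
A minimal sketch along the lines of the linked answer (hypothetical helper names): tag tuples with a marker on encode and restore them via an object_hook on decode:

import json

def hint_tuples(obj):
    # Tag tuples so the JSON round trip can tell them apart from lists.
    if isinstance(obj, tuple):
        return {"__tuple__": True, "items": [hint_tuples(i) for i in obj]}
    if isinstance(obj, list):
        return [hint_tuples(i) for i in obj]
    if isinstance(obj, dict):
        return {k: hint_tuples(v) for k, v in obj.items()}
    return obj

def restore_tuples(dct):
    # object_hook: turn tagged dicts back into tuples on decode.
    return tuple(dct["items"]) if "__tuple__" in dct else dct

params = {"quantile_range": (25.0, 75.0), "copy": True}
decoded = json.loads(json.dumps(hint_tuples(params)), object_hook=restore_tuples)
assert decoded["quantile_range"] == (25.0, 75.0)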

jzhang-gp commented 4 years ago

No longer a concern. We may add a task to simplify the whole serialized JSON, or just generate a Python file containing the whole trained pipeline, like TPOT does.