georgian-io-archive / foreshadow

An automatic machine learning system
https://foreshadow.readthedocs.io
Apache License 2.0

Serialization of DataPreparer pipeline. #132

Closed. cchoquette closed this issue 4 years ago.

cchoquette commented 5 years ago

Description

The DataPreparer should be able to be serialized and deserialized. The base serialization mixin already enables this naively (i.e., by calling the .serialize(deep=True) method on the DataPreparer). The output is currently a nested JSON object with many redundant/duplicated entries (e.g., column_sharer) and not-so-readable raw Python object tags (e.g., py/). The goal is to implement object-oriented serialization with custom filtering for each component, so that each component decides what should and should not be displayed (a rough sketch of this idea follows the list below).

For instance,

  1. NoTransform entries should be hidden; this functionality may want to be backported to the base serializer mixin.
  2. Each step will need a 'column view' whereby you can see which transformer was mapped to each column, at each step in the DataPreparer.
  3. column_sharer will be serialized multiple times, since each substep holds the shared instance of it. It should either be displayed only once or not at all.

One rule of thumb is that if the same instance would be duplicated, it should not be repeated in the serialized object.
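
As a rough sketch of the per-component filtering idea (hypothetical names and logic, not foreshadow's actual mixin): each serializable object decides which of its fields appear in the output by overriding a single hook, and steps that hold the shared column sharer simply hide it.

# Illustrative only: hypothetical mixin showing per-component filtering.
class SerializerMixinSketch:
    _hidden_fields = ()  # each subclass lists the fields it wants to hide

    def serialize(self, deep=True):
        out = {}
        for name, value in vars(self).items():
            if name in self._hidden_fields:
                continue  # e.g. drop the shared column_sharer here
            if deep and isinstance(value, SerializerMixinSketch):
                value = value.serialize(deep=True)
            out[name] = value
        return out


class PreparerStepSketch(SerializerMixinSketch):
    # Hide the shared ColumnSharer so it is not repeated in every step.
    _hidden_fields = ("column_sharer",)

    def __init__(self, column_sharer=None, transformer="NoTransform"):
        self.column_sharer = column_sharer
        self.transformer = transformer


print(PreparerStepSketch().serialize())  # {'transformer': 'NoTransform'}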

What should be included in the output

In a format that can represent the nested hierarchy of the data_preparer, the list of steps, and the transformers in each step, we want to include all the leaf parameters (of the objects just mentioned) that have changed from their default values.
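
One way to compute the "changed from defaults" set (a sketch using plain scikit-learn, not foreshadow's code) is to compare an estimator's current params against a freshly constructed instance of the same class:

from sklearn.preprocessing import RobustScaler

def changed_params(est):
    # Keep only the constructor parameters that differ from the defaults.
    defaults = type(est)().get_params(deep=False)
    current = est.get_params(deep=False)
    return {k: v for k, v in current.items() if v != defaults.get(k)}

print(changed_params(RobustScaler(quantile_range=(10.0, 90.0))))
# {'quantile_range': (10.0, 90.0)}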

Q: Then how does the user know which existing parameters they can potentially tune/override if we only output changed parameters? What if they are all default values? @adithyabsk could you comment on this?

Update based on discussion with @adithyabsk :

Success Criteria

Subtasks

TODO:

Estimate

jzhang-gp commented 5 years ago

Fixed the issue with recursive serialization on PrepareStep, which inherits/implements ConcreteSerializerMixin. The _method argument was introduced into the kwargs during the call in _make_serializable(), which breaks the {concrete}_serialize method (in this case dict_serialize) since it does not accept _method as a keyword argument.

The solution is to pop _method from kwargs and assign it to the method variable if it is present. This way we can pass the serialization method down recursively without breaking downstream code, and it avoids changing the current unit tests.

However, it also means that we have untested code branches and they need to be thoroughly tested before we can claim (de)serialization is working.
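
A minimal sketch of the fix described above (the body is illustrative; only the _make_serializable name and the _method/dict_serialize convention come from this comment, not foreshadow's actual code):

def _make_serializable(obj, method="dict", **kwargs):
    # Pop _method out of kwargs so concrete serializers such as
    # dict_serialize() never receive it as an unexpected keyword argument,
    # while recursive calls still inherit the chosen serialization method.
    method = kwargs.pop("_method", method)
    concrete = getattr(obj, "{}_serialize".format(method), None)
    if concrete is not None:
        return concrete(**kwargs)
    if isinstance(obj, dict):
        return {k: _make_serializable(v, method, **kwargs) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [_make_serializable(v, method, **kwargs) for v in obj]
    return obj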

jzhang-gp commented 5 years ago

Currently, calling .get_params() on the data preparer returns the following, which includes a lot of info it should not, as mentioned in #129. This in turn breaks deserialization.

{'cleaner_kwargs': None,
 'column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
 'data_cleaner': CleanerMapper(_parallel_process=ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('Flatten', Flatten(check_wrapped=True, transformer={'class_name': 'NoTransform'})), ('Cleaner', Cleaner(check_wrapped=False, transformer={'class_nam...False, transformer={'class_name': 'NoTransform'}))]), ['medv'])],
         transformer_weights=None),
       column_sharer=<foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>),
 'data_cleaner___parallel_process': ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('Flatten', Flatten(check_wrapped=True, transformer={'class_name': 'NoTransform'})), ('Cleaner', Cleaner(check_wrapped=False, transformer={'class_name': 'NoTransform'}))]), ['crim']), ('group: 1', DynamicPipeline(memory=None,..., ('Cleaner', Cleaner(check_wrapped=False, transformer={'class_name': 'NoTransform'}))]), ['medv'])],
         transformer_weights=None),
 'data_cleaner__column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
 'engineerer_kwargs': None,
 'feature_engineerer': FeatureEngineererMapper(_parallel_process=ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('FeatureEngineerer', FeatureEngineerer(check_wrapped=True,
         transformer={'class_name': 'NoTransform'}))]), ['crim', 'rm', 'age', 'dis', 'b',...}))]), ['zn', 'indus', 'chas', 'nox', 'rad', 'tax', 'ptratio'])],
         transformer_weights=None),
            column_sharer=<foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>),
 'feature_engineerer___parallel_process': ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('FeatureEngineerer', FeatureEngineerer(check_wrapped=True,
         transformer={'class_name': 'NoTransform'}))]), ['crim', 'rm', 'age', 'dis', 'b', 'lstat', 'medv']), ('group: 1', DynamicPipeline(memory=None,
        steps=[('FeatureEngineerer', FeatureEngineerer(check_wrapped=True,
         transformer={'class_name': 'NoTransform'}))]), ['zn', 'indus', 'chas', 'nox', 'rad', 'tax', 'ptratio'])],
         transformer_weights=None),
 'feature_engineerer__column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
 'feature_preprocessor': Preprocessor(_parallel_process=ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('Imputer', DFTransformer: Imputer), ('Scaler', Scaler(p_val=0.05,
    transformer={'memory': None, 'steps': [('box_cox', DFTransformer: BoxCox), ('r...blePipeline'},
          unique_num_cutoff=30))]), ['ptratio'])],
         transformer_weights=None),
       column_sharer=<foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>),
 'feature_preprocessor___parallel_process': ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('Imputer', DFTransformer: Imputer), ('Scaler', Scaler(p_val=0.05,
    transformer={'memory': None, 'steps': [('box_cox', DFTransformer: BoxCox), ('robust_scaler', DFTransformer: RobustScaler)], 'class_name': 'SerializablePip...tEncoder)], 'class_name': 'SerializablePipeline'},
          unique_num_cutoff=30))]), ['ptratio'])],
         transformer_weights=None),
 'feature_preprocessor__column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
 'feature_reducer': FeatureReducerMapper(_parallel_process=ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('FeatureReducer', FeatureReducer(check_wrapped=False, transformer={'class_name': 'NoTransform'}))]), ['crim', 'rm', 'age', 'dis', 'b', 'lstat', 'med...ptratio_17.6', 'ptratio_18.4', 'ptratio_19.6', 'ptratio_20.2'])],
         transformer_weights=None),
           column_sharer=<foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>),
 'feature_reducer___parallel_process': ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('FeatureReducer', FeatureReducer(check_wrapped=False, transformer={'class_name': 'NoTransform'}))]), ['crim', 'rm', 'age', 'dis', 'b', 'lstat', 'medv']), ('group: 1', DynamicPipeline(memory=None,
        steps=[('FeatureRedu...', 'ptratio_17.4', 'ptratio_13.0', 'ptratio_17.6', 'ptratio_18.4', 'ptratio_19.6', 'ptratio_20.2'])],
         transformer_weights=None),
 'feature_reducer__column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
 'intent': IntentMapper(_parallel_process=ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('IntentResolver', IntentResolver(transformer={'class_name': 'Numeric'}))]), ['crim']), ('group: 1', DynamicPipeline(memory=None,
        steps=[('In...ntResolver(transformer={'class_name': 'Numeric'}))]), ['medv'])],
         transformer_weights=None),
       column_sharer=<foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>),
 'intent___parallel_process': ParallelProcessor(collapse_index=True, n_jobs=1,
         transformer_list=[('group: 0', DynamicPipeline(memory=None,
        steps=[('IntentResolver', IntentResolver(transformer={'class_name': 'Numeric'}))]), ['crim']), ('group: 1', DynamicPipeline(memory=None,
        steps=[('IntentResolver', IntentResolver(transformer={'class_name': 'Categoric'}))]), [...      steps=[('IntentResolver', IntentResolver(transformer={'class_name': 'Numeric'}))]), ['medv'])],
         transformer_weights=None),
 'intent__column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
 'intent_kwargs': None,
 'modeler_kwargs': None,
 'preprocessor_kwargs': None,
 'reducer_kwargs': None,
 'y_var': None}
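
For comparison, with a plain scikit-learn Pipeline (not foreshadow), the noise comes from deep=True, which adds every nested <step>__<param> entry; deep=False returns only the estimator's own constructor arguments, assuming the standard get_params contract:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
print(sorted(pipe.get_params(deep=False)))  # only the Pipeline's own __init__ params
print(sorted(pipe.get_params(deep=True)))   # adds nested keys like 'clf__C' and 'scale__with_mean'
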
jzhang-gp commented 5 years ago

Currently working on the design of the serialized output. The starting point is the ColumnSharer: we are going to override the dict_serialize method, use jsonpickle, and manipulate the output there, then try to deserialize it back. This will give me a better understanding of how this process works.
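
A toy round trip showing the kind of output jsonpickle produces out of the box (hypothetical stand-in class; the real work is then post-processing away the py/... bookkeeping):

import jsonpickle

class ColumnSharerLike:
    # Stand-in for ColumnSharer: just a nested dict of per-column metadata.
    def __init__(self):
        self.store = {"domain": {"crim": "NoTransform"}}

encoded = jsonpickle.encode(ColumnSharerLike())
# e.g. '{"py/object": "__main__.ColumnSharerLike", "store": {"domain": {"crim": "NoTransform"}}}'
restored = jsonpickle.decode(encoded)  # works as long as the class is importable
print(restored.store)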

jzhang-gp commented 5 years ago

DataPreparer deserialization does not work due to an issue with unpickling some data; it fails with E IndexError: list index out of range in .pyenv/versions/3.6.8/envs/venv/lib/python3.6/site-packages/jsonpickle/unpickler.py:97: IndexError

jzhang-gp commented 5 years ago

Serialization of a bare-minimum data_preparer with only a data_cleaner, on a dataset with one column, is completed as follows:

{
  "cleaner_kwargs": null,
  "intent_kwargs": null,
  "engineerer_kwargs": null,
  "preprocessor_kwargs": null,
  "reducer_kwargs": null,
  "modeler_kwargs": null,
  "y_var": null,
  "steps": [
    {
      "data_cleaner": {
        "n_jobs": 1,
        "transformer_weights": null,
        "collapse_index": true,
        "_class": "CleanerMapper",
        "_method": "dict",
        "transformation_by_column_group": [
          {
            "crim": {
              "memory": null,
              "steps": [
                {
                  "Flatten": {
                    "check_wrapped": true,
                    "force_reresolve": false,
                    "keep_columns": false,
                    "name": "Flatten",
                    "should_resolve": false,
                    "transformer": {
                      "_class": "NoTransform",
                      "_method": "dict"
                    },
                    "y_var": false,
                    "_class": "Flatten",
                    "_method": "dict"
                  }
                },
                {
                  "Cleaner": {
                    "check_wrapped": false,
                    "force_reresolve": false,
                    "keep_columns": false,
                    "name": "Cleaner",
                    "should_resolve": false,
                    "transformer": {
                      "_class": "NoTransform",
                      "_method": "dict"
                    },
                    "y_var": false,
                    "_class": "Cleaner",
                    "_method": "dict"
                  }
                }
              ],
              "_class": "DynamicPipeline",
              "_method": "dict"
            }
          }
        ]
      }
    }
  ],
  "column_sharer": {
    "store": {
      "domain": {
        "crim": "NoTransform"
      }
    },
    "_class": "ColumnSharer",
    "_method": "dict"
  },
  "_class": "DataPreparer",
  "_method": "dict"
}

The main challenge is to sift through and remove all the verbose output from jsonpickle while keeping all the valuable content.
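
A rough sketch of that sifting step (hypothetical helper, not part of foreshadow): walk the serialized structure and drop bookkeeping keys such as _method, plus any jsonpickle py/... tags, keeping everything else:

NOISY_KEYS = {"_method"}

def prune(obj):
    # Recursively remove noisy keys from dicts while preserving the rest.
    if isinstance(obj, dict):
        return {
            k: prune(v)
            for k, v in obj.items()
            if k not in NOISY_KEYS and not k.startswith("py/")
        }
    if isinstance(obj, list):
        return [prune(v) for v in obj]
    return obj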

jzhang-gp commented 5 years ago

Currently there are three issues:

For item 3: strangely, if I re-enable the feature reducer in the DataPreparer, all of those encoded columns appear again in the serialized ColumnSharer. This definitely needs a deeper dive to understand how it works; turning it on and off and comparing the output helps.

Update: it is because of the self.check_resolve() method of AutoIntentMixin, which FeatureEngineerer, Preprocessor and FeatureReducer all implement. Since the generated columns are created after the preprocessing step, they are only populated into the column_sharer in the feature_reducer, when self.check_resolve() is invoked.

For item 1, the serialization round trip turns tuples into lists (pytest diff, trimmed):

E         'copy': True,
E       - 'quantile_range': (25.0, 75.0),
E       + 'quantile_range': [25.0, 75.0],
adithyabsk commented 5 years ago

> We are not serializing the fitted field, for example, the scaler has a field center_. It's not part of the serialized object and thus nowhere to be found after deserialization. In other words, we are only serializing and deserializing an unfitted pipeline. Is this intentional?

@JingJZ160 Yes, the idea is that there are three types of serialization:

  1. User-facing params, i.e., not the internal state (dict serialize)
  2. Internal state, i.e., saved params such as center_, in pickle:
     a. inline (in the serialized representation)
     b. symbolic (saved to disk with a link to the saved location)
jzhang-gp commented 5 years ago


Thanks for the clarification @adithyabsk

jzhang-gp commented 5 years ago

For item 1, we will need to implement a customized JSON encoder/decoder if we want to handle tuple serialization and deserialization: https://stackoverflow.com/questions/15721363/preserve-python-tuples-with-json

I'll create a new issue for this.
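
A minimal sketch along the lines of the linked answer (hypothetical helper names): tag tuples with a marker on encode and restore them via an object_hook on decode:

import json

def hint_tuples(obj):
    # Tag tuples so the JSON round trip can tell them apart from lists.
    if isinstance(obj, tuple):
        return {"__tuple__": True, "items": [hint_tuples(i) for i in obj]}
    if isinstance(obj, list):
        return [hint_tuples(i) for i in obj]
    if isinstance(obj, dict):
        return {k: hint_tuples(v) for k, v in obj.items()}
    return obj

def restore_tuples(dct):
    # object_hook: turn tagged dicts back into tuples on decode.
    return tuple(dct["items"]) if "__tuple__" in dct else dct

params = {"quantile_range": (25.0, 75.0), "copy": True}
decoded = json.loads(json.dumps(hint_tuples(params)), object_hook=restore_tuples)
assert decoded["quantile_range"] == (25.0, 75.0)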

jzhang-gp commented 4 years ago

No longer a concern. We may add a task to simplify the whole serialized JSON, or just generate a Python file containing the whole trained pipeline, like TPOT does.