Fixed the issue with recursive serialization on `PreparerStep`, which inherits/implements the `ConcreteSerializerMixin`. The `_method` argument was introduced into the kwargs during the call in `_make_serializable()`, which breaks the `{concrete}_serialize` method (in this case `dict_serialize`), since it does not accept `_method` as a keyword argument.
The solution is to pop `_method` from the kwargs and assign it to the `method` variable if it is present. In this way we can pass the serialize method down recursively without breaking downstream code. This solution also avoids changing the current unit tests.
However, it also means that we have untested code branches, and they need to be thoroughly tested before we can claim (de)serialization is working.
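A minimal sketch of the fix described above. The helper name `_make_serializable` and the `{method}_serialize` dispatch come from this issue, but the exact foreshadow signatures here are assumptions:

```python
def _make_serializable(obj, **kwargs):
    """Dispatch to a concrete `{method}_serialize` without leaking `_method`.

    Sketch only: pop `_method` out of kwargs (defaulting to "dict") so that
    downstream serializers such as `dict_serialize`, which do not accept it,
    keep working, while the chosen strategy is still propagated recursively.
    """
    method = kwargs.pop("_method", "dict")
    serializer = getattr(obj, "{}_serialize".format(method))
    result = serializer(**kwargs)
    # Tag the output so deserialization knows which strategy was used,
    # matching the `_class`/`_method` keys seen in the JSON output below.
    result["_method"] = method
    result["_class"] = type(obj).__name__
    return result
```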
Currently, `.get_params()` on the data preparer shows the following output, which includes a lot of information that it should not, as mentioned in #129. This in turn breaks deserialization.
{'cleaner_kwargs': None,
'column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
'data_cleaner': CleanerMapper(_parallel_process=ParallelProcessor(collapse_index=True, n_jobs=1,
transformer_list=[('group: 0', DynamicPipeline(memory=None,
steps=[('Flatten', Flatten(check_wrapped=True, transformer={'class_name': 'NoTransform'})), ('Cleaner', Cleaner(check_wrapped=False, transformer={'class_nam...False, transformer={'class_name': 'NoTransform'}))]), ['medv'])],
transformer_weights=None),
column_sharer=<foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>),
'data_cleaner___parallel_process': ParallelProcessor(collapse_index=True, n_jobs=1,
transformer_list=[('group: 0', DynamicPipeline(memory=None,
steps=[('Flatten', Flatten(check_wrapped=True, transformer={'class_name': 'NoTransform'})), ('Cleaner', Cleaner(check_wrapped=False, transformer={'class_name': 'NoTransform'}))]), ['crim']), ('group: 1', DynamicPipeline(memory=None,..., ('Cleaner', Cleaner(check_wrapped=False, transformer={'class_name': 'NoTransform'}))]), ['medv'])],
transformer_weights=None),
'data_cleaner__column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
'engineerer_kwargs': None,
'feature_engineerer': FeatureEngineererMapper(_parallel_process=ParallelProcessor(collapse_index=True, n_jobs=1,
transformer_list=[('group: 0', DynamicPipeline(memory=None,
steps=[('FeatureEngineerer', FeatureEngineerer(check_wrapped=True,
transformer={'class_name': 'NoTransform'}))]), ['crim', 'rm', 'age', 'dis', 'b',...}))]), ['zn', 'indus', 'chas', 'nox', 'rad', 'tax', 'ptratio'])],
transformer_weights=None),
column_sharer=<foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>),
'feature_engineerer___parallel_process': ParallelProcessor(collapse_index=True, n_jobs=1,
transformer_list=[('group: 0', DynamicPipeline(memory=None,
steps=[('FeatureEngineerer', FeatureEngineerer(check_wrapped=True,
transformer={'class_name': 'NoTransform'}))]), ['crim', 'rm', 'age', 'dis', 'b', 'lstat', 'medv']), ('group: 1', DynamicPipeline(memory=None,
steps=[('FeatureEngineerer', FeatureEngineerer(check_wrapped=True,
transformer={'class_name': 'NoTransform'}))]), ['zn', 'indus', 'chas', 'nox', 'rad', 'tax', 'ptratio'])],
transformer_weights=None),
'feature_engineerer__column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
'feature_preprocessor': Preprocessor(_parallel_process=ParallelProcessor(collapse_index=True, n_jobs=1,
transformer_list=[('group: 0', DynamicPipeline(memory=None,
steps=[('Imputer', DFTransformer: Imputer), ('Scaler', Scaler(p_val=0.05,
transformer={'memory': None, 'steps': [('box_cox', DFTransformer: BoxCox), ('r...blePipeline'},
unique_num_cutoff=30))]), ['ptratio'])],
transformer_weights=None),
column_sharer=<foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>),
'feature_preprocessor___parallel_process': ParallelProcessor(collapse_index=True, n_jobs=1,
transformer_list=[('group: 0', DynamicPipeline(memory=None,
steps=[('Imputer', DFTransformer: Imputer), ('Scaler', Scaler(p_val=0.05,
transformer={'memory': None, 'steps': [('box_cox', DFTransformer: BoxCox), ('robust_scaler', DFTransformer: RobustScaler)], 'class_name': 'SerializablePip...tEncoder)], 'class_name': 'SerializablePipeline'},
unique_num_cutoff=30))]), ['ptratio'])],
transformer_weights=None),
'feature_preprocessor__column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
'feature_reducer': FeatureReducerMapper(_parallel_process=ParallelProcessor(collapse_index=True, n_jobs=1,
transformer_list=[('group: 0', DynamicPipeline(memory=None,
steps=[('FeatureReducer', FeatureReducer(check_wrapped=False, transformer={'class_name': 'NoTransform'}))]), ['crim', 'rm', 'age', 'dis', 'b', 'lstat', 'med...ptratio_17.6', 'ptratio_18.4', 'ptratio_19.6', 'ptratio_20.2'])],
transformer_weights=None),
column_sharer=<foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>),
'feature_reducer___parallel_process': ParallelProcessor(collapse_index=True, n_jobs=1,
transformer_list=[('group: 0', DynamicPipeline(memory=None,
steps=[('FeatureReducer', FeatureReducer(check_wrapped=False, transformer={'class_name': 'NoTransform'}))]), ['crim', 'rm', 'age', 'dis', 'b', 'lstat', 'medv']), ('group: 1', DynamicPipeline(memory=None,
steps=[('FeatureRedu...', 'ptratio_17.4', 'ptratio_13.0', 'ptratio_17.6', 'ptratio_18.4', 'ptratio_19.6', 'ptratio_20.2'])],
transformer_weights=None),
'feature_reducer__column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
'intent': IntentMapper(_parallel_process=ParallelProcessor(collapse_index=True, n_jobs=1,
transformer_list=[('group: 0', DynamicPipeline(memory=None,
steps=[('IntentResolver', IntentResolver(transformer={'class_name': 'Numeric'}))]), ['crim']), ('group: 1', DynamicPipeline(memory=None,
steps=[('In...ntResolver(transformer={'class_name': 'Numeric'}))]), ['medv'])],
transformer_weights=None),
column_sharer=<foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>),
'intent___parallel_process': ParallelProcessor(collapse_index=True, n_jobs=1,
transformer_list=[('group: 0', DynamicPipeline(memory=None,
steps=[('IntentResolver', IntentResolver(transformer={'class_name': 'Numeric'}))]), ['crim']), ('group: 1', DynamicPipeline(memory=None,
steps=[('IntentResolver', IntentResolver(transformer={'class_name': 'Categoric'}))]), [... steps=[('IntentResolver', IntentResolver(transformer={'class_name': 'Numeric'}))]), ['medv'])],
transformer_weights=None),
'intent__column_sharer': <foreshadow.columnsharer.ColumnSharer object at 0x12c0a1198>,
'intent_kwargs': None,
'modeler_kwargs': None,
'preprocessor_kwargs': None,
'reducer_kwargs': None,
'y_var': None}
Currently working on the design of the serialized output. My starting point is the `ColumnSharer`: we are going to override its `dict_serialize` method using jsonpickle and manipulate the output there, then try to deserialize it back. This will give me a better understanding of how the process works.
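A sketch of what that override could look like, assuming we simply strip jsonpickle's top-level `py/...` bookkeeping keys; the actual pruning logic is still to be designed:

```python
import json
import jsonpickle

def dict_serialize(column_sharer):
    """Hypothetical ColumnSharer.dict_serialize override: run the object
    through jsonpickle, then drop the verbose `py/...` tags so the output
    stays human-readable. Nested tags would still need handling."""
    raw = json.loads(jsonpickle.encode(column_sharer))
    return {key: value for key, value in raw.items()
            if not key.startswith("py/")}
```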
Data Preparer deserialization does not work due to an issue with unpickling some of the data, with the error:
E IndexError: list index out of range
in .pyenv/versions/3.6.8/envs/venv/lib/python3.6/site-packages/jsonpickle/unpickler.py:97: IndexError
Serializing a bare-minimal `data_preparer` with only a `data_cleaner` on a dataset with one column completes as follows:
{
  "cleaner_kwargs": null,
  "intent_kwargs": null,
  "engineerer_kwargs": null,
  "preprocessor_kwargs": null,
  "reducer_kwargs": null,
  "modeler_kwargs": null,
  "y_var": null,
  "steps": [
    {
      "data_cleaner": {
        "n_jobs": 1,
        "transformer_weights": null,
        "collapse_index": true,
        "_class": "CleanerMapper",
        "_method": "dict",
        "transformation_by_column_group": [
          {
            "crim": {
              "memory": null,
              "steps": [
                {
                  "Flatten": {
                    "check_wrapped": true,
                    "force_reresolve": false,
                    "keep_columns": false,
                    "name": "Flatten",
                    "should_resolve": false,
                    "transformer": {
                      "_class": "NoTransform",
                      "_method": "dict"
                    },
                    "y_var": false,
                    "_class": "Flatten",
                    "_method": "dict"
                  }
                },
                {
                  "Cleaner": {
                    "check_wrapped": false,
                    "force_reresolve": false,
                    "keep_columns": false,
                    "name": "Cleaner",
                    "should_resolve": false,
                    "transformer": {
                      "_class": "NoTransform",
                      "_method": "dict"
                    },
                    "y_var": false,
                    "_class": "Cleaner",
                    "_method": "dict"
                  }
                }
              ],
              "_class": "DynamicPipeline",
              "_method": "dict"
            }
          }
        ]
      }
    }
  ],
  "column_sharer": {
    "store": {
      "domain": {
        "crim": "NoTransform"
      }
    },
    "_class": "ColumnSharer",
    "_method": "dict"
  },
  "_class": "DataPreparer",
  "_method": "dict"
}
The main challenge is to sift through and remove all the verbose output from jsonpickle while keeping all the valuable content.
Currently there are three issues:
1. jsonpickle tags tuples with `py/tuple`. Since we are not using that tag, the type information is lost and, when we deserialize back, the tuple is treated as a list. This may or may not be an issue, depending on how the tuple is used. For example, the `quantile_range` in the `RobustScaler` is only used to access the min/max value, so tuple and list both work, but it may not work in other cases.
2. We are not serializing the fitted fields; for example, the scaler has a field `center_`. It's not part of the serialized object and thus nowhere to be found after deserialization. In other words, we are only serializing and deserializing an unfitted pipeline. Is this intentional?
3. The serialized `ColumnSharer` includes all the generated (encoded) columns, so it could get very long. Are we OK with this?
For item 3: strangely, if I re-enable the feature reducer in the DataPreparer, we get all those encoded columns again in the serialized `ColumnSharer`. This definitely needs a deep dive to understand how it works; turn it on and off to compare the output.
Update:
It is because of the `self.check_resolve()` method of the `AutoIntentMixin`, which `FeatureEngineerer`, `Preprocessor` and `FeatureReducer` all implement. Since the generated columns are created after the preprocessing step, they are populated into the `column_sharer` only in the `feature_reducer`, when `self.check_resolve()` is invoked.
For item 1 (scroll to the right):
E 'copy': True,
E - 'quantile_range': (25.0,
E ? ^
E + 'quantile_range': [25.0,
E ? ^
E - 75.0),
E ? ^
E + 75.0],
E ? ^
We are not serializing the fitted field, for example, the scaler has a field center_. It's not part of the serialized object and thus nowhere to be found after deserialization. In other words, we are only serializing and deserializing an unfitted pipeline. Is this intentional?
@JingJZ160 Yes, the idea is that there are three types of serialization:
- User-facing params, i.e. not the internal state (dict serialize)
- Internal state, i.e. saved params such as `center_`, in pickle:
  a. inline (in the serialized representation)
  b. symbolic (saved to disk with a link to the saved location)
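A rough sketch of the two pickle flavours described above; the helper name `serialize_internal_state` and the return shapes are illustrative assumptions, not the foreshadow API:

```python
import base64
import pickle

def serialize_internal_state(transformer, method="inline", path=None):
    """Illustrative only: embed the fitted state inline, or save it to
    disk and keep a symbolic link to the saved location."""
    payload = pickle.dumps(transformer)
    if method == "inline":
        return {"_method": "pickle_inline",
                "data": base64.b64encode(payload).decode("ascii")}
    with open(path, "wb") as f:
        f.write(payload)
    return {"_method": "pickle_symbolic", "path": path}
```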
Thanks for the clarification @adithyabsk
For item 1, we will need to implement a customized JSON encoder/decoder if we want to handle tuple serialization and deserialization: https://stackoverflow.com/questions/15721363/preserve-python-tuples-with-json
I'll create a new issue for this.
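Following the linked Stack Overflow approach, a minimal sketch (the `__tuple__` tag name is just a convention, not foreshadow code):

```python
import json

def _mark_tuples(obj):
    """Recursively tag tuples so a decoder hook can restore them."""
    if isinstance(obj, tuple):
        return {"__tuple__": True, "items": [_mark_tuples(i) for i in obj]}
    if isinstance(obj, list):
        return [_mark_tuples(i) for i in obj]
    if isinstance(obj, dict):
        return {k: _mark_tuples(v) for k, v in obj.items()}
    return obj

def _restore_tuples(d):
    """json.loads object_hook that turns tagged dicts back into tuples."""
    if d.get("__tuple__"):
        return tuple(d["items"])
    return d

params = {"quantile_range": (25.0, 75.0)}
decoded = json.loads(json.dumps(_mark_tuples(params)),
                     object_hook=_restore_tuples)
assert decoded["quantile_range"] == (25.0, 75.0)  # round-trips as a tuple
```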
No longer a concern. We may add a task to simplify the whole serialized JSON, or just generate a Python file containing the whole trained pipeline, the way TPOT does.
Description
The DataPreparer should be able to be serialized and deserialized. The base serialization Mixin enables this naively (i.e., by calling the `.serialize(deep=True)` method on the DataPreparer). The output is currently a nested JSON object with many redundant/duplicate entries (e.g., `column_sharer`) and not-so-readable raw Python object tags (e.g., `py/`). The goal is to implement OO serialization that enables custom filtering for each component during serialization, so that each component decides what should and should not be displayed. For instance, one rule of thumb is that if there are duplicates of the same instance, they should not be shown in the serialized object.
What should be included in the output
In a format that can represent the nested hierarchy of the data_preparer, the list of steps, and the transformers in each step, we want to include all the leaf parameters (of the objects just mentioned) that have changed from their default values.
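One way to compute "changed from default" for a scikit-learn-style estimator is sketched below; `changed_params` is a hypothetical helper, not existing foreshadow code:

```python
import inspect

def changed_params(estimator):
    """Keep only constructor params whose current value differs from the
    declared default, mirroring the filtering rule described above."""
    signature = inspect.signature(type(estimator).__init__)
    out = {}
    for name, param in signature.parameters.items():
        # Skip `self` and *args/**kwargs, which have no default.
        if name == "self" or param.default is inspect.Parameter.empty:
            continue
        value = getattr(estimator, name, param.default)
        if value != param.default:
            out[name] = value
    return out
```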
Q: Then how does the user know which existing parameters they can potentially tune/override if we only output changed parameters? What if they are all default values? @adithyabsk could you comment on this?
Update based on discussion with @adithyabsk: keep `_method` and `_class` in the serialized output so that deserialization knows which method to use to deserialize the data and which class to deserialize it to.
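A hypothetical sketch of how those tags could drive deserialization; the registry name and the `{method}_deserialize` convention are assumptions:

```python
# Hypothetical name -> class mapping, populated elsewhere.
CLASS_REGISTRY = {}

def deserialize(data):
    """Dispatch on the `_class`/`_method` tags embedded in the output."""
    data = dict(data)  # avoid mutating the caller's dict
    cls = CLASS_REGISTRY[data.pop("_class")]
    loader = getattr(cls, "{}_deserialize".format(data.pop("_method")))
    return loader(data)
```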
Success Criteria
Subtasks
- Fix the `ConcreteSerializerMixin` on the `PreparerStep` class when invoking `to_json()`. Spent 2 hours on this task.
- Fix the `from_json()` bug caused by `get_params()` of the data_preparer, possibly blocked by #129. Spent 1 hour just to root-cause the issue.
- Override the `serialize()` methods in each component if custom filtering of what is returned by `.get_params()` is required.
- Override the `.deserialize()` methods in each component if custom filtering of what is returned by `.get_params()` is required.
TODO:
Estimate