Update Cache Manager in PrepareStep instead of each parallel process
Separated flatten and data clean steps
DropCleaner bug fix
SimpleImputer Renaming
Code elimination to improve code coverage
Removing ParallelProcessor and DynamicPipeline
The latest Scikit-learn provides a class ColumnTransformer, which allows parallel processing of each column.
This covers:
The parallel processing issue we had on the datasets we have tested on (except for two due to the performance issue of the data cleaner).
Eliminate non-native code, which could be error-prone.
Related file:
foreshadow/ColumnTransformerWrapper.py This is a wrapper of the actual ColumnTransformer. We need to enable it to return pandas data frame. Otherwise foreshadow steps will only return ndarray.
Every PrepareStep subclass in foreshadow. We can use foreshadow/steps/mapper.py as an example. We are now following the convention of creating sklearn compatible transformers by overriding the fit and transform methods.
Update Cache Manager in PrepareStep instead of each parallel process
Instead of writing to cache_manager in each parallel process, like the intent type, we update an attribute field in each SmartTransformer. For example, the intent_resolver now has an attribute column_intent.
Back in the IntentMapper class, we gather all column_intents from the each intent_resolver and then write into the cache_manager once.
We may need to the same for domain type in CleanerMapper.
Separated flatten and data clean steps
Due to the adoption of the ColumnTransformer, we can no longer concatenate Flatten with Cleaner in one PrepareStep. It works before because of a hack we made (code injection), which is hard to debug and maintain.
Related file:
foreshadow/steps/flattener.py
foreshadow/steps/cleaner.py
Drop Cleaner Bug
DropCleaner returns an empty data frame without specify the index, which causes index alignment issue in the downstream. Adding the original index is the fix.
Related file: foreshadow/concrete/internals/cleaners/drop.py
SimpleImputer renaming
The latest sklearn provides a more powerful imputation class SimpleImputer but it happens to collide with our SimpleImputer's name so we need to rename them.
Related file:
foreshadow/smart/all.py
All other code changes are for unit test code coverage
In order to pass code coverage requirement on the CICD pipeline, I have to delete or comment out code through out the code base. This includes deleting the old serialization and deserialization functionality, which is not adopted by users anyway.
Description
Cleaned up code version. This is to see if the code can pass the branch coverage by removing unnecessary code. Related to the PR: https://github.com/georgianpartners/foreshadow/pull/205
What have changed:
ParallelProcessor
andDynamicPipeline
PrepareStep
instead of each parallel processRemoving
ParallelProcessor
andDynamicPipeline
The latest Scikit-learn provides a class
ColumnTransformer
, which allows parallel processing of each column.This covers:
Related file:
foreshadow/ColumnTransformerWrapper.py
This is a wrapper of the actualColumnTransformer
. We need to enable it to return pandas data frame. Otherwise foreshadow steps will only returnndarray
.PrepareStep
subclass in foreshadow. We can useforeshadow/steps/mapper.py
as an example. We are now following the convention of creating sklearn compatible transformers by overriding thefit
andtransform
methods.Update Cache Manager in
PrepareStep
instead of each parallel processInstead of writing to cache_manager in each parallel process, like the intent type, we update an attribute field in each SmartTransformer. For example, the intent_resolver now has an attribute
column_intent
.Back in the
IntentMapper
class, we gather allcolumn_intent
s from the each intent_resolver and then write into thecache_manager
once.Related file:
foreshadow/steps/mapper.py
foreshadow/smart/intent_resolving/intentresolver.py
TODO:
domain
type inCleanerMapper
.Separated flatten and data clean steps
Due to the adoption of the
ColumnTransformer
, we can no longer concatenateFlatten
withCleaner
in onePrepareStep
. It works before because of a hack we made (code injection), which is hard to debug and maintain.Related file:
foreshadow/steps/flattener.py
foreshadow/steps/cleaner.py
Drop Cleaner Bug
DropCleaner
returns an empty data frame without specify the index, which causes index alignment issue in the downstream. Adding the original index is the fix.Related file:
foreshadow/concrete/internals/cleaners/drop.py
SimpleImputer renaming
The latest sklearn provides a more powerful imputation class
SimpleImputer
but it happens to collide with ourSimpleImputer
's name so we need to rename them.Related file:
All other code changes are for unit test code coverage
In order to pass code coverage requirement on the CICD pipeline, I have to delete or comment out code through out the code base. This includes deleting the old serialization and deserialization functionality, which is not adopted by users anyway.