Foreshadow upgrade cleaned up

Description

Cleaned up code version. This is to see if the code can pass the branch coverage by removing unnecessary code. Related to the PR: https://github.com/georgianpartners/foreshadow/pull/205

What have changed:

Removing ParallelProcessor and DynamicPipeline
Update Cache Manager in PrepareStep instead of each parallel process
Separated flatten and data clean steps
DropCleaner bug fix
SimpleImputer Renaming
Code elimination to improve code coverage

Removing `ParallelProcessor` and `DynamicPipeline`

The latest Scikit-learn provides a class ColumnTransformer, which allows parallel processing of each column.

This covers:

The parallel processing issue we had on the datasets we have tested on (except for two due to the performance issue of the data cleaner).
Eliminate non-native code, which could be error-prone.

Related file:

foreshadow/ColumnTransformerWrapper.py This is a wrapper of the actual ColumnTransformer. We need to enable it to return pandas data frame. Otherwise foreshadow steps will only return ndarray.
Every PrepareStep subclass in foreshadow. We can use foreshadow/steps/mapper.py as an example. We are now following the convention of creating sklearn compatible transformers by overriding the fit and transform methods.

Update Cache Manager in `PrepareStep` instead of each parallel process

Instead of writing to cache_manager in each parallel process, like the intent type, we update an attribute field in each SmartTransformer. For example, the intent_resolver now has an attribute column_intent.

Back in the IntentMapper class, we gather all column_intents from the each intent_resolver and then write into the cache_manager once.

Related file:

foreshadow/steps/mapper.py
foreshadow/smart/intent_resolving/intentresolver.py

TODO:

We may need to the same for domain type in CleanerMapper.

Separated flatten and data clean steps

Due to the adoption of the ColumnTransformer, we can no longer concatenate Flatten with Cleaner in one PrepareStep. It works before because of a hack we made (code injection), which is hard to debug and maintain.

Related file:

foreshadow/steps/flattener.py
foreshadow/steps/cleaner.py

Drop Cleaner Bug

DropCleaner returns an empty data frame without specify the index, which causes index alignment issue in the downstream. Adding the original index is the fix.

Related file: foreshadow/concrete/internals/cleaners/drop.py

SimpleImputer renaming

The latest sklearn provides a more powerful imputation class SimpleImputer but it happens to collide with our SimpleImputer's name so we need to rename them.

Related file:

foreshadow/smart/all.py

All other code changes are for unit test code coverage

In order to pass code coverage requirement on the CICD pipeline, I have to delete or comment out code through out the code base. This includes deleting the old serialization and deserialization functionality, which is not adopted by users anyway.

georgian-io-archive / foreshadow