georgian-io-archive / foreshadow

An automatic machine learning system
https://foreshadow.readthedocs.io
Apache License 2.0
29 stars 2 forks source link

Foreshadow upgrade cleaned up #206

Closed jzhang-gp closed 4 years ago

jzhang-gp commented 4 years ago

Description

Cleaned up code version. This is to see if the code can pass the branch coverage by removing unnecessary code. Related to the PR: https://github.com/georgianpartners/foreshadow/pull/205

What have changed:

Removing ParallelProcessor and DynamicPipeline

The latest Scikit-learn provides a class ColumnTransformer, which allows parallel processing of each column.

This covers:

Related file:

Update Cache Manager in PrepareStep instead of each parallel process

Instead of writing to cache_manager in each parallel process, like the intent type, we update an attribute field in each SmartTransformer. For example, the intent_resolver now has an attribute column_intent.

Back in the IntentMapper class, we gather all column_intents from the each intent_resolver and then write into the cache_manager once.

Related file:

TODO:

Separated flatten and data clean steps

Due to the adoption of the ColumnTransformer, we can no longer concatenate Flatten with Cleaner in one PrepareStep. It works before because of a hack we made (code injection), which is hard to debug and maintain.

Related file:

Drop Cleaner Bug

DropCleaner returns an empty data frame without specify the index, which causes index alignment issue in the downstream. Adding the original index is the fix.

Related file: foreshadow/concrete/internals/cleaners/drop.py

SimpleImputer renaming

The latest sklearn provides a more powerful imputation class SimpleImputer but it happens to collide with our SimpleImputer's name so we need to rename them.

Related file:

All other code changes are for unit test code coverage

In order to pass code coverage requirement on the CICD pipeline, I have to delete or comment out code through out the code base. This includes deleting the old serialization and deserialization functionality, which is not adopted by users anyway.