Foreshadow upgrade - Githubissues

jzhang-gp commented 4 years ago

Description

This PR upgrades Foreshadow's dependencies, including Sklearn, Pandas, and TPOT, and in turn their own dependencies.

Major dependency changes:

pandas = "^0.25.0"
scikit-learn = "^0.22.1"
TPOT = "^0.11.0"

Due to these dependency changes, we touched a large number of files, since there are new features in Sklearn that we can leverage. The multiprocessing issue seems to be addressed for now. It may resurface on an even larger dataset, but my suspicion is that in that case the data cleaner has some performance issues.

The main change here is that we are using the ColumnTransformer in Sklearn v0.22. This class can replace our own version of ParallelProcessor and DynamicPipeline.
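For readers unfamiliar with ColumnTransformer, here is a minimal sketch of how it applies different transformers to different columns and concatenates the results. The toy data, the step names, and the choice of StandardScaler and OneHotEncoder are assumptions for illustration only; they are not the transformers Foreshadow actually wires in.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: columns 0 and 1 are numeric,
# column 2 holds integer category codes.
X = np.array(
    [
        [1.0, 10.0, 0],
        [2.0, 20.0, 1],
        [3.0, 30.0, 0],
        [4.0, 40.0, 2],
    ]
)

# Each entry is (name, transformer, columns). The per-column outputs are
# concatenated side by side, which is the role a hand-written
# parallel-processing step would otherwise play.
column_transformer = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), [0, 1]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), [2]),
    ],
    sparse_threshold=0,  # always return a dense array
    n_jobs=1,  # can be raised to fit the transformers in parallel
)

X_transformed = column_transformer.fit_transform(X)
print(X_transformed.shape)
```

The result has two scaled numeric columns followed by one indicator column per category seen in column 2.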

We also added a data flattening step before the data cleaning step. The old Foreshadow data cleaner chains the flattener and cleaner together. In theory, the cleaner can only work on a single column, but the flattener outputs multiple columns. The current solution is a hack that injects customized code at runtime, which is very hard to debug and maintain.
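The separation described above can be sketched in plain Python: a flattening step that may turn one column into several runs first, and only then is a single-column cleaner applied to each resulting column. All names here (`flatten_json_column`, `strip_dollar`, `flatten_then_clean`) are hypothetical and do not reflect Foreshadow's actual classes.

```python
import json
from typing import Callable, Dict, List, Optional

Column = List[Optional[str]]


def flatten_json_column(name: str, values: List[str]) -> Dict[str, Column]:
    """One column in, several out: one output column per JSON key."""
    rows = [json.loads(value) for value in values]
    keys = sorted({key for row in rows for key in row})
    return {
        f"{name}_{key}": [
            None if row.get(key) is None else str(row[key]) for row in rows
        ]
        for key in keys
    }


def strip_dollar(values: Column) -> Column:
    """One column in, one column out: drop dollar signs and whitespace."""
    return [None if v is None else v.replace("$", "").strip() for v in values]


def flatten_then_clean(
    data: Dict[str, List[str]],
    flatteners: Dict[str, Callable[[str, List[str]], Dict[str, Column]]],
    cleaner: Callable[[Column], Column],
) -> Dict[str, Column]:
    # Step 1: flatten. Columns without a flattener pass through unchanged.
    flat: Dict[str, Column] = {}
    for name, values in data.items():
        if name in flatteners:
            flat.update(flatteners[name](name, values))
        else:
            flat[name] = list(values)
    # Step 2: clean. The cleaner only ever sees a single column.
    return {name: cleaner(values) for name, values in flat.items()}


raw = {
    "price": ["$10", " $20 "],
    "meta": ['{"city": "NYC", "zip": 10001}', '{"city": "SF"}'],
}
result = flatten_then_clean(raw, {"meta": flatten_json_column}, strip_dollar)
print(result)
```

Because the cleaner is only called after flattening has finished, it never has to cope with a step that changes the number of columns, so no runtime code injection is needed.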

The changes in this PR upgrade Foreshadow to the minimum working version with the latest dependencies, but there are still quite a few improvements to make. After this, we should remove all the redundant features and code from the code base and add the unit tests back.

jzhang-gp commented 4 years ago

All tests have passed: https://dev.azure.com/georgianpartners/foreshadow/_testManagement/runs?runId=414&_a=resultQuery

The only issue we have is code coverage, which is now below 75%. I could start adding more tests, but that would add more changes to the PR.

jzhang-gp commented 4 years ago

Discard in favor of the other PR.

georgian-io-archive / foreshadow

Foreshadow upgrade #205
