ShifuML / shifu

An end-to-end machine learning and data mining framework on Hadoop
https://github.com/ShifuML/shifu/wiki
Apache License 2.0

Fix segment data missing in shifu norm #739

Closed Liu-Delin closed 3 years ago

Liu-Delin commented 3 years ago

Description

This PR is a bug fix for https://github.com/ShifuML/shifu/pull/732.

We renamed segment columns from name_1 to name_seg1 in the stats step, but did not update the logic in the norm step, so the segment columns come out empty in the normalized data.

For example, the header is:

target_column|column_1|column_2|column_1_seg1|column_2_seg1

We will get data like below (with the bug):

1|0.5|0.1|||1.0

But we want to get data like:

1|0.5|0.1|0.2|-0.1|1.0

Therefore, I updated the logic in the norm step and in every other step that uses BasicUpdater.
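To illustrate the mismatch, here is a minimal sketch of how a name-based lookup can produce empty segment fields. It is not Shifu's actual BasicUpdater code: the class `SegmentNameSketch` and its helpers `oldSegmentName`, `newSegmentName`, `buildValues`, and `normalizeRow` are hypothetical names made up for this example. It only assumes that the header uses the new `_seg1` suffix while the values are keyed by either the old or the new name. The trailing field from the sample rows above is left out because the example header has only five columns.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not taken from the Shifu code base.
class SegmentNameSketch {

    // Old naming: name_1
    static String oldSegmentName(String base, int seg) {
        return base + "_" + seg;
    }

    // New naming introduced in the stats step: name_seg1
    static String newSegmentName(String base, int seg) {
        return base + "_seg" + seg;
    }

    // Builds normalized values keyed by column name, using either naming scheme
    // for the segment columns.
    static Map<String, String> buildValues(boolean useNewNaming) {
        Map<String, String> values = new HashMap<>();
        values.put("target_column", "1");
        values.put("column_1", "0.5");
        values.put("column_2", "0.1");
        String seg1 = useNewNaming ? newSegmentName("column_1", 1) : oldSegmentName("column_1", 1);
        String seg2 = useNewNaming ? newSegmentName("column_2", 1) : oldSegmentName("column_2", 1);
        values.put(seg1, "0.2");
        values.put(seg2, "-0.1");
        return values;
    }

    // Writes one pipe-delimited row; a column with no matching key becomes an empty field.
    static String normalizeRow(List<String> header, Map<String, String> valuesByName) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < header.size(); i++) {
            if (i > 0) {
                row.append('|');
            }
            String value = valuesByName.get(header.get(i));
            row.append(value == null ? "" : value);
        }
        return row.toString();
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList(
                "target_column", "column_1", "column_2", "column_1_seg1", "column_2_seg1");

        // Values keyed with the old suffix do not match the header: segment fields are empty.
        System.out.println(normalizeRow(header, buildValues(false))); // 1|0.5|0.1||

        // Values keyed with the new suffix match the header.
        System.out.println(normalizeRow(header, buildValues(true)));  // 1|0.5|0.1|0.2|-0.1
    }
}
```

The point of the sketch is that both steps have to agree on one naming rule for segment columns; if only one side is renamed, the lookup fails silently and writes empty fields.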

Tests

I tested the change manually on a large data set, and it generated the correct data. I did not find any existing unit tests for BasicUpdater; I can add one later.