ShifuML / shifu

An end-to-end machine learning and data mining framework on Hadoop
https://github.com/ShifuML/shifu/wiki
Apache License 2.0

Fix duplicated column name issue #732

Closed Liu-Delin closed 3 years ago

Liu-Delin commented 3 years ago

Description

1. Fix duplicated column names in the raw data header.

This change affects the shifu init action. Old logic: if the raw data header has duplicated column names, a _[columnIndex] suffix is added to the duplicate to avoid the duplication. For example, if the raw data header is:

0: name
1: name_3
2: name_dup3
3: name

Because the columns at index 0 and 3 are both named name, the old logic renames the column at index 3 to name_3:

0: name
1: name_3
2: name_dup3
3: name_3

But now the columns at index 1 and 3 conflict, since both are named name_3.

New logic: we add a _dup[columnIndex] suffix to the duplicated name. If the result still conflicts with another column, we change it to _dup[columnIndex]_1, _dup[columnIndex]_2, etc. With the example above, the column names become:

0: name
1: name_3
2: name_dup3
3: name_dup3_1
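
The renaming rule above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the Java code in this PR: the function name `dedupe_header` is hypothetical, and it assumes the first occurrence of a name is kept unchanged and that a candidate name is checked against every original header name as well as every name already assigned.

```python
def dedupe_header(names):
    """Rename repeated header names using the _dup[columnIndex] rule (illustrative sketch)."""
    taken = set(names)   # assumption: every original name is reserved up front
    seen = set()         # names whose first occurrence has already been emitted
    result = []
    for index, name in enumerate(names):
        if name not in seen:
            # First occurrence keeps its original name.
            seen.add(name)
            result.append(name)
            continue
        # Repeated name: try name_dup<index>, then name_dup<index>_1, _2, ...
        base = f"{name}_dup{index}"
        candidate = base
        counter = 1
        while candidate in taken:
            candidate = f"{base}_{counter}"
            counter += 1
        taken.add(candidate)
        result.append(candidate)
    return result


print(dedupe_header(["name", "name_3", "name_dup3", "name"]))
# ['name', 'name_3', 'name_dup3', 'name_dup3_1']
```

Run on the example header, the sketch reproduces the result listed above: the column at index 3 first tries name_dup3, finds it taken by the column at index 2, and falls back to name_dup3_1.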

2. Fix duplicated column names caused by the multi-segment feature.

This change affects the shifu stats action. Old logic: we add a _[segmentIndex] suffix to each segment column. For example, given the column names below:

0: name_1
1: name
2: name_seg1

After adding the segment columns, the column names become:

0: name_1
1: name
2: name_seg1
3: name_1_1
4: name_1
5: name_seg1_1

Here the columns at index 0 and 4 conflict, since both are named name_1.

New logic: we add a _seg[segmentIndex] suffix to each segment column. If the result still conflicts with another column, it is changed to _seg[segmentIndex]_1, _seg[segmentIndex]_2, etc. With the example above, the columns become:

0: name_1
1: name
2: name_seg1
3: name_1_seg1
4: name_seg1_1
5: name_seg1_seg1
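
The segment naming rule can be sketched the same way. Again this is a Python illustration rather than the Java implementation in this PR: the function name `add_segment_columns` and its `segment_count` parameter are hypothetical, and the sketch assumes each segment appends one copy of every original column, with candidates checked against all original and previously generated names.

```python
def add_segment_columns(names, segment_count):
    """Append segment columns named with the _seg[segmentIndex] rule (illustrative sketch)."""
    taken = set(names)    # original column names are never reused
    result = list(names)
    for seg in range(1, segment_count + 1):
        for name in names:
            # Try name_seg<seg>, then name_seg<seg>_1, _2, ... until unused.
            base = f"{name}_seg{seg}"
            candidate = base
            counter = 1
            while candidate in taken:
                candidate = f"{base}_{counter}"
                counter += 1
            taken.add(candidate)
            result.append(candidate)
    return result


print(add_segment_columns(["name_1", "name", "name_seg1"], 1))
# ['name_1', 'name', 'name_seg1', 'name_1_seg1', 'name_seg1_1', 'name_seg1_seg1']
```

With one extra segment, the copy of name would become name_seg1, which already exists at index 2, so it falls back to name_seg1_1, matching the listing above.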

3. Fix a compatibility issue for shifu-tensorflow.

Add the method HDFSUtils.getFS(), because shifu-tensorflow depends on this method.

Tests

  1. Unit tests added.
  2. Manually tested shifu new, shifu init, shifu stats and shifu norm.