alteryx / featuretools

An open source python library for automated feature engineering
https://www.featuretools.com
BSD 3-Clause "New" or "Revised" License

Stacking certain primitives generates a warning indicating a bug - but the behavior is expected #1068

Open thehomebrewnerd opened 4 years ago

thehomebrewnerd commented 4 years ago

Using certain combinations of primitives during dfs can result in a warning being raised. The warning message indicates there is likely a bug, but the behavior is actually expected based on these primitives. An example of the warning is shown below:

featuretools - WARNING    Attempting to add feature <Feature: val2 / 1 / val2> which is already present. This is likely a bug.

The warning is easily reproduced with max_depth=2 by using either of these primitive groups, assuming the same scalar value is used in the primitives that take a scalar: ['divide_numeric_scalar', 'divide_by_feature', 'divide_numeric'] or ['modulo_numeric_scalar', 'modulo_by_feature', 'modulo_numeric'].

The warning is raised because there are multiple paths to arrive at the same feature with these groups of primitives.

For example, the feature val2 / 1 / val2 can be created by first creating val2 / 1 with divide_numeric_scalar and then dividing that by val2 with divide_numeric. The second path is to first create 1 / val2 with divide_by_feature and then divide val2 by that feature with divide_numeric.
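To make the two paths concrete, here is a minimal sketch in plain Python that does not call featuretools. It assumes feature names are built by joining operand names with " / ", which matches the names quoted in this issue; the three helper functions are hypothetical stand-ins for the primitives' naming, not the library's API.

```python
# Hypothetical stand-ins for how each primitive names its output feature.
# Assumption: names are formed as "<left> / <right>", as in the warning text.

def divide_numeric_scalar_name(base, value):
    # feature / scalar
    return f"{base} / {value}"

def divide_by_feature_name(value, base):
    # scalar / feature
    return f"{value} / {base}"

def divide_numeric_name(left, right):
    # feature / feature
    return f"{left} / {right}"

# Path 1: build "val2 / 1" first, then divide it by val2.
path_one = divide_numeric_name(divide_numeric_scalar_name("val2", 1), "val2")

# Path 2: build "1 / val2" first, then divide val2 by it.
path_two = divide_numeric_name("val2", divide_by_feature_name(1, "val2"))

print(path_one)
print(path_two)
print(path_one == path_two)
```

Both paths end at the name val2 / 1 / val2, so the second one to be added is reported as already present.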

Ideally we would update the process of generating features to avoid raising this warning in situations where we expect these duplicates to occur - or possibly avoid stacking features that create this situation.
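One possible shape for the first idea, sketched in plain Python under stated assumptions: candidates are (name, path) pairs, and a name reached a second time is skipped quietly and recorded rather than reported as a likely bug. The function collect_unique and the path strings are hypothetical and are not part of featuretools.

```python
def collect_unique(candidates):
    """Keep the first path seen for each feature name; record later duplicates."""
    unique = {}
    skipped = []
    for name, path in candidates:
        if name in unique:
            # Same name reached by another primitive path: skip without warning.
            skipped.append((name, path))
            continue
        unique[name] = path
    return unique, skipped

# Hypothetical candidates for the divide example in this issue.
candidates = [
    ("val2 / 1", "divide_numeric_scalar"),
    ("1 / val2", "divide_by_feature"),
    ("val2 / 1 / val2", "divide_numeric_scalar -> divide_numeric"),
    ("val2 / 1 / val2", "divide_by_feature -> divide_numeric"),
]

unique, skipped = collect_unique(candidates)
print(list(unique))
print(skipped)
```

The skipped list keeps the information available for debugging, so a real implementation could still log it at a lower level than a warning.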

Other options to consider:

This issue is closely related to #832 and both potentially could be fixed together.

The following code can be used to reproduce this warning:

import pandas as pd
import featuretools as ft

# Single-entity EntitySet with two numeric columns
es = ft.EntitySet('es')

df = pd.DataFrame({
    'index': [0, 1, 2],
    'val1': [1, 2, 1],
    'val2': [10, 20, 30],
})

es.entity_from_dataframe(dataframe=df, entity_id='entity', index='index')

# These three primitives can reach the same feature by two different paths
trans_primitives = ['divide_numeric_scalar', 'divide_by_feature', 'divide_numeric']
agg_primitives = []

# max_depth=2 allows the primitives to stack, which triggers the warning
fm, features = ft.dfs(entityset=es,
                      target_entity='entity',
                      trans_primitives=trans_primitives,
                      agg_primitives=agg_primitives,
                      max_depth=2)

tyler3991 commented 3 years ago

Is this still an issue? Iceboxing for now.

thehomebrewnerd commented 3 years ago

@tyler3991 Yes, I believe this is still an issue, but a low priority one. Iceboxing is fine.