feature-engine / feature_engine

Feature engineering package with sklearn like functionality
https://feature-engine.trainindata.com/
BSD 3-Clause "New" or "Revised" License
1.8k stars 304 forks source link

Fix unbound local variable reference #681

Closed david-cortes closed 11 months ago

david-cortes commented 11 months ago

This fixes an error when transformers are passed variables that do not fall under the checks performed in this function. Not sure if it's the right approach though.

david-cortes commented 11 months ago

The error about unbound local variable, by the way, happens when supplying duplicated columns with missing values and using missing_values="ignore".

solegalli commented 11 months ago

Thank you @david-cortes

Could you please add a test that causes the error in question, and that is resolved by this tweak?

What do you mean by duplicated columns? 2 variables with the same column name? If this is the case, I don't think we should take care of this. It is the user's responsibility.

Or do you refer to 2 variables with different names that are otherwise identical? In this case, we need this fix.

Thank you!

david-cortes commented 11 months ago

The error happens when there are duplicated column names, regardless of whether the contents are identical or not. Example:

import pandas as pd
from feature_engine.preprocessing import MatchCategories

df = pd.DataFrame({
    "col1" : ["a", "b", "c"],
    "col2" : [1, 2, 3]
})
df.columns = ["the column"] * 2
MatchCategories().fit_transform(df)
...
     [19](file:///home/david/anaconda3/lib/python3.9/site-packages/feature_engine/variable_handling/_variable_type_checks.py?line=18) elif is_categorical(column):
     [20](file:///home/david/anaconda3/lib/python3.9/site-packages/feature_engine/variable_handling/_variable_type_checks.py?line=19)     is_cat = _is_categories_num(column) or not _is_convertible_to_dt(column)
---> [22](file:///home/david/anaconda3/lib/python3.9/site-packages/feature_engine/variable_handling/_variable_type_checks.py?line=21) return is_cat

UnboundLocalError: local variable 'is_cat' referenced before assignment

In other situations (e.g. if passing the column names as argument variables) it would throw an error about duplicated columns, but in this case it doesn't.

However, on second thought, I am thinking this is not a proper solution as then it still wouldn't throw an error about duplicated columns.

david-cortes commented 11 months ago

Closing in favor of this other PR: https://github.com/feature-engine/feature_engine/pull/686