Integer feature with values 1 and 2 cannot be handled as categorical?

shgo commented 4 years ago

Hey guys, I'm getting an error when handling integer columns, and the error message doesn't make it clear to me what is going on. So far it looks like a bug to me. Here it goes.

We start by importing the basics and generating a pandas dataframe with 4 columns containing random real numbers, plus an integer column named 'target' with values 1 and 2.

import sweetviz as sv
import pandas as pd
import numpy as np

np.random.seed(42)
np_data = np.random.randn(10, 4)
df = pd.DataFrame(np_data, columns=['col1', 'col2', 'col3', 'col4'])
df['target'] = 1.0
df.loc[5:, 'target'] = 2.  # rows 5-9; single .loc call avoids chained assignment
df['target'] = df['target'].astype(int)

Taking a look at the original types of the dataframe (df.dtypes), we have as a result:

col1      float64
col2      float64
col3      float64
col4      float64
target      int32
dtype: object

Error: TypeError

compareReport = sv.compare_intra(df, df['target'] == 1, ["Complete", "Incomplete"])
compareReport.show_html()

gives this message:

TypeError                                 Traceback (most recent call last)
<ipython-input-54-8e3e89553904> in <module>
      1 #feature_config = sv.FeatureConfig(force_num=['col1', 'col2', 'col3', 'col4'], force_cat='target')
----> 2 compareReport = sv.compare_intra(df, df['target'] == 1, ["Complete", "Incomplete"])#, feat_cfg=feature_config, target_feat='target')
      3 compareReport.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"

~\AppData\Local\Continuum\anaconda3\envs\sweetbug\lib\site-packages\sweetviz\sv_public.py in compare_intra(source_df, condition_series, names, target_feat, feat_cfg, pairwise_analysis)
     42     report = sweetviz.DataframeReport([data_true, names[0]], target_feat,
     43                                       [data_false, names[1]],
---> 44                                       pairwise_analysis, feat_cfg)
     45     return report
     46 

~\AppData\Local\Continuum\anaconda3\envs\sweetbug\lib\site-packages\sweetviz\dataframe_report.py in __init__(self, source, target_feature_name, compare, pairwise_analysis, fc)
    215             # start = time.perf_counter()
    216             self.progress_bar.set_description(':' + f.source.name + '')
--> 217             self._features[f.source.name] = sa.analyze_feature_to_dictionary(f)
    218             self.progress_bar.update(1)
    219             # print(f"DONE FEATURE------> {f.source.name}"

~\AppData\Local\Continuum\anaconda3\envs\sweetbug\lib\site-packages\sweetviz\series_analyzer.py in analyze_feature_to_dictionary(to_process)
     92         compare_type = determine_feature_type(to_process.compare,
     93                                               to_process.compare_counts,
---> 94                                               returned_feature_dict["type"], "COMPARED")
     95         if compare_type != FeatureType.TYPE_ALL_NAN and \
     96             source_type != FeatureType.TYPE_ALL_NAN:

~\AppData\Local\Continuum\anaconda3\envs\sweetbug\lib\site-packages\sweetviz\type_detection.py in determine_feature_type(series, counts, must_be_this_type, which_dataframe)
     73             var_type = FeatureType.TYPE_TEXT
     74         else:
---> 75             raise TypeError(f"Cannot force series '{series.name}' in {which_dataframe} to be from its type {var_type} to\n"
     76                             f"DESIRED type {must_be_this_type}. Check documentation for the possible coercion possibilities.\n"
     77                             f"This can be solved by changing the source data or is sometimes caused by\n"

TypeError: Cannot force series 'target' in COMPARED to be from its type FeatureType.TYPE_CAT to
DESIRED type FeatureType.TYPE_BOOL. Check documentation for the possible coercion possibilities.
This can be solved by changing the source data or is sometimes caused by
a feature type mismatch between source and compare dataframes.

If I explicitly supply the feat_cfg argument the result is the same.

feature_config = sv.FeatureConfig(force_num=['col1', 'col2', 'col3', 'col4'], force_cat='target')
compareReport = sv.compare_intra(df, df['target'] == 1, ["Complete", "Incomplete"], feat_cfg=feature_config)
compareReport.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"

However, if I add 10 to the 'target' column (it will now have 11 and 12 as values), the report is generated without errors. Am I missing something, or is it indeed a bug?

fbdesignpro commented 4 years ago

@shgo Thank you very much for the detailed writeup! Good catch! I believe it is probably an edge case of trying to automatically detect Boolean features. This is likely a bug, I will definitely check it out ASAP (it will have to be in the coming days, I am crunching at work...).

fbdesignpro commented 4 years ago

Hi @shgo! Turns out it was a good one! You are correct that the way to fix it is to force the automatically-detected boolean series to be categorical. However, there was a missing line of code to handle that case ;)

By the way, when using the feature config you don't have to specify EVERY column, JUST the ones that are not being detected correctly or to your liking. (You might have known that, just making sure!)

SO long story short: it should be fixed in the latest 1.0beta4, let me know if that is the case. :) Thank you again!

shgo commented 4 years ago

Thanks for acting so fast @fbdesignpro!

fbdesignpro commented 4 years ago

@shgo Just to be SUPER sure, did you verify that it fixed it? :)

shgo commented 4 years ago

Just did it, but it seems to hit another issue.

Error 1: TypeError

import sweetviz as sv
import pandas as pd
import numpy as np
np.random.seed(42)
np_data = np.random.randn(10, 4)
df = pd.DataFrame(np_data, columns=['col1', 'col2', 'col3', 'col4'])
df['target'] = 1.0
df.loc[5:, 'target'] = 2.  # rows 5-9; single .loc call avoids chained assignment
df['target'] = df['target'].astype(int)
#df['target'] += 10
compareReport = sv.compare_intra(df, df['target'] == 1, ["Val1", "Val2"])
compareReport.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"

Results in the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-d30e66adff60> in <module>
      1 #feature_config = sv.FeatureConfig(force_num=['col1', 'col2', 'col3', 'col4'], force_cat='target')
----> 2 compareReport = sv.compare_intra(df, df['target'] == 1, ["Val1", "Val2"])#, feat_cfg=feature_config)
      3 compareReport.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"

~\AppData\Local\Continuum\anaconda3\envs\sweetbug\lib\site-packages\sweetviz\sv_public.py in compare_intra(source_df, condition_series, names, target_feat, feat_cfg, pairwise_analysis)
     42     report = sweetviz.DataframeReport([data_true, names[0]], target_feat,
     43                                       [data_false, names[1]],
---> 44                                       pairwise_analysis, feat_cfg)
     45     return report
     46 

~\AppData\Local\Continuum\anaconda3\envs\sweetbug\lib\site-packages\sweetviz\dataframe_report.py in __init__(self, source, target_feature_name, compare, pairwise_analysis, fc)
    215             # start = time.perf_counter()
    216             self.progress_bar.set_description(':' + f.source.name + '')
--> 217             self._features[f.source.name] = sa.analyze_feature_to_dictionary(f)
    218             self.progress_bar.update(1)
    219             # print(f"DONE FEATURE------> {f.source.name}"

~\AppData\Local\Continuum\anaconda3\envs\sweetbug\lib\site-packages\sweetviz\series_analyzer.py in analyze_feature_to_dictionary(to_process)
     92         compare_type = determine_feature_type(to_process.compare,
     93                                               to_process.compare_counts,
---> 94                                               returned_feature_dict["type"], "COMPARED")
     95         if compare_type != FeatureType.TYPE_ALL_NAN and \
     96             source_type != FeatureType.TYPE_ALL_NAN:

~\AppData\Local\Continuum\anaconda3\envs\sweetbug\lib\site-packages\sweetviz\type_detection.py in determine_feature_type(series, counts, must_be_this_type, which_dataframe)
     78             var_type = FeatureType.TYPE_TEXT
     79         else:
---> 80             raise TypeError(f"\nCannot convert series '{series.name}' in {which_dataframe} from its {var_type}\n"
     81                             f"to the desired type {must_be_this_type}.\nCheck documentation for the possible coercion possibilities.\n"
     82                             f"POSSIBLE RESOLUTIONS:\n"

TypeError: 
Cannot convert series 'target' in COMPARED from its TYPE_CATEGORICAL
to the desired type TYPE_BOOL.
Check documentation for the possible coercion possibilities.
POSSIBLE RESOLUTIONS:
 -> Use the feat_cfg parameter (see docs on git) to force the column to be a specific type (may or may not help depending on the type)
 -> Modify the source data to be more explicitly of a single specific type
 -> This could also be caused by a feature type mismatch between source and compare dataframes:
    In that case, make sure the source and compared data frames are compatible.

Error 2: Compiles report but does not show correct output for categorical variable

Now running with feat_cfg indicating 'target' column to be categorical:

feature_config = sv.FeatureConfig(force_cat='target')
compareReport = sv.compare_intra(df, df['target'] == 1, ["Val1", "Val2"], feat_cfg=feature_config)
compareReport.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"

Generates the report without errors, but on the report, the section for the variable 'target' does not show the contents for the dataframe "Val2", only values 1 (from the first dataframe of the comparison).

Success: use numerical feature instead of categorical.

When running the following code, with the 'target' variable as numerical, everything runs smoothly and the report shows results for both dataframes.

feature_config = sv.FeatureConfig(force_num='target')
compareReport = sv.compare_intra(df, df['target'] == 1, ["Val1", "Val2"], feat_cfg=feature_config)
compareReport.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"

fbdesignpro commented 4 years ago

@shgo thank you for the detailed follow-up! This makes sense and will be fixed in the next build (which should be soon). In a nutshell:

Error 1 is unavoidable: the source contains only 1's, so it is auto-detected as boolean (I think that is fair enough), but the comparison contains only 2's (detected as categorical, which makes sense). To avoid any ambiguity, the feature_config is needed to explicitly set the desired type on a column that has too little data to guess its base type.
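
To make the point about Error 1 concrete, here is a small pandas-only sketch of what each side of the comparison ends up containing. It assumes compare_intra splits the rows on the boolean condition roughly as shown (the names data_true and data_false are borrowed from the traceback above); it is an illustration, not sweetviz's actual code.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the report above (pandas only, sweetviz not needed).
np.random.seed(42)
df = pd.DataFrame(np.random.randn(10, 4), columns=["col1", "col2", "col3", "col4"])
df["target"] = [1] * 5 + [2] * 5

# Assumption: the intra-comparison splits rows on the condition like this.
condition = df["target"] == 1
data_true = df[condition]    # becomes the "source" side
data_false = df[~condition]  # becomes the "compared" side

# Each side is left with a single distinct value in 'target'.
true_values = sorted(data_true["target"].unique().tolist())
false_values = sorted(data_false["target"].unique().tolist())
print(true_values)   # [1]
print(false_values)  # [2]
```

With only one distinct value on each side, any automatic type guess for 'target' has very little to go on, which is why an explicit feat_cfg is the safer route here.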

Error 2 is what will be fixed; a previous fix for categorical data changed the data type of the index for categorical data and caused the mismatch that made the report wrong.

I am fixing this by explicitly converting all indices for distinct values to strings so they are always compatible.

Thanks again, I will let you know when the new build is up for you to test but hopefully that's it and I will close this at that point. :)

fbdesignpro commented 4 years ago

@shgo I committed a fix that actually only sets the index to be strings when dealing with categorical/string indices already. Changing to strings all the time actually caused mismatches when dealing with integer/float indices. I'm not sure that made sense, but this should be fixed in the repository right now, and will be part of the next beta5 which should go out shortly. :) I will let you know when you can verify if it fixed your problem.
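
For readers wondering what an index type mismatch looks like in practice, the following is a generic pandas sketch, not the sweetviz code itself. It shows that counts indexed by integers cannot be looked up with string labels (and vice versa) until both sides use the same label type; the variable names are made up for illustration.

```python
import pandas as pd

# Counts of distinct values, indexed by the integer values themselves.
counts = pd.Series([3, 2], index=[1, 2])

# Looking them up with string labels finds nothing: 1 != "1".
mismatched = counts.reindex(["1", "2"])
print(mismatched.isna().all())  # True

# Once the index labels are strings too, the lookup lines up again.
counts_str = counts.copy()
counts_str.index = counts_str.index.astype(str)
aligned = counts_str.reindex(["1", "2"])
print(aligned.tolist())  # [3, 2]
```

The same failure happens in the other direction, which is consistent with the note above that converting everything to strings broke lookups for integer/float indices.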

fbdesignpro commented 4 years ago

@shgo 1.0beta5 is released, the problems should be fixed! Fingers crossed! :)

fbdesignpro commented 4 years ago

@shgo just to make SUPER sure, can you confirm beta5 resolved your issue? Thank you again for all the details, I think it got us to fix this!

shgo commented 4 years ago

Hey @fbdesignpro, sure! Just tested it and everything ran smoothly! Thanks for the work mate!

fbdesignpro commented 4 years ago

@shgo awesome! Thanks for checking! :)

fbdesignpro / sweetviz

Integer feature with values 1 and 2 cannot be handled as categorical? #48