Closed shgo closed 4 years ago
@shgo Thank you very much for the detailed writeup! Good catch! I believe it is probably an edge case of trying to automatically detect Boolean features. This is likely a bug, I will definitely check it out ASAP (it will have to be in the coming days, I am crunching at work...).
Hi @shgo! Turns out it was a good one! You are correct that the way to fix it is to force the automatically-detected boolean series to be categorical. However, there was a missing line of code to handle that case ;)
By the way, when using the feature config, you don't have to specify EVERY column, JUST the ones that are not being detected correctly or to your liking. (You might have known that, just making sure!)
So, long story short: it should be fixed in the latest 1.0beta4; let me know if that is the case. :) Thank you again!
Thanks for acting so fast @fbdesignpro!
@shgo Just to be SUPER sure, did you verify that it fixed it? :)
Just did it, but seems to hit another issue.
Error 1: TypeError
import sweetviz as sv
import pandas as pd
import numpy as np
np.random.seed(42)
np_data = np.random.randn(10, 4)
df = pd.DataFrame(np_data, columns=['col1', 'col2', 'col3', 'col4'])
df['target'] = 1.0
df.loc[5:, 'target'] = 2.  # .loc avoids chained assignment; rows 5..9 get the value 2
df['target'] = df['target'].astype(int)
#df['target'] += 10
compareReport = sv.compare_intra(df, df['target'] == 1, ["Val1", "Val2"])
compareReport.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"
Results in the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-d30e66adff60> in <module>
1 #feature_config = sv.FeatureConfig(force_num=['col1', 'col2', 'col3', 'col4'], force_cat='target')
----> 2 compareReport = sv.compare_intra(df, df['target'] == 1, ["Val1", "Val2"])#, feat_cfg=feature_config)
3 compareReport.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"
~\AppData\Local\Continuum\anaconda3\envs\sweetbug\lib\site-packages\sweetviz\sv_public.py in compare_intra(source_df, condition_series, names, target_feat, feat_cfg, pairwise_analysis)
42 report = sweetviz.DataframeReport([data_true, names[0]], target_feat,
43 [data_false, names[1]],
---> 44 pairwise_analysis, feat_cfg)
45 return report
46
~\AppData\Local\Continuum\anaconda3\envs\sweetbug\lib\site-packages\sweetviz\dataframe_report.py in __init__(self, source, target_feature_name, compare, pairwise_analysis, fc)
215 # start = time.perf_counter()
216 self.progress_bar.set_description(':' + f.source.name + '')
--> 217 self._features[f.source.name] = sa.analyze_feature_to_dictionary(f)
218 self.progress_bar.update(1)
219 # print(f"DONE FEATURE------> {f.source.name}"
~\AppData\Local\Continuum\anaconda3\envs\sweetbug\lib\site-packages\sweetviz\series_analyzer.py in analyze_feature_to_dictionary(to_process)
92 compare_type = determine_feature_type(to_process.compare,
93 to_process.compare_counts,
---> 94 returned_feature_dict["type"], "COMPARED")
95 if compare_type != FeatureType.TYPE_ALL_NAN and \
96 source_type != FeatureType.TYPE_ALL_NAN:
~\AppData\Local\Continuum\anaconda3\envs\sweetbug\lib\site-packages\sweetviz\type_detection.py in determine_feature_type(series, counts, must_be_this_type, which_dataframe)
78 var_type = FeatureType.TYPE_TEXT
79 else:
---> 80 raise TypeError(f"\nCannot convert series '{series.name}' in {which_dataframe} from its {var_type}\n"
81 f"to the desired type {must_be_this_type}.\nCheck documentation for the possible coercion possibilities.\n"
82 f"POSSIBLE RESOLUTIONS:\n"
TypeError:
Cannot convert series 'target' in COMPARED from its TYPE_CATEGORICAL
to the desired type TYPE_BOOL.
Check documentation for the possible coercion possibilities.
POSSIBLE RESOLUTIONS:
-> Use the feat_cfg parameter (see docs on git) to force the column to be a specific type (may or may not help depending on the type)
-> Modify the source data to be more explicitly of a single specific type
-> This could also be caused by a feature type mismatch between source and compare dataframes:
In that case, make sure the source and compared data frames are compatible.
Error 2: Compiles report but does not show correct output for categorical variable
Now running with feat_cfg indicating 'target' column to be categorical:
feature_config = sv.FeatureConfig(force_cat='target')
compareReport = sv.compare_intra(df, df['target'] == 1, ["Val1", "Val2"], feat_cfg=feature_config)
compareReport.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"
Generates the report without errors, but on the report, the section for the variable 'target' does not show the contents for the dataframe "Val2", only values 1 (from the first dataframe of the comparison).
Success: use numerical feature instead of categorical.
When running the following code, with the 'target' variable as numerical, everything runs smoothly and the report shows results for both dataframes.
feature_config = sv.FeatureConfig(force_num='target')
compareReport = sv.compare_intra(df, df['target'] == 1, ["Val1", "Val2"], feat_cfg=feature_config)
compareReport.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"
@shgo thank you for the detailed follow-up! This makes sense and will be fixed in the next build (which should be soon). In a nutshell:
Error 1 is unavoidable: the source contains only 1's, so it is auto-detected as boolean (I think that is fair enough), while the comparison contains only 2's (detected as categorical, which makes sense). To avoid any ambiguity, the feature_config is needed to explicitly set the desired type on a column that has too little data to guess its base type.
Error 2 is what will be fixed; a previous fix for categorical data changed the data type of the index for categorical data and caused the mismatch that made the report wrong.
I am fixing this by explicitly making all indices for distinct values to be strings so they are always compatible.
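To make the Error 1 explanation concrete, here is a minimal sketch using only pandas (not sweetviz itself). It mimics the split that compare_intra performs on the condition series and shows that each half of 'target' ends up with a single distinct value, which is why the two halves can be guessed as different types. The variable names are illustrative and not part of the library.

```python
import pandas as pd

# Same shape as the 'target' column in the report above: five 1s, then five 2s.
target = pd.Series([1] * 5 + [2] * 5, name="target")
condition = target == 1

# Mimic the split by condition: rows where it is True vs. rows where it is False.
source_half = target[condition]
compare_half = target[~condition]

# Each half contains exactly one distinct value, so there is very little
# information left for any automatic type detection to work with.
source_values = sorted(source_half.unique().tolist())
compare_values = sorted(compare_half.unique().tolist())
print("source distinct values: ", source_values)
print("compare distinct values:", compare_values)
```

With so little variety in each half, stating the type explicitly through feat_cfg removes the guesswork.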
Thanks again, I will let you know when the new build is up for you to test but hopefully that's it and I will close this at that point. :)
@shgo I committed a fix that actually only sets the index to be strings when dealing with categorical/string indices already. Changing to strings all the time actually caused mismatches when dealing with integer/float indices. I'm not sure that made sense, but this should be fixed in the repository right now, and will be part of the next beta5 which should go out shortly. :) I will let you know when you can verify if it fixed your problem.
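The kind of index mismatch described above can be illustrated with plain pandas. This is only a sketch of the general problem, not the actual sweetviz code: the two count series and their values are hypothetical, and the point is simply that integer labels and string labels do not line up when one series is reindexed against the other.

```python
import pandas as pd

# Hypothetical counts of distinct values for the same column in two dataframes.
# One is indexed by integers, the other by strings (e.g. after a cast).
source_counts = pd.Series([5, 0], index=[1, 2])
compare_counts = pd.Series([0, 5], index=["1", "2"])

# Looking up the integer labels in the string-indexed series finds nothing:
# every entry comes back as NaN, so the compared data appears to be missing.
mismatched = compare_counts.reindex(source_counts.index)
print(mismatched)

# Once both sides use the same label type, the lookup succeeds.
aligned = compare_counts.reindex(source_counts.index.astype(str))
print(aligned)
```

The same reasoning applies in the other direction: converting labels to strings on only one side of an integer/float comparison would introduce the mismatch rather than remove it.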
@shgo 1.0beta5 is released, the problems should be fixed! Fingers crossed! :)
@shgo just to make SUPER sure, can you confirm beta5 resolved your issue? Thank you again for all the details, I think it got us to fix this!
Hey @fbdesignpro, sure! Just tested it and everything ran smoothly! Thanks for the work mate!
@shgo awesome! Thanks for checking! :)
Hey guys, I'm getting an error when handling integer columns but the error message is not very clear for me to understand what is going on. So far it looks like a bug to me. Here it goes.
We start by importing basic stuff and generate a pandas dataframe with 4 columns containing random real numbers, plus an integer column named 'target' with values 1 and 2.
Taking a look at the original types of the dataframe (df.dtypes), we have as a result:
col1 float64
col2 float64
col3 float64
col4 float64
target int32
dtype: object
Error: TypeError
gives this message:
If I explicitly supply the feat_cfg argument the result is the same.
However, if I add 10 to the 'target' column (it will now have 11 and 12 as values), the report is generated without errors. Am I missing something, or is it indeed a bug?