fbdesignpro / sweetviz

Visualize and compare datasets, target values and associations, with one line of code.
MIT License

ValueError when analyzing with numeric target and other categorical feature #65

Closed Mol1hua closed 3 years ago

Mol1hua commented 4 years ago

Hello,

I've been enjoying trying out the sweetviz library! I am using a numeric target variable "dummy_overall_status" and have some categorical features in the data set, e.g. "test_device". Unfortunately, when I run

my_report = sv.analyze(df, pairwise_analysis = "off", target_feat = "dummy_overall_status")

I get the following error message for my categorical variable "test_device":

ValueError                                Traceback (most recent call last)
<ipython-input-139-6684b3ff344b> in <module>
      3 #analyzing the dataset
      4 #rt_funktionstest_report = sv.analyze(df, target_feat="dummy_overall_status", pairwise_analysis = "off")
----> 5 rt_cb_funktionstest_report = sv.analyze(df, pairwise_analysis = "off", target_feat = "dummy_overall_status")

~\Anaconda3\lib\site-packages\sweetviz\sv_public.py in analyze(source, target_feat, feat_cfg, pairwise_analysis)
     10             feat_cfg: FeatureConfig = None,
     11             pairwise_analysis: str = 'auto'):
---> 12     report = sweetviz.DataframeReport(source, target_feat, None,
     13                                       pairwise_analysis, feat_cfg)
     14     return report

~\Anaconda3\lib\site-packages\sweetviz\dataframe_report.py in __init__(self, source, target_feature_name, compare, pairwise_analysis, fc)
    217             # start = time.perf_counter()
    218             self.progress_bar.set_description(':' + f.source.name + '')
--> 219             self._features[f.source.name] = sa.analyze_feature_to_dictionary(f)
    220             self.progress_bar.update(1)
    221             # print(f"DONE FEATURE------> {f.source.name}"

~\Anaconda3\lib\site-packages\sweetviz\series_analyzer.py in analyze_feature_to_dictionary(to_process)
    136         sweetviz.series_analyzer_numeric.analyze(to_process, returned_feature_dict)
    137     elif returned_feature_dict["type"] == FeatureType.TYPE_CAT:
--> 138         sweetviz.series_analyzer_cat.analyze(to_process, returned_feature_dict)
    139     elif returned_feature_dict["type"] == FeatureType.TYPE_BOOL:
    140         sweetviz.series_analyzer_cat.analyze(to_process, returned_feature_dict)

~\Anaconda3\lib\site-packages\sweetviz\series_analyzer_cat.py in analyze(to_process, feature_dict)
    143     do_detail_categorical(to_process, feature_dict)
    144 
--> 145     feature_dict["minigraph"] = GraphCat("mini", to_process)
    146     feature_dict["detail_graphs"] = list()
    147     feature_dict["detail_graphs"].append(GraphCat("detail", to_process))

~\Anaconda3\lib\site-packages\sweetviz\graph_cat.py in __init__(self, which_graph, to_process)
    190                             ~to_process.source.isin(names_excluding_others)])[0]
    191                     else:
--> 192                         tick_num = sv_math.count_fraction_of_true(to_process.source_target[ \
    193                             to_process.source == name])[0]
    194                     target_values_source.append(tick_num)

~\Anaconda3\lib\site-packages\sweetviz\sv_math.py in count_fraction_of_true(series)
      6     # We are assuming this is called by a Boolean series
      7     if series.dtype != np.bool:
----> 8         raise ValueError
      9     num_true = series.sum()
     10     total = float(series.count())

ValueError: 
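For anyone hitting the same traceback: the likely mechanism (a sketch based on the `sv_math.count_fraction_of_true` frame above, not on sweetviz internals beyond what the traceback shows) is pandas dtype promotion. A Boolean series containing even one NaN can no longer be stored as plain `bool`, so the strict dtype check raises. The helper below mirrors the shape of that check; `np.bool_` is used here because the bare `np.bool` alias seen in the traceback is deprecated in newer NumPy.

```python
import numpy as np
import pandas as pd

# A clean Boolean series keeps dtype bool...
clean = pd.Series([True, False, True])
print(clean.dtype)  # bool

# ...but a single NaN promotes the dtype (to object here), because
# NaN cannot be represented in a plain bool array.
with_nan = pd.Series([True, False, np.nan])
print(with_nan.dtype)  # object

# Sketch of a strict dtype guard like the one in the traceback:
# anything other than a pure bool series is rejected outright.
def count_fraction_of_true(series: pd.Series) -> float:
    if series.dtype != np.bool_:
        raise ValueError("expected a Boolean series")
    return series.sum() / float(series.count())

print(count_fraction_of_true(clean))  # 0.666...
```

So the "categorical feature" in the error message is a red herring: the series being checked is the target slice (`to_process.source_target[...]`), and NaN in the target is what breaks the dtype check.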

The traceback is followed by a class distribution bar chart.

This error does not occur when I run "analyze" without the target_feat parameter! It looks like the function wrongly assumes that test_device is a boolean series, but it contains only strings (no NaN either).

Is there a workaround? Thank you very much!

Mol1hua commented 4 years ago

I got it to run for now: I noticed I had NaN values in the target variable! I still don't understand how that caused the error above, but I am happy it is running now. :-)
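For reference, the workaround boils down to dropping rows with a missing target before calling `analyze`. A minimal sketch (the dataframe and column names are made up to match this thread; the `sv.analyze` call is commented out since it needs sweetviz installed):

```python
import pandas as pd

# Toy frame mimicking the thread's setup: a 0/1 target with one
# missing value, plus a string-valued categorical feature.
df = pd.DataFrame({
    "dummy_overall_status": [1, 0, None, 1],
    "test_device": ["A", "B", "A", "C"],
})

# Drop only the rows where the *target* is missing; NaN in other
# features is handled by sweetviz itself.
clean_df = df.dropna(subset=["dummy_overall_status"])
print(len(df), len(clean_df))  # 4 3

# import sweetviz as sv
# my_report = sv.analyze(clean_df, pairwise_analysis="off",
#                        target_feat="dummy_overall_status")
```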

fbdesignpro commented 3 years ago

Hey @Mol1hua! Thank you so much for the report! Apologies for the delay in answering; it's been a weird month on my side.

I did take a look at the issue and made a couple of fixes. However, the more I dug in, the more I realized that having NaN fields in the target variable can lead to confusion for the user.

For example, how should the target distribution be interpreted if, say, 60% of the target data were missing? I fear this would lead people to glance at a graph and make generalizations about the target without realizing a ton of data is missing. I know this happens with "regular" features as well, but missing data is outlined much more clearly in those cases, and it's hard to do the same for target data in every graph.

So, I am leaning towards not allowing target analysis unless the target has no missing data (in both the source and the compared dataframe), so that target interpretation is unambiguous. This will be in the next version.
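The validation described above could look something like the sketch below. This is a hypothetical pre-check, not sweetviz's actual code; the function name `validate_target` is made up for illustration.

```python
import pandas as pd

# Hypothetical guard: refuse target analysis when the target column
# contains any missing values, so the reported distribution is
# never silently computed over a subset of the rows.
def validate_target(df: pd.DataFrame, target_feat: str) -> None:
    n_missing = int(df[target_feat].isna().sum())
    if n_missing:
        raise ValueError(
            f"Target '{target_feat}' has {n_missing} missing value(s); "
            "drop or impute them before requesting target analysis."
        )

# A target with NaN is rejected with a clear message...
try:
    validate_target(pd.DataFrame({"y": [1, 0, None]}), "y")
except ValueError as exc:
    print(exc)

# ...while a complete target passes silently.
validate_target(pd.DataFrame({"y": [1, 0, 1]}), "y")
```

Failing fast at the entry point would also replace the bare `raise ValueError` deep inside `sv_math.count_fraction_of_true` with an actionable message.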

Again, thank you for your detailed reports, and if you have any further comments on this, don't hesitate to let me know.