fbdesignpro / sweetviz

Visualize and compare datasets, target values and associations, with one line of code.
MIT License

Wrong values of % target #116

Closed: sebastien-foulle closed this issue 9 months ago

sebastien-foulle commented 2 years ago

Hello,

The HTML report produced by the following script shows that % target is below 90% when bill_length_mm <= 35, and above 105% (!) when 35 <= bill_length_mm <= 37.5.

image

import pandas as pd
from palmerpenguins import load_penguins
import sweetviz as sv

penguins = load_penguins()
# Boolean target: True for Adelie penguins, False for every other species
penguins["target"] = penguins.species == "Adelie"
penguins = penguins[["species", "bill_length_mm", "target"]]
penguins.head()

# Build the report with "target" as the target feature and open it in the browser
my_report = sv.analyze(penguins, target_feat="target")
my_report.show_html()

But in fact, if bill_length_mm <= 40, % target should always be 100%: there are only Adelie penguins in this range.

penguins.query('bill_length_mm <= 40').species.value_counts()
# Adelie    100

Maybe it's a rounding problem.
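
To rule out the data itself, the expected % target per bin can be recomputed by hand. Below is a minimal, dependency-free sketch of that check. The helper name target_rate_per_bin, the bin edges and the sample values are all made up for illustration, and the right-closed binning is an assumption, not necessarily what sweetviz does internally. Whatever the binning, a proportion computed this way can never leave the 0% to 100% range.

```python
from bisect import bisect_left


def target_rate_per_bin(values, targets, edges):
    """Return {(low, high): fraction of True targets} for right-closed bins (low, high]."""
    counts = {}  # bin index -> [rows in bin, rows with target True]
    for value, target in zip(values, targets):
        if value is None or value != value:  # skip missing values (None or NaN)
            continue
        idx = bisect_left(edges, value) - 1  # bin i covers (edges[i], edges[i + 1]]
        if idx < 0 or idx >= len(edges) - 1:  # value falls outside the binned range
            continue
        rows_and_hits = counts.setdefault(idx, [0, 0])
        rows_and_hits[0] += 1
        rows_and_hits[1] += bool(target)
    return {(edges[i], edges[i + 1]): hits / rows
            for i, (rows, hits) in sorted(counts.items())}


# Hypothetical sample, not the real penguins data
bill_lengths = [33.1, 34.6, 36.2, 37.5, 39.0, 41.1, 42.5, 44.0]
is_adelie = [True, True, True, True, True, True, False, False]

rates = target_rate_per_bin(bill_lengths, is_adelie, edges=[30, 35, 40, 45])
for (low, high), rate in rates.items():
    print(f"({low}, {high}]: {rate:.0%}")
```

If the hand-computed rates stay within bounds while the report shows values above 100%, the problem is in how the report aggregates or draws the bins rather than in the data.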

fbdesignpro commented 2 years ago

@sebastien-foulle thank you for reporting this, I will take a look!

makotu1208 commented 2 years ago

I am experiencing the same issue. How is the investigation and fix progressing?

cwzkevin commented 1 year ago

I have a similar issue! The example_data.pkl file is attached as example_data.pkl.zip.

The code to reproduce the result:

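import pandas as pd
import sweetviz as sv
# Assumption: the attached archive unzips to a pickled pandas DataFrame
example_data = pd.read_pickle('example_data.pkl')

feature_config = sv.FeatureConfig(force_cat=['numerical_var'])
correct_report = sv.analyze([example_data, 'Train'],
                            target_feat='outcome',
                            feat_cfg=feature_config,
                            pairwise_analysis='off')
correct_report.show_html('correct_report.html')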

# Same data and settings, but numerical_var is forced to be numerical
feature_config = sv.FeatureConfig(force_num=['numerical_var'])
wrong_report = sv.analyze([example_data, 'Train'],
                          target_feat='outcome',
                          feat_cfg=feature_config,
                          pairwise_analysis='off')
wrong_report.show_html('wrong_report.html')

When we force_cat the numerical_var, we can get the correct distribution of the outcome:

[image: correct_need_to_force_cat]

If we force_num the numerical_var, the outcome distribution is completely off:

[image: wrong_as_numerical]
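
One way to see that the numerical view is wrong rather than just different: when several values are pooled into one bin, the bin's outcome rate is the count-weighted average of the per-value rates, so it has to lie between the smallest and largest of them. The sketch below illustrates this with made-up data. outcome_counts and pooled_rate are hypothetical helpers, and the values are not taken from example_data.pkl.

```python
from collections import Counter


def outcome_counts(values, outcomes):
    """Return {value: (positives, rows)} for a discrete feature."""
    rows = Counter(values)
    positives = Counter(v for v, o in zip(values, outcomes) if o)
    return {v: (positives[v], rows[v]) for v in rows}


def pooled_rate(counts, members):
    """Outcome rate of a bin that pools the given values together."""
    positives = sum(counts[v][0] for v in members)
    rows = sum(counts[v][1] for v in members)
    return positives / rows


# Hypothetical data: one discrete feature and a binary outcome
numerical_var = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4]
outcome = [1, 0, 1, 1, 0, 0, 0, 1, 0, 1]

counts = outcome_counts(numerical_var, outcome)
per_value = {v: p / n for v, (p, n) in counts.items()}  # the "categorical" view
low_bin = pooled_rate(counts, [1, 2])   # the "numerical" view, values 1 and 2 pooled
high_bin = pooled_rate(counts, [3, 4])  # values 3 and 4 pooled

print(per_value)
print(low_bin, high_bin)
```

A binned rate that falls outside the range of the per-value rates it covers cannot come from the data alone.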

fbdesignpro commented 9 months ago

Fixed by 2ec0848!