MAIF / shapash

🔅 Shapash: User-friendly Explainability and Interpretability to Develop Reliable and Transparent Machine Learning Models
https://maif.github.io/shapash/
Apache License 2.0

[Bug] ValueError caused by column with nan values #554

Closed tswsxk closed 2 months ago

tswsxk commented 3 months ago

When using shapash with the following code:

xpl = SmartExplainer(
    model=model,
)

xpl.compile(
    x=test_df,
    ...
)

it calls init_data in the SmartApp class, where the following code computes the std of each float column (lines 192-199):

        for col in list(self.dataframe.columns):
            typ = self.dataframe[col].dtype
            if typ == float:
                std = self.dataframe[col].std()
                if std != 0:
                    digit = max(round(log10(1 / std) + 1) + 2, 0)
                    self.round_dataframe[col] = self.dataframe[col].map(f"{{:.{digit}f}}".format).astype(float)

However, when std is NaN, as happens for a column containing only NaN values, the check std != 0 still passes (NaN compares unequal to 0), so the following line is executed:

                    digit = max(round(log10(1 / std) + 1) + 2, 0)

and this results in a ValueError:

File "xxx/.local/lib/python3.10/site-packages/shapash/webapp/smart_app.py", line 197, in init_data
    digit = max(round(log10(1 / std) + 1) + 2, 0)
ValueError: cannot convert float NaN to integer
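
The failure can be reproduced outside shapash. The following is a minimal sketch using only pandas and the standard library; the DataFrame and its column names are made up for illustration and are not taken from the original data.

```python
from math import log10

import numpy as np
import pandas as pd

# Hypothetical data: one column with a single missing value, one all-NaN column.
df = pd.DataFrame(
    {
        "some_nan": [1.0, np.nan, 3.0],
        "all_nan": [np.nan, np.nan, np.nan],
    }
)

# pandas skips NaN by default, so a partially missing column has a finite std.
partial_std = df["some_nan"].std()

# With no non-missing values at all, the std itself is NaN.
std = df["all_nan"].std()

# NaN compares unequal to everything, so the `std != 0` guard does not help.
passes_check = bool(std != 0)

try:
    digit = max(round(log10(1 / std) + 1) + 2, 0)
    error_message = None
except ValueError as exc:
    # round() cannot turn a NaN float into an integer.
    error_message = str(exc)
```

Here `passes_check` is true and `error_message` holds the same text as the traceback above, while `partial_std` stays finite.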

Python version : python3.10

Shapash version : shapash-2.5.0

Operating System : CentOS Linux release 8.2.2.2004

guillaume-vignal commented 2 months ago

Thank you for reporting this issue. If I understand correctly, the column must contain only NaN values, is that right? The code should not fail if a column has just some NaN values in it. Can you tell us why such a column is used in your model? We don't see the use case. In any case, we will look at your PR and see the best way to tackle this issue.

tswsxk commented 2 months ago

Hi there. Usually, all-NaN columns can be removed during data preprocessing. However, in my case the data consists of several time series, and one column contains only NaN values before a specific date. When I run experiments, I need to split the data by date for training, and the subset before that date then has an all-NaN column, which causes this error.
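
As a rough illustration of this situation, the sketch below builds a small time series and splits it by date. The column names, dates, and values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical time series: "sensor_b" has no readings before 2021-01-04.
dates = pd.date_range("2021-01-01", periods=6, freq="D")
ts = pd.DataFrame(
    {
        "date": dates,
        "sensor_a": [0.1, 0.4, 0.2, 0.9, 0.5, 0.7],
        "sensor_b": [np.nan, np.nan, np.nan, 1.2, 1.5, 1.1],
    }
)

cutoff = pd.Timestamp("2021-01-04")
early = ts[ts["date"] < cutoff]   # rows before the cutoff
late = ts[ts["date"] >= cutoff]   # rows on or after the cutoff

early_std = early["sensor_b"].std()  # NaN: every value in this split is missing
late_std = late["sensor_b"].std()    # finite: this split has real values
```

The column is perfectly usable over the full date range, but any split that ends before the cutoff contains only NaN values for it.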

guillaume-vignal commented 2 months ago

This has been fixed in version 2.6.0 of shapash (https://github.com/MAIF/shapash/pull/553).
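
For anyone who cannot upgrade, one possible way to guard the rounding step is sketched below. This is not necessarily what the linked PR does; `round_float_columns` is a hypothetical standalone helper that mirrors the loop quoted in the issue and additionally skips columns whose std is not finite.

```python
from math import isfinite, log10

import numpy as np
import pandas as pd


def round_float_columns(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: round float columns based on their std.

    Columns whose std is zero or not finite (e.g. all-NaN columns)
    are left unchanged.
    """
    rounded = dataframe.copy()
    for col in dataframe.columns:
        if dataframe[col].dtype == float:
            std = dataframe[col].std()
            # isfinite() rejects NaN and inf before round() ever sees them.
            if isfinite(std) and std != 0:
                digit = max(round(log10(1 / std) + 1) + 2, 0)
                rounded[col] = (
                    dataframe[col].map(f"{{:.{digit}f}}".format).astype(float)
                )
    return rounded


# Example with made-up data: "x" is rounded, "empty" is left as all-NaN.
example = pd.DataFrame(
    {"x": [1.23456, 2.34567, 3.45678], "empty": [np.nan, np.nan, np.nan]}
)
result = round_float_columns(example)
```

Checking `isfinite(std)` rather than only `std != 0` matters because NaN passes the inequality test.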