MAIF / shapash

🔅 Shapash: User-friendly Explainability and Interpretability to Develop Reliable and Transparent Machine Learning Models
https://maif.github.io/shapash/
Apache License 2.0

ValueError: The condensed distance matrix must contain only finite values. #472

Open sungla55guy opened 1 year ago

sungla55guy commented 1 year ago

Hi, I'm using generate_report with an LGBMClassifier for binary classification. My data has categorical features and missing values, which LightGBM can handle natively. I'm able to get the dashboard to run; however, when I try to generate a report with the following code:

xpl.generate_report(
    output_file='report.html', 
    project_info_file='model.yml',
    x_train=X_train,
    y_train=y_train,
    y_test=y_test,
    title_story="CCA Default Risk",
    metrics=[
        {
            'path': 'sklearn.metrics.f1_score',
            'name': 'f1 score',
        },
        {
            'path': 'sklearn.metrics.balanced_accuracy_score',
            'name': 'Balanced Accuracy',
        },
        {
            'path': 'sklearn.metrics.roc_auc_score',
            'name': 'ROC AUC',
        }
    ]
)

I get the following error:

PapermillExecutionError: 
---------------------------------------------------------------------------
Exception encountered at "In [8]":
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 report.display_dataset_analysis()

File ~\Miniconda3\envs\pandas2\lib\site-packages\shapash\report\project_report.py:284, in ProjectReport.display_dataset_analysis(self, global_analysis, univariate_analysis, target_analysis, multivariate_analysis)
    282 if multivariate_analysis:
    283     print_md("### Multivariate analysis")
--> 284     fig_corr = self.explainer.plot.correlations(
    285         self.df_train_test,
    286         facet_col='data_train_test',
    287         max_features=20,
    288         width=900 if len(self.df_train_test['data_train_test'].unique()) > 1 else 500,
    289         height=500,
    290     )
    291     print_html(plotly.io.to_html(fig_corr))
    292 print_md('---')

File ~\Miniconda3\envs\pandas2\lib\site-packages\shapash\explainer\smart_plotter.py:2296, in SmartPlotter.correlations(self, df, max_features, features_to_hide, facet_col, how, width, height, degree, decimals, file_name, auto_open)
   2294 if len(list_features) == 0:
   2295     top_features = compute_top_correlations_features(corr=corr, max_features=max_features)
-> 2296     corr = cluster_corr(corr.loc[top_features, top_features], degree=degree)
   2297     list_features = list(corr.columns)
   2299 fig.add_trace(
   2300     go.Heatmap(
   2301         z=corr.loc[list_features, list_features].round(decimals).values,
   (...)
   2308         hovertemplate=hovertemplate,
   2309     ), row=1, col=i+1)

File ~\Miniconda3\envs\pandas2\lib\site-packages\shapash\explainer\smart_plotter.py:2244, in SmartPlotter.correlations.<locals>.cluster_corr(corr, degree, inplace)
   2241     return corr
   2243 pairwise_distances = sch.distance.pdist(corr**degree)
-> 2244 linkage = sch.linkage(pairwise_distances, method='complete')
   2245 cluster_distance_threshold = pairwise_distances.max()/2
   2246 idx_to_cluster_array = sch.fcluster(linkage, cluster_distance_threshold, criterion='distance')

File ~\Miniconda3\envs\pandas2\lib\site-packages\scipy\cluster\hierarchy.py:1064, in linkage(y, method, metric, optimal_ordering)
   1061     raise ValueError("`y` must be 1 or 2 dimensional.")
   1063 if not np.all(np.isfinite(y)):
-> 1064     raise ValueError("The condensed distance matrix must contain only "
   1065                      "finite values.")
   1067 n = int(distance.num_obs_y(y))
   1068 method_code = _LINKAGE_METHODS[method]

ValueError: The condensed distance matrix must contain only finite values.

Python version: 3.9.16
Shapash version: 2.3.5
Operating System: Windows 10

guillaume-vignal commented 1 year ago

Thank you for reporting this bug; we'll fix it soon. Best regards.

ThomasBouche commented 12 months ago

Hi,

We have fixed this issue. You can try the new version, Shapash 2.3.7.

ekamioka commented 11 months ago

Hello @ThomasBouche, thanks for working on the issue.

I am afraid the issue is still open. I have just faced the same problem using version 2.3.7.

I think I understand the problem. The pandas DataFrame received as corr contains NaNs. As a result, pairwise_distances contains only NaNs, which triggers the error.

Looking at the compute_corr function that generates the corr matrix, we can see that df.corr() produces NaNs due to the presence of constant values (the standard deviation of a column with constant values is zero, which results in a division by zero in the correlation calculation).
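The propagation described above can be reproduced outside Shapash. The sketch below is only an illustration: the 3x3 matrix is made up by hand, and it mirrors the pdist and linkage calls that appear in the traceback rather than calling Shapash itself.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# Hand-built correlation matrix (illustrative values only):
# feature "c" has undefined (NaN) correlations with everything.
corr = pd.DataFrame(
    [[1.0, 0.5, np.nan],
     [0.5, 1.0, np.nan],
     [np.nan, np.nan, np.nan]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

# Every row contains a NaN, so every pairwise Euclidean distance is NaN.
pairwise_distances = pdist(corr.to_numpy())
all_nan = bool(np.isnan(pairwise_distances).all())

# linkage refuses non-finite distances and raises ValueError.
try:
    sch.linkage(pairwise_distances, method="complete")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(all_nan)
print(error_message)
```

A single feature with undefined correlations is enough here, because its NaN column appears in every row that pdist compares.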
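A minimal sketch of the constant-column case, assuming a plain pandas DataFrame with made-up column names (a, b, const). It only calls df.corr() directly and does not go through compute_corr, whose exact behaviour is not shown in this thread.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(size=50),
    "b": rng.normal(size=50),
    "const": np.ones(50),  # constant feature: zero variance
})

# The standard deviation of the constant column is zero...
const_std = float(df["const"].std())

# ...so its Pearson correlation with any column is undefined (NaN).
corr = df.corr()

print(const_std)
print(corr)
```

The row and column for const are entirely NaN, while the a/b block stays finite.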

ThomasBouche commented 11 months ago

Hello, do you have an example so that I can reproduce the error? I tried to trigger it with constant values, but no error was raised.

Furthermore, in the context of a machine learning model, in what cases does a feature have constant values?

Augustlnx commented 1 month ago

Hi! I think I've run into the same issue. It seems to be triggered quite easily when there are a lot of NaNs in the dataset. Are there any parameters I can set to skip this step?
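No such parameter is confirmed in this thread. As a rough sketch of a guard that could be applied to the correlation matrix before clustering (for example in a locally patched copy of cluster_corr), the hypothetical helper below drops features until no NaN remains. The name drop_undefined_correlations and the drop-the-worst-feature strategy are assumptions for illustration, not the maintainers' fix.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist


def drop_undefined_correlations(corr: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: remove features until corr contains no NaN.

    At each step, the feature with the most NaN entries is dropped
    from both the rows and the columns.
    """
    corr = corr.copy()
    while corr.isna().any().any():
        worst = corr.isna().sum(axis=1).idxmax()
        corr = corr.drop(index=worst, columns=worst)
    return corr


rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(size=50),
    "b": rng.normal(size=50),
    "c": rng.normal(size=50),
    "const": np.ones(50),  # yields NaN correlations
})

clean = drop_undefined_correlations(df.corr())

# With only finite values left, the clustering step no longer raises.
linkage = sch.linkage(pdist(clean.to_numpy()), method="complete")
print(list(clean.columns))
```

The trade-off is that dropped features disappear from the correlation plot, so a real fix might prefer to report them rather than remove them silently.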