Giskard-AI / giskard

🐢 Open-Source Evaluation & Testing for LLMs and ML models
https://docs.giskard.ai
Apache License 2.0

Giskard scan crashes when run on a large number of features #1974

Open dzaridis opened 5 days ago

dzaridis commented 5 days ago

Issue Type

Bug

Source

source

Giskard Library Version

2.14.0

Giskard Hub Version

2.14.0

OS Platform and Distribution

Linux Ubuntu 20.04

Python version

3.9

Installed python packages

numpy==1.23.5
pandas==2.0.3
pyarrow==16.0.0
openpyxl==3.1.2
scikit-learn==1.3.1
xgboost==1.7.6
featurewiz==0.3.2

Current Behaviour?

I ran the scan on a test dataset of 100 samples and ~3700 features, and an OOM error occurred.
The model is a pipeline of data transformers, featurewiz feature selection, and an XGBoost model.
I have run the library on 2 other use cases with fewer than 100 features and it ran smoothly, so I suspect the issue is related to the large number of features.

Running on 48 GB of RAM.
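
For scale, here is a rough estimate of the raw feature matrix size. It assumes every feature is stored as a 64-bit float, which is an assumption since the column dtypes are not listed in this report. If it holds, the data itself is a few megabytes, which suggests the memory pressure comes from work done during the scan rather than from the dataset.

```python
# Back-of-the-envelope size of the raw feature matrix.
# Assumption: every value is a 64-bit float (8 bytes); object/string
# columns would take more space than this estimate.
n_samples = 100
n_features = 3700
bytes_per_value = 8

raw_bytes = n_samples * n_features * bytes_per_value
raw_megabytes = raw_bytes / 1e6

print(f"{raw_bytes} bytes = {raw_megabytes:.2f} MB")
```

On the real data, `df.memory_usage(deep=True).sum()` would give the actual figure, including any object columns.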

Standalone code OR list down the steps to reproduce the issue

import pandas as pd
import numpy as np
from giskard import Dataset, Model, scan

# Class to create the model and Dataset
class VulnerabilityDetection:
    def __init__(self, df: pd.DataFrame, model_instance):
        self.model_instance = model_instance
        self.df = df

    def gisk_dataset(self):
        # Treat every object-dtype column as categorical
        CATEGORICAL_COLUMNS = list(self.df.select_dtypes(include="object").columns)
        giskard_dataset = Dataset(
                df=self.df,
                target="Target",
                name="",
                cat_columns=CATEGORICAL_COLUMNS,
                )
        return giskard_dataset

    def gisk_model(self):
        model_inst = self.model_instance

        def prediction_function(df: pd.DataFrame) -> np.ndarray:
            return model_inst.predict_proba(df)

        giskard_model = Model(
            model=prediction_function,
            model_type="classification",
            name="Vulnerability Detection Model",
            classification_labels=model_inst.classes_,
            feature_names=self.df.columns
        )
        return giskard_model

# Execution
import pickle

df = pd.read_csv("MyData")
with open("XGBoost_pipeline.pkl", "rb") as file:
    xg_pipeline = pickle.load(file)

vd = VulnerabilityDetection(df, xg_pipeline)
gisk_dataset = vd.gisk_dataset()
gisk_model = vd.gisk_model()

# Run the scan; this is the step during which memory runs out
results = scan(gisk_model, gisk_dataset)

Relevant log output

The VS Code notebook itself crashed with an OOM error.
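
Since the crash leaves no log, one way to get numbers is to measure peak memory on a smaller run, for example a subset of the features, and watch how it grows. Below is a minimal sketch using the standard-library `tracemalloc` module. The helper name `run_with_peak_memory` and the stand-in workload are hypothetical. `tracemalloc` only sees allocations made through Python's allocator, so memory used inside native libraries may be undercounted.

```python
import tracemalloc


def run_with_peak_memory(fn, *args, **kwargs):
    """Run fn and return (result, peak_bytes) for Python-tracked allocations."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


# Stand-in workload (hypothetical); replace with the call being investigated.
def allocate(n):
    return [0] * n


result, peak = run_with_peak_memory(allocate, 1_000_000)
print(f"peak: {peak / 1e6:.1f} MB")
```

The same wrapper could be put around a scan of a reduced dataset to see whether peak memory scales with the number of features.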

dzaridis commented 5 days ago

The dataset I am using comes from radiomics (medical imaging), where all the features contribute to the model's decision, so I cannot isolate specific features. Updating the logic behind the scan might be beneficial. For instance, for a large number of features, the scan could proceed with batch processing and merge the per-batch results into a total at the end.
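
The batching idea can be sketched from the user side as follows. This is not how Giskard implements the scan; `feature_batches` and `scan_in_batches` are hypothetical helper names, and the scan itself is passed in as a callable so the sketch stays independent of the library.

```python
from typing import Callable, List, Sequence, TypeVar

R = TypeVar("R")


def feature_batches(features: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split the feature names into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(features[i:i + batch_size])
        for i in range(0, len(features), batch_size)
    ]


def scan_in_batches(
    scan_fn: Callable[[List[str]], R],
    features: Sequence[str],
    batch_size: int,
) -> List[R]:
    """Call scan_fn once per feature batch and collect the per-batch results."""
    results = []
    for batch in feature_batches(features, batch_size):
        results.append(scan_fn(batch))
    return results


# Demo with a stand-in scan function that only reports the batch size.
demo_features = [f"f{i}" for i in range(3700)]
demo_results = scan_in_batches(lambda batch: len(batch), demo_features, 500)
print(demo_results)
```

With Giskard, `scan_fn` could be something like `lambda batch: scan(gisk_model, gisk_dataset, features=batch)`, assuming the `features` argument of `giskard.scan` restricts the scan to the listed columns. This yields one report per batch rather than a single merged report, and whether it actually avoids the OOM is untested here.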