evidentlyai / evidently

Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
https://www.evidentlyai.com/evidently-oss
Apache License 2.0
5.2k stars, 586 forks

evidently slow in docker #1054

Open ankita2020 opened 6 months ago

ankita2020 commented 6 months ago

Evidently takes more time when running in Docker than when running locally, where it creates all charts quickly. The prediction data has 4000 rows and the reference data has 20000 rows. Numerical method: KS test, numerical threshold: 0.6. Categorical method: chi-square, categorical threshold: 0.4. The code creates 2 charts (data drift and correlation). I use a function

from evidently.metric_preset import DataDriftPreset
from evidently.report import Report


def detect_features_drift(reference, production, column_mapping, get_scores=False):
    """
    Returns a list of (feature, value) tuples, one per numerical and categorical feature.
    If get_scores is True, value is the drift score (like p-value) for the feature;
    otherwise value is True if Data Drift is detected for the feature, else False.
    The Data Drift detection depends on the confidence level and the threshold.
    For each individual feature Data Drift is detected with the selected confidence (default value is 0.95).
    """

    data_drift_report = Report(metrics=[DataDriftPreset()])
    # print(production.head())
    data_drift_report.run(reference_data=reference, current_data=production, column_mapping=column_mapping)
    report = data_drift_report.as_dict()

    drifts = []
    num_features = column_mapping.numerical_features if column_mapping.numerical_features else []
    cat_features = column_mapping.categorical_features if column_mapping.categorical_features else []
    # print(production.columns.tolist(),'current')
    # print(report["metrics"][1]["result"]["drift_by_columns"],'report')
    # metrics[1] is expected to hold the per-column drift results of DataDriftPreset
    drift_by_columns = report["metrics"][1]["result"]["drift_by_columns"]
    key = "drift_score" if get_scores else "drift_detected"
    for feature in num_features + cat_features:
        drifts.append((feature, drift_by_columns[feature][key]))

    return drifts 

in my code, and that function is taking a lot of time.
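To put a number on "a lot of time", the same call could be timed in both environments. This is only a minimal sketch: `time_call` is a hypothetical helper, not part of Evidently, and the `sum` workload is a stand-in for the real call to `detect_features_drift`.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn(*args, **kwargs) `repeats` times; return (last_result, list_of_seconds)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return result, timings


if __name__ == "__main__":
    # Stand-in workload; in the real code this would be something like
    # time_call(detect_features_drift, reference, production, column_mapping)
    result, timings = time_call(sum, range(1_000_000))
    print(f"min={min(timings):.4f}s  max={max(timings):.4f}s  runs={len(timings)}")
```

Running this once locally and once inside the container, with the same data, would show whether the slowdown is in the Evidently call itself or somewhere around it, such as data loading or chart rendering.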

Python version: 3.8

Libraries:

asynch==0.2.3
certifi==2022.12.7
charset-normalizer==3.1.0
ciso8601==2.3.1
click==8.1.3
clickhouse-cityhash==1.0.2.4
clickhouse-driver==0.2.6
clickhouse-sqlalchemy==0.2.5
dnspython==2.3.0
email-validator==1.3.1
evidently==0.2.7
fastapi==0.95.0
greenlet==2.0.2
gunicorn==21.2.0
h11==0.14.0
idna==3.4
joblib==1.2.0
leb128==1.0.5
lz4==4.3.3
nltk==3.8.1
numpy==1.24.2
packaging==23.0
pandas==1.5.3
pandasql==0.7.3
patsy==0.5.3
plotly==5.14.0
psycopg2==2.9.9
pydantic==1.10.7
pyspark==3.5.0
python-dateutil==2.8.2
python-multipart==0.0.6
pytz==2023.3
PyYAML==6.0
regex==2023.3.23
requests==2.28.2
scikit-learn==1.2.2
scipy==1.10.1
six==1.16.0
sniffio==1.3.0
SQLAlchemy==1.4.52
starlette==0.26.1
statsmodels==0.13.5
tenacity==8.2.2
threadpoolctl==3.1.0
tqdm==4.65.0
typing_extensions==4.9.0
tzlocal==2.1
urllib3==1.26.15
uvicorn==0.21.1
zstd==1.5.5.1

elenasamuylova commented 6 months ago

Hi @ankita2020 - you seem to be using Evidently version 0.2.7. The latest version is 0.4.18, and it includes various improvements, including some that speed up drift calculations. We recommend upgrading to this version - let us know if you observe any issues with it.

ankita2020 commented 6 months ago

Hi @elenasamuylova - I used the same version locally, and it still takes less time to generate all the charts. However, running the code in Docker with the same version takes approximately three times longer. If the version were the issue, it should take more time on my local machine as well. Are there any Ubuntu-specific versions required in Docker, or do you have any other suggestions? My local machine runs Ubuntu 22.04, and the Docker image uses Ubuntu 18.
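Since the same code is about three times slower only in Docker, it may help to compare what each environment actually provides. The sketch below is a generic diagnostic, not something taken from Evidently: `describe_environment` and `read_first_line` are hypothetical helpers, and the cgroup paths are the usual Linux locations, which may be absent depending on the host and container runtime. It prints the CPUs visible to the process, any CPU quota set on the container, and the installed versions of a few relevant packages.

```python
import os
import platform
from importlib import metadata
from pathlib import Path


def read_first_line(path):
    """Return the first line of a file, or None if it is missing, unreadable, or empty."""
    try:
        return Path(path).read_text().splitlines()[0].strip()
    except (OSError, IndexError):
        return None


def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def describe_environment(packages=("evidently", "numpy", "scipy", "pandas")):
    """Collect basic facts that commonly differ between a host and a container."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "cpu_count": os.cpu_count(),
        # CPUs this process may run on (Linux only)
        "usable_cpus": len(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else None,
        # CPU quota files; None when the file does not exist
        "cgroup_v2_cpu_max": read_first_line("/sys/fs/cgroup/cpu.max"),
        "cgroup_v1_cfs_quota_us": read_first_line("/sys/fs/cgroup/cpu/cpu.cfs_quota_us"),
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
    }
    for name in packages:
        info[f"version:{name}"] = package_version(name)
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If the container reports fewer usable CPUs, a CPU quota, or different package versions than the local machine, that difference would be a reasonable first suspect before looking at the Ubuntu version.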