Sinaptik-AI / pandas-ai

Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
https://pandas-ai.com

Sensitive data leakage #1374

Open pesmeriz opened 1 month ago

pesmeriz commented 1 month ago

System Info

OS version: MacOS Sequoia 15.0

My pyproject.toml

[project]
name = "pandasai-benchmark"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "numpy==1.26.4",
    "pandasai>=2.2.14",
    "python-decouple>=3.8",
    "pyyaml>=6.0.2",
]

🐛 Describe the bug

Using "enforce_privacy": True does not anonymize the data. Even if you set custom_head on your SmartDataframe, the Agent will always share the data from the original dataframe. My example:

from pandasai import SmartDataframe, Agent
from pandasai.llm.local_llm import LocalLLM

import pandas as pd
from pandasai.llm import BambooLLM
from decouple import config as cfg
import os

local = False 
bypass = True

if local:
    llm = LocalLLM(api_base="http://localhost:11434/v1", model="qwen2.5-coder:latest")
else:
    if bypass:
        llm = BambooLLM(api_key=cfg("BAMBOO_API_KEY"))
    else:
        os.environ["PANDASAI_API_KEY"] = cfg("BAMBOO_API_KEY")
        llm = BambooLLM()

employee_head = pd.DataFrame(
    [
        [1, "Pedro", 600, 1],
        [2, "Tone", 1200, 2],
        [3, "Turo", 900, 3],
        [4, "ks", 750, 4],
        [5, "none", 950, 2],
    ],
    columns=["id", "name", "salary", "department_id"],
)

employee = SmartDataframe(
    pd.DataFrame(
        [
            [1, "John Dow", 60000, 1],
            [2, "Jane Smith", 120000, 2],
            [3, "Taro Yamada", 90000, 3],
            [4, "Maria Silva", 75000, 4],
            [5, "Michal Johnson", 95000, 2],
        ],
        columns=["id", "name", "salary", "department_id"],
    ),
    custom_head=employee_head,
    config={"custom_head": employee_head, "llm": llm, "enable_cache": False, "enforce_privacy": True},
)

department_head = pd.DataFrame(
    [
        [1, "HR", 1000, 1],
        [2, "E", 5000, 2],
        [3, "M", 2000, 3],
        [4, "S", 3000, 4],
    ],
    columns=["id", "name", "budget", "country_id"],
)

department = SmartDataframe(
    pd.DataFrame(
        [
            [1, "Human Resources", 100000, 1],
            [2, "Engineering", 500000, 2],
            [3, "Marketing", 200000, 3],
            [4, "Sales", 300000, 4],
        ],
        columns=["id", "name", "budget", "country_id"],
    ),
    custom_head=department_head,
    config={"custom_head": department_head, "llm": llm, "enable_cache": False, "enforce_privacy": True},
)

country_head = pd.DataFrame(
    [
        [1, "United States"],
        [2, "Germany"],
        [3, "Japan"],
        [4, "Brazil"],
    ],
    columns=["id", "name"],
)

country = SmartDataframe(
    pd.DataFrame(
        [
            [1, "United States"],
            [2, "Germany"],
            [3, "Japan"],
            [4, "Brazil"],
        ],
        columns=["id", "name"],
    ),
    custom_head=country_head,
    config={"custom_head": country_head, "llm": llm, "enable_cache": False, "enforce_privacy": True},
)

agent = Agent(
    [country, employee, department],
    config={
        "llm": llm,
        "verbose": True,
        "enforce_privacy": True,
        "enable_cache": False,
    },
)
response = agent.chat(
    "show me the employees with a salary above 80k and their respective salary"
    # "pivot table of the average salary of employees cross department with country"
    # "Who earns the least out of the Germany?"
)
print(response)

You can check this in /pandasai/llm/bamboo_llm.py, line 18:

from typing import Optional

from ..helpers.request import Session
from ..prompts.base import BasePrompt
from .base import LLM

class BambooLLM(LLM):
    _session: Session

    def __init__(
        self, endpoint_url: Optional[str] = None, api_key: Optional[str] = None
    ):
        self._session = Session(endpoint_url=endpoint_url, api_key=api_key)

    def call(self, instruction: BasePrompt, _context=None) -> str:
        data = instruction.to_json()
        response = self._session.post("/llm/chat", json=data)
        return response["data"]

    @property
    def type(self) -> str:
        return "bamboo_llm"
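To see why this leaks, note that call() above posts instruction.to_json() to the remote endpoint as-is, so whatever the prompt serializer put into the instruction travels verbatim. The snippet below is a minimal illustration of that mechanism, not PandasAI's actual serializer code: it builds a prompt fragment the way the shape-plus-CSV dump is described above and shows the raw values end up in the payload.

```python
import pandas as pd

# Minimal illustration (NOT PandasAI's actual code): when a dataframe's rows
# are serialized into the prompt, the raw values reach the remote LLM verbatim.
employee = pd.DataFrame(
    [[1, "John Dow", 60000, 1]],
    columns=["id", "name", "salary", "department_id"],
)

# A prompt fragment shaped like the serializer's output: shape plus CSV dump.
prompt = f"dfs[0]:{len(employee)}x{len(employee.columns)}\n{employee.to_csv()}"

# The sensitive values are present in the payload that call() would POST.
assert "John Dow" in prompt and "60000" in prompt
print(prompt)
```

Nothing in this path consults enforce_privacy, which matches the behaviour reported above.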
dosubot[bot] commented 1 month ago

Hey @pesmeriz! 👋 I'm here to help you out with any issues or questions you have. If you need assistance, feel free to ask!

There is a known issue with the enforce_privacy setting not working as expected. The problem is that the current implementation does not actually perform any data anonymization based on the enforce_privacy setting. The enforce_privacy property is defined and can be set, but there is no logic that uses this setting to modify or anonymize the data [1].

A similar issue was discussed previously and is now closed. The suggested temporary fix involves modifying the convert_df_to_csv function in pandasai/helpers/dataframe_serializer.py so that it does not add the dataframe details when enforce_privacy is set to True [2].

Here is the suggested temporary fix:

Replace the following code in convert_df_to_csv:

# Add dataframe details
dataframe_info += f"\ndfs[{extras['index']}]:{df.rows_count}x{df.columns_count}\n{df.to_csv()}"

With this:

# TEMP FIX: Do not add dataframe details
df_without_sample_data = pd.DataFrame(columns=df.pandas_df.columns)
dataframe_info += f"\ndfs[{extras['index']}]:{df.rows_count}x{df.columns_count}\n{df_without_sample_data.to_csv()}"

This change will prevent the original dataframe details from being included when enforce_privacy is set to True.
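The intended behaviour can be sketched as a serializer that branches on the privacy flag. This is a hypothetical standalone helper for illustration, assuming the library's described "shape plus CSV" prompt format; it is not the actual convert_df_to_csv implementation, and serialize_df is a made-up name:

```python
import pandas as pd

def serialize_df(df: pd.DataFrame, index: int, enforce_privacy: bool) -> str:
    """Sketch of a privacy-aware serializer (hypothetical helper): emit only
    the schema when enforce_privacy is True, the full CSV dump otherwise."""
    if enforce_privacy:
        # Keep column names and shape; drop every row of sample data.
        body = pd.DataFrame(columns=df.columns).to_csv()
    else:
        body = df.to_csv()
    return f"\ndfs[{index}]:{len(df)}x{len(df.columns)}\n{body}"

df = pd.DataFrame([[1, "Jane Smith", 120000]], columns=["id", "name", "salary"])
private = serialize_df(df, 0, enforce_privacy=True)
public = serialize_df(df, 0, enforce_privacy=False)
assert "Jane Smith" not in private  # no sample rows in the private variant
assert "Jane Smith" in public
```

The key design point is that the column names still reach the LLM (so it can generate valid code), while the cell values do not.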

To continue talking to Dosu, mention @dosu.

pesmeriz commented 1 month ago

I tried the proposed solution and the problem remains.

Also, this should be made clearer in the docs. Otherwise people start using it, find the bug (or worse, they don't), and are left without a solution and with time wasted.