Data sharing on log publish

pesmeriz commented 2 days ago

System Info

OS version: MacOS Sequoia 15.0

My pyproject.toml

[project]
name = "pandasai-benchmark"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "numpy==1.26.4",
    "pandasai>=2.2.14",
    "python-decouple>=3.8",
    "pyyaml>=6.0.2",
]

🐛 Describe the bug

If PANDASAI_API_KEY is set as an environment variable, query logs are published to the configured remote server, which can expose sensitive data.

from pandasai import SmartDataframe, Agent
from pandasai.llm.local_llm import LocalLLM

import pandas as pd
from pandasai.llm import BambooLLM
from decouple import config as cfg
import os

local = False 
bypass = False

if local:
    llm = LocalLLM(api_base="http://localhost:11434/v1", model="qwen2.5-coder:latest")
else:
    if bypass:
        llm = BambooLLM(api_key=cfg("BAMBOO_API_KEY"))
    else:
        os.environ["PANDASAI_API_KEY"] = cfg("BAMBOO_API_KEY")
        llm = BambooLLM()

employee_head = pd.DataFrame(
    [
        [1, "Pedro", 600, 1],
        [2, "Tone", 1200, 2],
        [3, "Turo", 900, 3],
        [4, "ks", 750, 4],
        [5, "none", 950, 2],
    ],
    columns=["id", "name", "salary", "department_id"],
)

employee = SmartDataframe(
    pd.DataFrame(
        [
            [1, "John Dow", 60000, 1],
            [2, "Jane Smith", 120000, 2],
            [3, "Taro Yamada", 90000, 3],
            [4, "Maria Silva", 75000, 4],
            [5, "Michal Johnson", 95000, 2],
        ],
        columns=["id", "name", "salary", "department_id"],
    ),
    custom_head=employee_head,
    config={"custom_head": employee_head, "llm": llm, "enable_cache": False, "enforce_privacy": True},
)

department_head = pd.DataFrame(
    [
        [1, "HR", 1000, 1],
        [2, "E", 5000, 2],
        [3, "M", 2000, 3],
        [4, "S", 3000, 4],
    ],
    columns=["id", "name", "budget", "country_id"],
)

department = SmartDataframe(
    pd.DataFrame(
        [
            [1, "Human Resources", 100000, 1],
            [2, "Engineering", 500000, 2],
            [3, "Marketing", 200000, 3],
            [4, "Sales", 300000, 4],
        ],
        columns=["id", "name", "budget", "country_id"],
    ),
    custom_head=department_head,
    config={"custom_head": department_head, "llm": llm, "enable_cache": False, "enforce_privacy": True},
)

country_head = pd.DataFrame(
    [
        [1, "United States"],
        [2, "Germany"],
        [3, "Japan"],
        [4, "Brazil"],
    ],
    columns=["id", "name"],
)

country = SmartDataframe(
    pd.DataFrame(
        [
            [1, "United States"],
            [2, "Germany"],
            [3, "Japan"],
            [4, "Brazil"],
        ],
        columns=["id", "name"],
    ),
    custom_head=country_head,
    config={"custom_head": country_head, "llm": llm, "enable_cache": False, "enforce_privacy": True},
)

agent = Agent(
    [country, employee, department],
    config={
        "llm": llm,
        "verbose": True,
        "enforce_privacy": True,
        "enable_cache": False,
    },
)
response = agent.chat(
    "show me the employees with a salary above 80k and their respective salary"
    # "pivot table of the average salary of employees cross department with country"
    # "Who earns the least out of the Germany?"
)
print(response)

To see what is sent, trace the log_data variable in publish() in /pandasai/helpers/query_exec_tracker.py:

def publish(self) -> None:
    """
    Publish Query Summary to remote logging server
    """
    api_key = None
    server_url = None

    if self._server_config is None:
        server_url = os.environ.get("PANDASAI_API_URL", "https://api.domer.ai")
        api_key = os.environ.get("PANDASAI_API_KEY") or None
    else:
        server_url = self._server_config.get(
            "server_url", os.environ.get("PANDASAI_API_URL", "https://api.domer.ai")
        )
        api_key = self._server_config.get(
            "api_key", os.environ.get("PANDASAI_API_KEY")
        )

    if api_key is None:
        return

    try:
        log_data = {
            "json_log": self.get_summary(),
        }

        encoder = CustomEncoder()
        ecoded_json_str = encoder.encode(log_data)

        headers = {"Authorization": f"Bearer {api_key}"}
        response = requests.post(
            f"{server_url}/api/log/add",
            json=json.loads(ecoded_json_str),
            headers=headers,
        )
        if response.status_code != 200:
            raise Exception(response.text)

        json_data = json.loads(response.text)

        if "data" in json_data and json_data["data"] is not None:
            self._last_log_id = json_data["data"]["log_id"]

    except Exception as e:
        print(f"Exception in APILogger: {e}")
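The key-resolution branch at the top of publish() can be isolated to see when a log would be sent. The sketch below is a standalone re-creation of that branch only, not pandasai code: `resolve_publish_target` and `would_publish` are hypothetical names, and the `environ` parameter is an assumption added so the logic can be checked without touching the real environment.

```python
import os
from typing import Mapping, Optional, Tuple

DEFAULT_URL = "https://api.domer.ai"  # default taken from the excerpt above


def resolve_publish_target(
    server_config: Optional[dict],
    environ: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Hypothetical helper mirroring the first branch of publish()."""
    env = os.environ if environ is None else environ
    env_url = env.get("PANDASAI_API_URL", DEFAULT_URL)
    if server_config is None:
        # No explicit server config: fall back to the environment entirely.
        return env_url, env.get("PANDASAI_API_KEY") or None
    # Explicit config wins, but a missing "api_key" still falls back to the env var.
    return (
        server_config.get("server_url", env_url),
        server_config.get("api_key", env.get("PANDASAI_API_KEY")),
    )


def would_publish(
    server_config: Optional[dict],
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """publish() returns early only when the resolved key is None."""
    _, api_key = resolve_publish_target(server_config, environ)
    return api_key is not None
```

Under this reading, the presence of PANDASAI_API_KEY in the environment is enough to trigger the POST to /api/log/add, regardless of the other options in the reproduction script.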

dosubot[bot] commented 2 days ago

To prevent sensitive data from being shared when setting the PANDASAI_API_KEY environment variable, you can store the API key in a .env file and use a library like python-dotenv to load it. This way, the API key is not hardcoded in your script.

Here is an example of how you can do this:

Install the python-dotenv library if you haven't already:
```
pip install python-dotenv
```
Create a .env file in your project directory and add your API key to it:
```
BAMBOO_API_KEY=YOUR_API_KEY
```

Modify your script to load the environment variables from the .env file:

import os
from dotenv import load_dotenv
from pandasai import SmartDataframe, Agent
from pandasai.llm.local_llm import LocalLLM
from pandasai.llm import BambooLLM
import pandas as pd

# Load environment variables from .env file
load_dotenv()

local = False 
bypass = False

if local:
    llm = LocalLLM(api_base="http://localhost:11434/v1", model="qwen2.5-coder:latest")
else:
    if bypass:
        llm = BambooLLM(api_key=os.getenv("BAMBOO_API_KEY"))
    else:
        os.environ["PANDASAI_API_KEY"] = os.getenv("BAMBOO_API_KEY")
        llm = BambooLLM()

# Your existing DataFrame and SmartDataframe setup code here

agent = Agent(
    [country, employee, department],
    config={
        "llm": llm,
        "verbose": True,
        "enforce_privacy": True,
        "enable_cache": False,
    },
)
response = agent.chat(
    "show me the employees with a salary above 80k and their respective salary"
)
print(response)

By using this approach, you ensure that the sensitive API key is not exposed in your codebase ^[1].

Additionally, to disable logging entirely, you can set the save_logs option to False in your configuration:

config = {
    "save_logs": False
}

employee = SmartDataframe(
    pd.DataFrame(
        [
            [1, "John Dow", 60000, 1],
            [2, "Jane Smith", 120000, 2],
            [3, "Taro Yamada", 90000, 3],
            [4, "Maria Silva", 75000, 4],
            [5, "Michal Johnson", 95000, 2],
        ],
        columns=["id", "name", "salary", "department_id"],
    ),
    custom_head=employee_head,
    config={"custom_head": employee_head, "llm": llm, "enable_cache": False, "enforce_privacy": True, "save_logs": False},
)

This will disable logging for the SmartDataframe instance by ensuring that no log files are created ^[2].
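As described, save_logs concerns local log files, whereas the publish() excerpt quoted earlier in this thread is gated only on the API key. A minimal sketch of one possible workaround follows, assuming the key is passed to BambooLLM explicitly (the `bypass = True` branch of the original script) and that no server configuration supplies a key elsewhere. `env_var_removed` is a hypothetical helper, not part of pandasai.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def env_var_removed(name: str) -> Iterator[None]:
    """Temporarily remove `name` from os.environ; restore the prior state on exit."""
    saved = os.environ.pop(name, None)
    try:
        yield
    finally:
        # Drop anything set inside the block, then put the original value back.
        os.environ.pop(name, None)
        if saved is not None:
            os.environ[name] = saved
```

Wrapping the call, for example `with env_var_removed("PANDASAI_API_KEY"): response = agent.chat(...)`, would keep the variable out of the environment while the query runs. Whether this fully prevents publishing depends on how the installed pandasai version resolves its server configuration, so it should be verified against the actual network traffic.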

To continue talking to Dosu, mention @dosu.

pesmeriz commented 1 day ago

Again, this doesn't work. I would urge the maintainers of this app to please review the support they are providing for the issues presented. Many of them have gone several months with no response, or with bad instructions generated by a bot. There are several bugs affecting data sharing which, without proper disclosure in the docs, might lead to problems, especially in the EU. @dosu please tag the maintainers of this lib on this.

dosubot[bot] commented 1 day ago

This is a tough nut to crack, could really use your help on this, @gventuri.


Sinaptik-AI / pandas-ai

Data sharing on log publish #1375
