Sinaptik-AI / pandas-ai

Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
https://pandas-ai.com

"enforce_privacy": True seems not work? #1074

Closed gDanzel closed 3 months ago

gDanzel commented 3 months ago

System Info

OS version: win11
Python version: 3.10
The current version of pandasai being used: v2.0.23

🐛 Describe the bug

According to the doc:

"enforce_privacy: whether to enforce privacy. Defaults to False. If set to True, PandasAI will not send any data to the LLM, but only the metadata. By default, PandasAI will send 5 samples that are anonymized to improve the accuracy of the results. "

But when I ran the code below, it appears to send the sample data as well.

import os

import pandasai.pandas as pd
from pandasai import Agent
from pandasai.helpers import get_openai_callback
from pandasai.llm import OpenAI

from data.sample_dataframe import dataframe

# Get your FREE API key signing up at https://pandabi.ai.
# You can also configure it in your .env file.
llm = OpenAI()
agent = Agent([pd.DataFrame(dataframe)], config={"llm": llm, "enforce_privacy": True, "verbose": True})
with get_openai_callback() as cb:
    response = agent.chat("Calculate the average of the gdp of north american countries")
    print(response)
    print(cb)

The logs below show the prompts, which include the sample data.

C:\Users\Danzel\PycharmProjects\Tools\venv\Scripts\python.exe C:\Users\Danzel\PycharmProjects\Tools\yunai\privacyenforced.py 
2024-03-29 21:01:11 [INFO] Question: Calculate the average of the gdp of north american countries
2024-03-29 21:01:13 [INFO] Running PandasAI with openai LLM...
2024-03-29 21:01:13 [INFO] Prompt ID: acdd16cb-ac32-4234-9bfd-da095818859b
2024-03-29 21:01:13 [INFO] Executing Pipeline: GenerateChatPipeline
2024-03-29 21:01:13 [INFO] Executing Step 0: ValidatePipelineInput
2024-03-29 21:01:13 [INFO] Executing Step 1: CacheLookup
2024-03-29 21:01:13 [INFO] Executing Step 2: PromptGeneration
2024-03-29 21:01:19 [INFO] Using prompt: dfs[0]:
  name: null
  description: null
  type: pd.DataFrame
  rows: 10
  columns: 3
  schema:
    fields:
    - name: country
      type: object
      samples:
      - Australia
      - Japan
      - Germany
    - name: gdp
      type: object
      samples:
      - '9708343218'
      - '2035311369'
      - '9642580320'
    - name: happiness_index
      type: float64
      samples:
      - 7.16
      - 7.07
      - 7.22

Update this initial code:
```python
# TODO: import the required dependencies
import pandas as pd

# Write code here

# Declare result var: 
type (possible values "string", "number", "dataframe", "plot"). Examples: { "type": "string", "value": f"The highest salary is {highest_salary}." } or { "type": "number", "value": 125 } or { "type": "dataframe", "value": pd.DataFrame({...}) } or { "type": "plot", "value": "temp_chart.png" }
```

QUERY

Calculate the average of the gdp of north american countries

Variable dfs: list[pd.DataFrame] is already declared.

At the end, declare "result" variable as a dictionary of type and value.

If you are asked to plot a chart, use "matplotlib" for charts, save as png.

Generate python code and return full updated code:
2024-03-29 21:01:19 [INFO] Executing Step 3: CodeGenerator
2024-03-29 21:01:22 [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-03-29 21:01:22 [INFO] Prompt used: dfs[0]:
  name: null
  description: null
  type: pd.DataFrame
  rows: 10
  columns: 3
  schema:
    fields:

Update this initial code:

# TODO: import the required dependencies
import pandas as pd

# Write code here

# Declare result var: 
type (possible values "string", "number", "dataframe", "plot"). Examples: { "type": "string", "value": f"The highest salary is {highest_salary}." } or { "type": "number", "value": 125 } or { "type": "dataframe", "value": pd.DataFrame({...}) } or { "type": "plot", "value": "temp_chart.png" }

QUERY

Calculate the average of the gdp of north american countries

Variable dfs: list[pd.DataFrame] is already declared.

At the end, declare "result" variable as a dictionary of type and value.

If you are asked to plot a chart, use "matplotlib" for charts, save as png.

Generate python code and return full updated code:

2024-03-29 21:01:22 [INFO] Code generated:

# TODO: import the required dependencies
import pandas as pd

# Extract the GDP values for North American countries
north_american_countries = ['United States', 'Canada', 'Mexico']
gdp_values = []
for df in dfs:
    if 'country' in df.columns and 'gdp' in df.columns:
        for index, row in df.iterrows():
            if row['country'] in north_american_countries:
                gdp_values.append(float(row['gdp']))

# Calculate the average GDP of North American countries
average_gdp = sum(gdp_values) / len(gdp_values)

# Declare result variable
result = {
    "type": "number",
    "value": average_gdp
}

2024-03-29 21:01:22 [INFO] Executing Step 4: CachePopulation
2024-03-29 21:01:22 [INFO] Executing Step 5: CodeCleaning
2024-03-29 21:01:22 [INFO] Code running:

north_american_countries = ['United States', 'Canada', 'Mexico']
gdp_values = []
for df in dfs:
    if 'country' in df.columns and 'gdp' in df.columns:
        for index, row in df.iterrows():
            if row['country'] in north_american_countries:
                gdp_values.append(float(row['gdp']))
average_gdp = sum(gdp_values) / len(gdp_values)
result = {'type': 'number', 'value': average_gdp}

2024-03-29 21:01:22 [INFO] Executing Step 6: CodeExecution
2024-03-29 21:01:22 [INFO] Executing Step 7: ResultValidation
2024-03-29 21:01:22 [INFO] Answer: {'type': 'number', 'value': 10450942230528.0}
2024-03-29 21:01:22 [INFO] Executing Step 8: ResultParsing
10450942230528.0
Tokens Used: 510
Prompt Tokens: 359
Completion Tokens: 151
Total Cost (USD): $ 0.000406
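
To see exactly which values reached the prompt, the `samples:` entries can be pulled out of the logged text. This is only a minimal sketch, assuming the prompt keeps the YAML-like layout shown in the log above. `extract_samples` and `LOGGED_PROMPT` are hypothetical names used for illustration and are not part of pandasai.

```python
import re

# Excerpt copied from the "Using prompt" log entry in this issue.
LOGGED_PROMPT = """dfs[0]:
  name: null
  description: null
  type: pd.DataFrame
  rows: 10
  columns: 3
  schema:
    fields:
    - name: country
      type: object
      samples:
      - Australia
      - Japan
      - Germany
    - name: gdp
      type: object
      samples:
      - '9708343218'
      - '2035311369'
      - '9642580320'
    - name: happiness_index
      type: float64
      samples:
      - 7.16
      - 7.07
      - 7.22
"""


def extract_samples(prompt_text):
    """Return {column_name: [sample values as strings]} from a logged prompt."""
    samples = {}
    column = None
    in_samples = False
    for raw_line in prompt_text.splitlines():
        line = raw_line.strip()
        name_match = re.match(r"^- name:\s*(.+)$", line)
        if name_match:
            # A new field starts; following samples belong to it.
            column = name_match.group(1)
            samples[column] = []
            in_samples = False
        elif line == "samples:":
            in_samples = True
        elif in_samples and column is not None and line.startswith("- "):
            # Strip the list marker and any surrounding quotes.
            samples[column].append(line[2:].strip().strip("'\""))
        elif line and not line.startswith("- "):
            # Any other key (for example "type: object") ends a samples list.
            in_samples = False
    return samples


found = extract_samples(LOGGED_PROMPT)
for name, values in found.items():
    print(name, values)
```

All values come back as strings, so they need to be compared against the real data as strings too.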

dudesparsh commented 3 months ago

@gDanzel, can you check whether the data being sent is the same as the real data?

I tried to replicate this. As you mentioned, I also see 3 sample rows passed to the LLM. However, on closer inspection, the values in those rows are randomized for me: the numerical values don't match the real values in the data I passed in. Maybe the way pandas-ai enforces privacy is by randomizing the values before they are passed to the LLM. Let me know your findings.
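
The check described above can be sketched in plain Python: take the sample values seen in the prompt and test whether each one also occurs in the same column of the real data. This is a rough sketch rather than anything from pandasai. `leaked_samples` is a hypothetical helper, and the `real_columns` values are made-up placeholders, not the dataframe from this issue.

```python
def leaked_samples(samples, real_columns):
    """Return {column: [sample values that also occur in the real data]}."""
    leaks = {}
    for column, values in samples.items():
        # Compare as strings, since values in a logged prompt are text.
        real_values = {str(v) for v in real_columns.get(column, [])}
        hits = [v for v in values if str(v) in real_values]
        if hits:
            leaks[column] = hits
    return leaks


# Placeholder data for illustration only (an assumption, not the issue's data).
real_columns = {
    "country": ["United States", "Japan", "Canada"],
    "gdp": [19294482071552, 4380756541440, 1607402389504],
}

# Sample values as they appear in the logged prompt.
prompt_samples = {
    "country": ["Australia", "Japan", "Germany"],
    "gdp": ["9708343218", "2035311369", "9642580320"],
}

report = leaked_samples(prompt_samples, real_columns)
print(report)
```

With a real dataframe, `df.to_dict("list")` gives a dictionary in the shape expected for `real_columns`. A match in a low-cardinality column such as `country` does not by itself show that a real row was sent, since a randomized or shuffled value can still coincide with an existing one. Matches in numeric columns like `gdp` are the more telling signal, although float formatting can differ between the prompt and `str()`.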