Sinaptik-AI / pandas-ai

Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
https://pandas-ai.com

enforce_privacy does not work? #1145

Closed gDanzel closed 3 weeks ago

gDanzel commented 4 months ago

System Info

OS version: win11
Python version: 3.11
The current version of pandasai being used: 2.0.36

🐛 Describe the bug

The sample data appears in the prompt even when enforce_privacy is set to True.

The code is below:

import pandasai.pandas as pd
from pandasai import Agent
from pandasai.helpers import get_openai_callback
from pandasai.llm import OpenAI, GoogleGemini

from data.sample_dataframe import dataframe

llm = OpenAI()

agent = Agent([pd.DataFrame(dataframe)], config={"llm": llm, "enforce_privacy": True, "verbose": True})
with get_openai_callback() as cb:
    response = agent.chat("Get the top 3 GDP countries.")
    print(response)
    print(cb)

The printed prompt shows that the dataframe is still sent with sample data:

2024-05-04 15:08:41 [INFO] Question: Get the top 3 GDP countries.
2024-05-04 15:08:42 [INFO] Running PandasAI with openai LLM...
2024-05-04 15:08:42 [INFO] Prompt ID: 50302077-57f3-482a-a823-64e2be596f5d
2024-05-04 15:08:42 [INFO] Executing Pipeline: GenerateChatPipeline
2024-05-04 15:08:42 [INFO] Executing Step 0: ValidatePipelineInput
2024-05-04 15:08:42 [INFO] Executing Step 1: CacheLookup
2024-05-04 15:08:42 [INFO] Executing Step 2: PromptGeneration
2024-05-04 15:08:46 [INFO] Using prompt: <dataframe>
dfs[0]:10x3
country,gdp,happiness_index
Spain,19294482071552,6.38
Japan,14631844184064,7.23
China,3435817336832,7.22
</dataframe>

Update this initial code:
\```python
# TODO: import the required dependencies
import pandas as pd

# Write code here

# Declare result var: 
type (possible values "string", "number", "dataframe", "plot"). Examples: { "type": "string", "value": f"The highest salary is {highest_salary}." } or { "type": "number", "value": 125 } or { "type": "dataframe", "value": pd.DataFrame({...}) } or { "type": "plot", "value": "temp_chart.png" }

### QUERY Get the top 3 GDP countries.

Variable dfs: list[pd.DataFrame] is already declared.

At the end, declare "result" variable as a dictionary of type and value.

If you are asked to plot a chart, use "matplotlib" for charts, save as png.

Generate python code and return full updated code:
2024-05-04 15:08:46 [INFO] Executing Step 3: CodeGenerator
2024-05-04 15:08:49 [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-05-04 15:08:49 [INFO] Prompt used:

dfs[0]:10x3
country,gdp,happiness_index
Spain,19294482071552,6.38
Japan,14631844184064,7.23
China,3435817336832,7.22

Update this initial code:

# TODO: import the required dependencies
import pandas as pd

# Write code here

# Declare result var: 
type (possible values "string", "number", "dataframe", "plot"). Examples: { "type": "string", "value": f"The highest salary is {highest_salary}." } or { "type": "number", "value": 125 } or { "type": "dataframe", "value": pd.DataFrame({...}) } or { "type": "plot", "value": "temp_chart.png" }

QUERY

Get the top 3 GDP countries.

Variable dfs: list[pd.DataFrame] is already declared.

At the end, declare "result" variable as a dictionary of type and value.

If you are asked to plot a chart, use "matplotlib" for charts, save as png.

Generate python code and return full updated code:

2024-05-04 15:08:49 [INFO] Code generated:

            # TODO: import the required dependencies
import pandas as pd

# Write code here
top_3_gdp_countries = dfs[0].nlargest(3, 'gdp')

# Declare result var
result = {
    "type": "dataframe",
    "value": top_3_gdp_countries
}

2024-05-04 15:08:49 [INFO] Executing Step 4: CachePopulation
2024-05-04 15:08:49 [INFO] Executing Step 5: CodeCleaning
2024-05-04 15:08:49 [INFO] Code running:

top_3_gdp_countries = dfs[0].nlargest(3, 'gdp')
result = {'type': 'dataframe', 'value': top_3_gdp_countries}

2024-05-04 15:08:49 [INFO] Executing Step 6: CodeExecution
2024-05-04 15:08:49 [INFO] Executing Step 7: ResultValidation
2024-05-04 15:08:49 [INFO] Answer: {'type': 'dataframe', 'value':          country             gdp  happiness_index
0  United States  19294482071552             6.94
9          China  14631844184064             5.12
8          Japan   4380756541440             5.87}
2024-05-04 15:08:49 [INFO] Executing Step 8: ResultParsing
         country             gdp  happiness_index
0  United States  19294482071552             6.94
9          China  14631844184064             5.12
8          Japan   4380756541440             5.87
Tokens Used: 340
Prompt Tokens: 270
Completion Tokens: 70
Total Cost (USD): $ 0.000240

Process finished with exit code 0
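
One quick way to confirm a leak like the one in this log is to scan the logged prompt for cell values from the dataframe. The following is only a rough sketch: `find_leaked_values` is a hypothetical helper, not part of pandasai, and it does plain substring matching, so very short values (such as a single digit) can produce false positives.

```python
def find_leaked_values(prompt: str, rows: list[dict]) -> list[str]:
    """Return the cell values from `rows` that appear verbatim in `prompt`."""
    leaked = []
    for row in rows:
        for value in row.values():
            text = str(value)
            # Plain substring match: crude, but enough to spot sample rows.
            if text and text in prompt and text not in leaked:
                leaked.append(text)
    return leaked


# Prompt fragment copied from the log above.
prompt = """<dataframe>
dfs[0]:10x3
country,gdp,happiness_index
Spain,19294482071552,6.38
</dataframe>"""

# One sample row as it appears in that prompt.
rows = [{"country": "Spain", "gdp": 19294482071552, "happiness_index": 6.38}]

leaks = find_leaked_values(prompt, rows)
print(leaks)
```

With enforce_privacy behaving as expected, the list should come back empty, because only the column names would be present in the prompt.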

Hrishikesh-Dutta0078 commented 4 months ago

I'm facing the same issue. enforce_privacy worked up through v2.0.28.

patlac commented 4 months ago

I think this is because convert_df_to_csv() in pandasai/helpers/dataframe_serializer.py ignores the enforce_privacy config setting entirely: it checks neither for enforce_privacy nor for custom_head.

It simply adds the dataframe details unconditionally:

# Add dataframe details
dataframe_info += f"\ndfs[{extras['index']}]:{df.rows_count}x{df.columns_count}\n{df.to_csv()}"

Until this is properly fixed, I replaced the code above with:

# TEMP FIX: Do not add dataframe details
df_without_sample_data = pd.DataFrame(columns=df.pandas_df.columns)
dataframe_info += f"\ndfs[{extras['index']}]:{df.rows_count}x{df.columns_count}\n{df_without_sample_data.to_csv()}"
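
To illustrate the idea behind this temporary fix outside of pandasai, here is a minimal standalone sketch using only the standard library. The function `serialize_csv` and its parameters are assumptions made for illustration and are not the library's API; the point is only that the header row is always written, while sample rows are written only when privacy is not enforced.

```python
import csv
import io


def serialize_csv(columns, rows, index=0, enforce_privacy=False, total_rows=None):
    """Build a 'dfs[i]:RxC' block, leaving out sample rows when privacy is enforced."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)  # column names are always included
    if not enforce_privacy:
        writer.writerows(rows)  # sample data only when privacy is off
    n_rows = total_rows if total_rows is not None else len(rows)
    return f"dfs[{index}]:{n_rows}x{len(columns)}\n{buffer.getvalue()}"


columns = ["country", "gdp", "happiness_index"]
sample_rows = [("Spain", 19294482071552, 6.38)]

public = serialize_csv(columns, sample_rows, total_rows=10)
private = serialize_csv(columns, sample_rows, enforce_privacy=True, total_rows=10)
print(private)
```

In the private variant the model still sees the shape and column names, which is usually enough to generate code such as `nlargest(3, 'gdp')`.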

In contrast, convert_df_to_json() in pandasai/helpers/dataframe_serializer.py properly checks for enforce_privacy and custom_head.

Related: #1147

patlac commented 4 months ago

After some more digging, it seems you can get enforce_privacy and custom_head to work by forcing the YML/JSON serialization. You just need to specify field descriptions.

If you provide field descriptions, convert_df_to_yml() will be used:

# If field descriptions are added always use YML. Other formats don't support field descriptions yet
if self.field_descriptions or self.connector_relations:
    serializer = DataframeSerializerType.YML

...and then:

    def serialize(
        self,
        df: pd.DataFrame,
        extras: dict = None,
        type_: DataframeSerializerType = DataframeSerializerType.YML,
    ) -> str:
        if type_ == DataframeSerializerType.YML:
            return self.convert_df_to_yml(df, extras)
        elif type_ == DataframeSerializerType.JSON:
            return self.convert_df_to_json_str(df, extras)
        elif type_ == DataframeSerializerType.SQL:
            return self.convert_df_sql_connector_to_str(df, extras)
        else:
            return self.convert_df_to_csv(df, extras)

convert_df_to_yml() will serialize the field descriptions in YML and internally use convert_df_to_json() to do the rest (respecting enforce_privacy and custom_head).
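
The selection rule described above can be sketched in isolation as follows. This is a simplified stand-in, not the library's code: `SerializerType` and `pick_serializer` are hypothetical names that only mirror the override shown in the snippets (field descriptions or connector relations force YML, otherwise the configured format is kept).

```python
from enum import Enum


class SerializerType(Enum):
    """Hypothetical stand-in for pandasai's DataframeSerializerType."""
    CSV = "csv"
    YML = "yml"
    JSON = "json"
    SQL = "sql"


def pick_serializer(configured, field_descriptions=None, connector_relations=None):
    """Return YML whenever field descriptions or relations exist, else the configured type."""
    if field_descriptions or connector_relations:
        return SerializerType.YML
    return configured


with_descriptions = pick_serializer(SerializerType.CSV, {"gdp": "Gross domestic product"})
without_descriptions = pick_serializer(SerializerType.CSV)
print(with_descriptions, without_descriptions)
```

Under this rule, supplying any non-empty field descriptions is enough to steer serialization away from the CSV path that ignores enforce_privacy.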