Closed gDanzel closed 3 weeks ago
Facing the same issue. enforce_privacy worked up to v2.0.28.
I think it's because pandasai/helpers/dataframe_serializer.py -> convert_df_to_csv() ignores the enforce_privacy config setting entirely: it doesn't check for it, nor does it check for custom_head. It just adds the details:
# Add dataframe details
dataframe_info += f"\ndfs[{extras['index']}]:{df.rows_count}x{df.columns_count}\n{df.to_csv()}"
Until this is properly fixed, I replaced the code above with:
# TEMP FIX: Do not add dataframe details
df_without_sample_data = pd.DataFrame(columns=df.pandas_df.columns)
dataframe_info += f"\ndfs[{extras['index']}]:{df.rows_count}x{df.columns_count}\n{df_without_sample_data.to_csv()}"
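To illustrate why the temp fix hides the sample rows, here is a minimal standalone sketch using plain pandas. The sample data and variable names are made up for illustration and are not taken from pandasai itself; the point is only that a DataFrame built from the column labels alone serializes to a header with no rows.

```python
import pandas as pd

# Hypothetical sample data, standing in for the real dataframe.
df = pd.DataFrame(
    {
        "country": ["United States", "China", "Japan"],
        "gdp": [19294482071552, 14631844184064, 4380756541440],
    }
)

# Build an empty frame that keeps only the column labels; no rows are copied.
header_only = pd.DataFrame(columns=df.columns)

full_csv = df.to_csv()             # contains the header and every row
header_csv = header_only.to_csv()  # contains the header line only

print(header_csv)
```

The column names still reach the prompt, so the model knows the schema, but none of the cell values do.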
In contrast, pandasai/helpers/dataframe_serializer.py -> convert_df_to_json() properly checks for enforce_privacy and custom_head.
Related: #1147
After some more digging, it seems you can get enforce_privacy and custom_head to work by forcing the YML/JSON serialization; you just need to specify field descriptions.
If you provide field descriptions, convert_df_to_yml() will be used...
# If field descriptions are added always use YML. Other formats don't support field descriptions yet
if self.field_descriptions or self.connector_relations:
    serializer = DataframeSerializerType.YML
..and then...
def serialize(
    self,
    df: pd.DataFrame,
    extras: dict = None,
    type_: DataframeSerializerType = DataframeSerializerType.YML,
) -> str:
    if type_ == DataframeSerializerType.YML:
        return self.convert_df_to_yml(df, extras)
    elif type_ == DataframeSerializerType.JSON:
        return self.convert_df_to_json_str(df, extras)
    elif type_ == DataframeSerializerType.SQL:
        return self.convert_df_sql_connector_to_str(df, extras)
    else:
        return self.convert_df_to_csv(df, extras)
convert_df_to_yml() will serialize the field descriptions in YML and internally use convert_df_to_json() to do the rest (respecting enforce_privacy and custom_head).
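The behaviour described above can be modelled with a small toy sketch. This is not pandasai code: SerializerType, pick_serializer and the simplified serialize function are hypothetical stand-ins, and the "YML" output here is just plain text. It only mirrors the two points reported in this thread, namely that field descriptions force the YML path and that only that path honours enforce_privacy.

```python
from enum import Enum


class SerializerType(Enum):  # hypothetical stand-in for DataframeSerializerType
    CSV = "csv"
    YML = "yml"


def pick_serializer(configured, field_descriptions=None, connector_relations=None):
    """Mirror the quoted rule: descriptions or relations always force YML."""
    if field_descriptions or connector_relations:
        return SerializerType.YML
    return configured


def serialize(columns, rows, type_, enforce_privacy=False):
    """Toy serializer: only the YML path checks enforce_privacy."""
    header = ",".join(columns)
    if type_ is SerializerType.YML and enforce_privacy:
        return header  # sample rows are withheld
    body = "\n".join(",".join(str(value) for value in row) for row in rows)
    return header + "\n" + body


columns = ["country", "gdp"]
rows = [("United States", 19294482071552)]

# Without field descriptions the configured CSV serializer is kept.
csv_type = pick_serializer(SerializerType.CSV)
# With field descriptions the choice is overridden to YML.
yml_type = pick_serializer(
    SerializerType.CSV, field_descriptions={"gdp": "Gross domestic product"}
)

leaky = serialize(columns, rows, csv_type, enforce_privacy=True)
private = serialize(columns, rows, yml_type, enforce_privacy=True)
```

In this model, leaky still contains the row data even though enforce_privacy is True, while private contains only the column names, which matches the workaround of adding field descriptions.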
System Info
OS version: win11
Python version: 3.11
The current version of pandasai being used: 2.0.36
🐛 Describe the bug
The sample data appears in the prompt even when enforce_privacy is set to True.
The Code below:
The printed prompt shows that the dataframe still contains data:
### QUERY Get the top 3 GDP countries.
Variable dfs: list[pd.DataFrame] is already declared.
At the end, declare "result" variable as a dictionary of type and value.
If you are asked to plot a chart, use "matplotlib" for charts, save as png.
Generate python code and return full updated code:
2024-05-04 15:08:46 [INFO] Executing Step 3: CodeGenerator
2024-05-04 15:08:49 [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-05-04 15:08:49 [INFO] Prompt used:
Update this initial code:
QUERY
Get the top 3 GDP countries.
Variable dfs: list[pd.DataFrame] is already declared.
At the end, declare "result" variable as a dictionary of type and value.
If you are asked to plot a chart, use "matplotlib" for charts, save as png.
Generate python code and return full updated code:
2024-05-04 15:08:49 [INFO] Code generated:
2024-05-04 15:08:49 [INFO] Executing Step 4: CachePopulation
2024-05-04 15:08:49 [INFO] Executing Step 5: CodeCleaning
2024-05-04 15:08:49 [INFO] Code running:
2024-05-04 15:08:49 [INFO] Executing Step 6: CodeExecution
2024-05-04 15:08:49 [INFO] Executing Step 7: ResultValidation
2024-05-04 15:08:49 [INFO] Answer: {'type': 'dataframe', 'value':          country             gdp  happiness_index
0  United States  19294482071552             6.94
9          China  14631844184064             5.12
8          Japan   4380756541440             5.87}
2024-05-04 15:08:49 [INFO] Executing Step 8: ResultParsing
         country             gdp  happiness_index
0  United States  19294482071552             6.94
9          China  14631844184064             5.12
8          Japan   4380756541440             5.87
Tokens Used: 340
Prompt Tokens: 270
Completion Tokens: 70
Total Cost (USD): $ 0.000240
Process finished with exit code 0