Sinaptik-AI / pandas-ai

Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
https://pandas-ai.com

Issue with PandasAI: KeyError: '__import__' during Code Execution #1342

Open rakendd opened 2 weeks ago

rakendd commented 2 weeks ago

System Info

pandas 2.2.14 on Python 3.11.0, Ubuntu 22.04.4 LTS

๐Ÿ› Describe the bug

Hi everyone,

I'm running into an issue while using PandasAI to generate and execute code. During the execution step, I get the following error:

2024-08-29 08:00:29 [ERROR] Failed with error: Traceback (most recent call last):
  File "/home/rakend/anaconda3/envs/test_snowflakedata/lib/python3.11/site-packages/pandasai/pipelines/chat/code_execution.py", line 85, in execute
    result = self.execute_code(code_to_run, code_context)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rakend/anaconda3/envs/test_snowflakedata/lib/python3.11/site-packages/pandasai/pipelines/chat/code_execution.py", line 171, in execute_code
    exec(code, environment)
  File "<string>", line 12, in <module>
  File "pandas/_libs/tslibs/timestamps.pyx", line 1054, in pandas._libs.tslibs.timestamps._Timestamp.__repr__
  File "pandas/_libs/tslibs/timestamps.pyx", line 1496, in pandas._libs.tslibs.timestamps.Timestamp.strftime
KeyError: '__import__'

Code PandasAI generated and attempted to run

df = dfs[0]
df['EXECUTION_DATE'] = pd.to_datetime(df['EXECUTION_DATE'])
df['EFFICIENCY_PERCENTAGE'] = pd.to_numeric(df['EFFICIENCY_PERCENTAGE'], errors='coerce')
df['WAREHOUSE_USAGE_COST'] = pd.to_numeric(df['WAREHOUSE_USAGE_COST'], errors='coerce')
df['QUERY_EXECUTION_COST'] = pd.to_numeric(df['QUERY_EXECUTION_COST'], errors='coerce')
grouped_df = df.groupby(['EXECUTION_DATE', 'WAREHOUSE_NAME']).agg({'EFFICIENCY_PERCENTAGE': 'mean', 'WAREHOUSE_USAGE_COST': 'sum'}).reset_index()
filtered_df = grouped_df[grouped_df['WAREHOUSE_NAME'] == 'MEMBER_PROFILE_DE_WH_L']
filtered_df['WAREHOUSE_USAGE_COST_PCT_CHANGE'] = filtered_df['WAREHOUSE_USAGE_COST'].pct_change() * 100
filtered_df['EFFICIENCY_PERCENTAGE_CHANGE'] = filtered_df['EFFICIENCY_PERCENTAGE'].diff()
filtered_df['EFFICIENCY_PERCENTAGE_LABEL'] = 'Daily Average'
filtered_df['WAREHOUSE_USAGE_COST_LABEL'] = 'Total'
result = {'type': 'string', 'value': f"Filtered Data: {filtered_df.to_dict(orient='records')}"}

The error could be related to how filtered_df.to_dict(orient='records') is processed, particularly if there's a datetime field in the DataFrame that has to be converted to a string. The traceback points at line 12 of the generated code, which is the f-string on the last line: formatting the records calls the Timestamp repr, which in turn calls strftime. If something goes wrong inside strftime during this conversion, it could trigger the KeyError: '__import__'.

This is happening multiple times.

dosubot[bot] commented 2 weeks ago

Hi @rakendd 👋,

I'm here to help you with any bugs, questions, or contributions. Let's work together to solve the issue you're facing!

I found a similar issue that might be related to your problem: ImportError: __import__ not found, which is closed. It was referenced by another issue: exec() KeyError: '__import__'. The suggested solution there is to make sure the libraries you are using are whitelisted or allowed in your execution environment.

To resolve the KeyError: '__import__', you need to ensure that the code does not contain any references to dangerous built-ins, including __import__. The CodeCleaning class in PandasAI has a mechanism to check for such dangerous built-ins and remove them.

Here is the relevant part of the code that checks for these dangerous built-ins:

def _is_jailbreak(self, node: ast.stmt) -> bool:
    """
    Remove jailbreaks from the code to prevent malicious code execution.
    Args:
        node (ast.stmt): A code node to be checked.
    Returns (bool):
    """
    DANGEROUS_BUILTINS = ["__subclasses__", "__builtins__", "__import__"]

    node_str = ast.dump(node)

    return any(builtin in node_str for builtin in DANGEROUS_BUILTINS)

Ensure that the _is_jailbreak method is correctly identifying and removing any references to __import__ in the generated code. If the error persists, it might be due to the code not being properly cleaned before execution. You can debug by adding logging to check if the _is_jailbreak method is being called and if it is correctly identifying the dangerous built-ins.
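One way to do that check outside the pipeline is to run the same substring test on the generated code directly. The sketch below is a standalone copy of the logic quoted above, not PandasAI's actual class; the helper name flagged_statements and the shortened sample of the generated code are made up for illustration.

```python
import ast

# Same list as in the snippet quoted above.
DANGEROUS_BUILTINS = ["__subclasses__", "__builtins__", "__import__"]


def is_jailbreak(node: ast.stmt) -> bool:
    """Standalone copy of the substring check: dump the node and look for the names."""
    node_str = ast.dump(node)
    return any(builtin in node_str for builtin in DANGEROUS_BUILTINS)


def flagged_statements(source: str) -> list[str]:
    """Hypothetical helper: return the source of each top-level statement that is flagged."""
    tree = ast.parse(source)
    return [ast.unparse(stmt) for stmt in tree.body if is_jailbreak(stmt)]


# Shortened version of the generated code from the report (parsed only, never executed).
generated = """
df = dfs[0]
df['EXECUTION_DATE'] = pd.to_datetime(df['EXECUTION_DATE'])
result = {'type': 'string', 'value': f"Filtered Data: {df.to_dict(orient='records')}"}
"""

print(flagged_statements(generated))               # []
print(flagged_statements("x = __import__('os')"))  # ["x = __import__('os')"]
```

If the generated code comes back with nothing flagged, as it does here, the '__import__' in the error message is probably not coming from the generated code itself, and the cleaning step would not be the place to look.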

Additionally, ensure that the CodeCleaning step is properly integrated into the GenerateChatPipeline:

self.code_generation_pipeline = Pipeline(
    context=context,
    logger=logger,
    query_exec_tracker=self.query_exec_tracker,
    steps=[
        ValidatePipelineInput(),
        CacheLookup(),
        PromptGeneration(
            skip_if=self.is_cached,
            on_execution=on_prompt_generation,
        ),
        CodeGenerator(
            skip_if=self.is_cached,
            on_execution=on_code_generation,
        ),
        CachePopulation(skip_if=self.is_cached),
        CodeCleaning(
            skip_if=self.no_code,
            on_failure=self.on_code_cleaning_failure,
            on_retry=self.on_code_retry,
        ),
    ],
)

By ensuring that the _is_jailbreak method is functioning correctly and that the CodeCleaning step is properly integrated into the pipeline, you should be able to prevent the KeyError: '__import__' from occurring [1][2].


Blubbaa commented 1 week ago

I have the exact same issue. It seems to be triggered by modifications of datetime or other columns. The file referenced in the error, "pandas/_libs/tslibs/timestamps.pyx", contains this line, which I think could be the cause because of the local import: https://github.com/pandas-dev/pandas/blob/db1b8ab2fc1a65c2e520dc9ffb6390743a00d63b/pandas/_libs/tslibs/timestamps.pyx#L1612C9-L1612C34

However, I do not know if it's really the issue, nor do I know how to handle it. Can local imports also be allowed via the custom_whitelisted_dependencies parameter?
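To make the suspected mechanism concrete, here is a minimal sketch of what happens when code passed to exec() tries to import something while the environment's __builtins__ dict has no __import__ entry. It assumes the execution environment is restricted in roughly this way; run_sandboxed and both builtins dicts are hypothetical and are not PandasAI's actual setup.

```python
def run_sandboxed(code: str, builtins: dict) -> str:
    """Execute `code` with the given builtins dict and report the outcome as text."""
    environment = {"__builtins__": builtins}
    try:
        exec(code, environment)
    except (ImportError, KeyError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return "ok"


# Hypothetical restricted environment: no __import__ available at all.
restricted = {}
# Same idea, but with the real import hook exposed to the executed code.
permissive = {"__import__": __import__}

snippet = "import math\nvalue = math.sqrt(16)"

print(run_sandboxed(snippet, restricted))  # ImportError: __import__ not found
print(run_sandboxed(snippet, permissive))  # ok
```

A plain import statement fails with ImportError here, while the traceback in this issue shows KeyError: '__import__'. That difference would be consistent with the lookup happening in C code that reads '__import__' from the builtins dict directly rather than through an import statement, but I have not verified that this is the path taken inside strftime.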