Sinaptik-AI / pandas-ai

Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
https://pandas-ai.com

Skill name is not defined #1294

Closed WojtAcht closed 2 weeks ago

WojtAcht commented 1 month ago

System Info

OS version: macOS 14.5
Python version: Python 3.10.7
The current version of pandasai being used: 2.2.12

🐛 Describe the bug

Bug: Skill Calculations Fail in PandasAI

Issue Description

Skills that perform calculations are failing with a NameError: name '<skill>' is not defined error. This occurs because the _extract_fix_dataframe_redeclarations method executes code in an environment that lacks skill definitions.

Root Cause

The _extract_fix_dataframe_redeclarations method uses an environment created by get_environment(), which does not include skill definitions:

def _extract_fix_dataframe_redeclarations(
    self, node: ast.AST, code_lines: list[str]
) -> ast.AST:
    # ...
    code = "\n".join(code_lines)
    env = get_environment(self._additional_dependencies)
    env["dfs"] = copy.deepcopy(self._get_originals(self._dfs))
    exec(code, env)
    # ...

The get_environment() function returns a dictionary with pandas, matplotlib, numpy, and some whitelisted builtins, but no skills:

def get_environment(additional_deps: List[dict]) -> dict:
    return {
        "pd": pd,
        "plt": plt,
        "np": np,
        # Additional dependencies and whitelisted builtins...
    }
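
The failure mode can be reproduced without PandasAI or an LLM. The following is a minimal sketch, not the library's code: `run_in_env` is a hypothetical helper, and `calculate_salary_betas` here is a simplified stand-in for the skill in the report that uses the standard library instead of numpy.

```python
import statistics


def run_in_env(code: str, env: dict) -> dict:
    """Execute code against an explicit globals dict and return that dict."""
    exec(code, env)
    return env


def calculate_salary_betas(salaries):
    # Stand-in for the @skill-decorated function (hypothetical, stdlib only).
    return statistics.quantiles(salaries, n=4, method="inclusive")


generated = "result = calculate_salary_betas([5000, 6000, 4500, 7000, 5500])"

# Environment without the skill: the name cannot be resolved inside exec().
try:
    run_in_env(generated, {})
    error_message = None
except NameError as exc:
    error_message = str(exc)

# Environment with the skill registered under its function name.
env_with_skill = run_in_env(
    generated, {"calculate_salary_betas": calculate_salary_betas}
)

print(error_message)
print(env_with_skill["result"])
```

The only difference between the two runs is whether the function name is a key in the dictionary passed to `exec`.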

Contrast with Correct Implementation

In contrast, the execute_code method in the CodeExecution class correctly adds skills to the environment:

def execute_code(self, code: str, context: ExecutionContext):
    # ...
    if context.skills_manager.used_skills:
        for skill_func_name in context.skills_manager.used_skills:
            skill = context.skills_manager.get_skill_by_func_name(skill_func_name)
            environment[skill_func_name] = skill
    # ...

Proposed Solution

To fix this issue, the _extract_fix_dataframe_redeclarations method should be updated to include skill definitions in its execution environment, similar to the execute_code method.
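
One possible shape for that change is sketched below. It is an assumption-laden illustration rather than the actual patch: `StubSkillsManager` and `add_skills_to_env` are hypothetical names, and the stub exposes only the two members that the `execute_code` snippet above relies on (`used_skills` and `get_skill_by_func_name`).

```python
from typing import Callable, Dict, List


class StubSkillsManager:
    """Hypothetical stand-in with only the members execute_code uses."""

    def __init__(self, skills: Dict[str, Callable]):
        self._skills = skills
        self.used_skills: List[str] = list(skills)

    def get_skill_by_func_name(self, name: str) -> Callable:
        return self._skills[name]


def add_skills_to_env(env: dict, skills_manager) -> dict:
    """Return a copy of env with each used skill registered by function name."""
    env = dict(env)
    for name in skills_manager.used_skills:
        env[name] = skills_manager.get_skill_by_func_name(name)
    return env


def double(values):
    # Trivial skill used only for this illustration.
    return [v * 2 for v in values]


manager = StubSkillsManager({"double": double})
env = add_skills_to_env({"dfs": []}, manager)
exec("result = double([1, 2, 3])", env)
print(env["result"])
```

Factoring the loop into one helper would let both the cleaning step and the execution step build their environments the same way, which is the consistency this issue asks for.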

Example

import os

import pandas as pd

from pandasai import Agent
from pandasai.skills import skill
from pandasai.llm import OpenAI

employees_data = {
    "EmployeeID": [1, 2, 3, 4, 5],
    "Name": ["John", "Emma", "Liam", "Olivia", "William"],
    "Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
}

salaries_data = {
    "EmployeeID": [1, 2, 3, 4, 5],
    "Salary": [5000, 6000, 4500, 7000, 5500],
}

employees_df = pd.DataFrame(employees_data)
salaries_df = pd.DataFrame(salaries_data)

# Add function docstring to give more context to model
@skill
def plot_salaries(names: list[str], salaries: list[int]):
    """
    Displays the bar chart having name on x-axis and salaries on y-axis using matplotlib
    Args:
        names (list[str]): Employees' names
        salaries (list[int]): Salaries
    """
    import matplotlib.pyplot as plt

    plt.bar(names, salaries)
    plt.xlabel("Employee Name")
    plt.ylabel("Salary")
    plt.title("Employee Salaries")
    plt.xticks(rotation=45)

@skill
def calculate_salary_betas(salaries: list[int]) -> list[float]:
    """
    Calculates the betas (25th, 50th and 75th percentiles) of salaries.

    Args:
        salaries (list[int]): List of employee salaries

    Returns:
        list[float]: A list containing the 25th, 50th, and 75th percentiles
    """
    import numpy as np

    percentiles = np.percentile(salaries, [25, 50, 75])
    return percentiles.tolist()

# This example passes an OpenAI LLM explicitly. Unless you choose an LLM, PandasAI uses BambooLLM by default.
# You can get a free API key by signing up at https://pandabi.ai (you can also configure it in your .env file)
llm = OpenAI(
    api_token=os.getenv("OPENAI_API_KEY"), temperature=0, seed=26, model="gpt-4o"
)
agent = Agent(
    [employees_df, salaries_df],
    config={"llm": llm, "enforce_privacy": True},
    memory_size=10,
)
agent.add_skills(plot_salaries, calculate_salary_betas)

# Chat with the agent
response = agent.chat("Create a table with salary betas")

Error:

Traceback (most recent call last):
  File "pandas-ai/pandasai/pipelines/chat/code_cleaning.py", line 95, in execute
    code_to_run = self.get_code_to_run(input, code_context)
  File "pandas-ai/pandasai/pipelines/chat/code_cleaning.py", line 152, in get_code_to_run
    code_to_run = self._clean_code(code, context)
  File "pandas-ai/pandasai/pipelines/chat/code_cleaning.py", line 515, in _clean_code
    self._extract_fix_dataframe_redeclarations(node, clean_code_lines)
  File "pandas-ai/pandasai/pipelines/chat/code_cleaning.py", line 420, in _extract_fix_dataframe_redeclarations
    exec(code, env)
  File "<string>", line 5, in <module>
NameError: name 'calculate_salary_betas' is not defined

dosubot[bot] commented 1 month ago

Hey @WojtAcht! I'm here to help you out with any issues you're facing. Let's tackle this bug together and get it resolved!

To resolve the NameError: name 'calculate_salary_betas' is not defined error, you need to update the _extract_fix_dataframe_redeclarations method to include skill definitions in its execution environment. Here is the updated method:

def _extract_fix_dataframe_redeclarations(
    self, node: ast.AST, code_lines: list[str]
) -> ast.AST:
    if isinstance(node, ast.Assign):
        target_names, is_slice, target = self._get_target_names(node.targets)

        if target_names and self._check_is_df_declaration(node):
            # Construct dataframe from node
            code = "\n".join(code_lines)
            env = get_environment(self._additional_dependencies)
            env["dfs"] = copy.deepcopy(self._get_originals(self._dfs))

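            # Include skills in the environment.
            # Assumption: the class exposes self.skills_manager and its .skills is a
            # name -> function mapping; execute_code instead iterates used_skills
            # and calls get_skill_by_func_name.
            for skill_name, skill_func in self.skills_manager.skills.items():
                env[skill_name] = skill_func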

            exec(code, env)

            df_generated = (
                env[target_names[0]][target.slice.value]
                if is_slice
                else env[target_names[0]]
            )

            # check if exists in provided dfs
            for index, df in enumerate(self._dfs):
                head = df.get_head()
                if head.shape == df_generated.shape and head.columns.equals(
                    df_generated.columns
                ):
                    target_var = (
                        ast.Subscript(
                            value=ast.Name(id=target_names[0], ctx=ast.Load()),
                            slice=target.slice,
                            ctx=ast.Store(),
                        )
                        if is_slice
                        else ast.Name(id=target_names[0], ctx=ast.Store())
                    )
                    return ast.Assign(
                        targets=[target_var],
                        value=ast.Subscript(
                            value=ast.Name(id="dfs", ctx=ast.Load()),
                            slice=ast.Index(value=ast.Num(n=index)),
                            ctx=ast.Load(),
                        ),
                    )
    return None

This modification ensures that the skills managed by SkillsManager are included in the execution environment, thus avoiding the NameError [1].
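
For readers following the AST part of the method, here is a small standalone sketch of the node it returns, namely an assignment of the form `<name> = dfs[<index>]`. It assumes Python 3.9 or newer, where `ast.Index` and `ast.Num` are deprecated and the subscript takes an `ast.Constant` directly; `make_dfs_assignment` is a hypothetical helper, not part of PandasAI.

```python
import ast


def make_dfs_assignment(target_name: str, index: int) -> ast.Assign:
    """Build the AST for `<target_name> = dfs[<index>]`."""
    node = ast.Assign(
        targets=[ast.Name(id=target_name, ctx=ast.Store())],
        value=ast.Subscript(
            value=ast.Name(id="dfs", ctx=ast.Load()),
            slice=ast.Constant(value=index),
            ctx=ast.Load(),
        ),
    )
    # Fill in line/column info so the node can be unparsed or compiled.
    return ast.fix_missing_locations(node)


source = ast.unparse(make_dfs_assignment("df", 0))
print(source)
```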
