Sinaptik-AI / pandas-ai

Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
https://pandas-ai.com

Middleware for displaying charts in a Flask or web app #304

Closed: Terrevue closed this issue 4 months ago

Terrevue commented 1 year ago

Unable to display plot in Flask app. See below for the suggested fix. Feel free to use this and close this issue. Just placing it here in case someone else needs it.

Discussed in https://github.com/gventuri/pandas-ai/discussions/303

Originally posted by **Terrevue** June 21, 2023

Wasn't sure the best place to post this, so I hope here is OK. I need to display the charts in a Flask app, and to do that I created a middleware that converts the `plt` code to output the base64 image directly to the browser. Hope this can help someone else. Make sure `save_charts` is set to `False` when using this.

```python
"""
Flask Charts Middleware class

Middleware to handle the charts for Flask apps in PandasAI.
"""
from pandasai.middlewares.base import Middleware
import ast
from ast import *
import logging


class FlaskChartsMiddleware(Middleware):
    """Flask Charts Middleware class"""

    def run(self, code: str) -> str:
        """
        Run the middleware to remove issues with displaying charts in PandasAI.

        Returns:
            str: Modified code
        """
        if "plt.show()" in code:
            # PrintAllCalls().visit(ast.parse(code))
            tree = ast.parse(code)
            transformer = TransformMatplotlibAST()
            new_tree = transformer.visit(tree)
            code = ast.unparse(new_tree)
            # transformer.logger.info(code)
        return code


class TransformMatplotlibAST(NodeTransformer):
    def __init__(self):
        self.encountered_savefig = False
        self.savefig_added = False
        self.logger = logging.getLogger('flask_charts')
        self.logger.setLevel(logging.DEBUG)  # Set the minimum logged level to DEBUG

        # Create a file handler
        handler = logging.FileHandler('flask_charts.log')
        handler.setLevel(logging.DEBUG)  # Set the minimum logged level to DEBUG

        # Create a formatter and add it to the handler
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        handler.setFormatter(formatter)

        # Add the handler to the logger
        self.logger.addHandler(handler)
        super().__init__()

    def visit_Expr(self, node):
        call = node.value
        if isinstance(call, Call) and isinstance(call.func, Attribute):
            attr = call.func
            if isinstance(attr.value, Name) and attr.value.id == "plt":
                if attr.attr == "plot":
                    plot_args = ", ".join(ast.unparse(arg) for arg in call.args)
                    return parse("\nfig = Figure()\nax = fig.subplots()\nax.plot({})".format(plot_args)).body
                elif attr.attr == "title":
                    title_args = ast.unparse(call.args[0]) if call.args else ''
                    return parse("ax.set_title({})".format(title_args)).body
                elif attr.attr == "xlabel":
                    xlabel_args = ast.unparse(call.args[0]) if call.args else ''
                    return parse("ax.set_xlabel({})".format(xlabel_args)).body
                elif attr.attr == "ylabel":
                    ylabel_args = ast.unparse(call.args[0]) if call.args else ''
                    return parse("ax.set_ylabel({})".format(ylabel_args)).body
                elif attr.attr == "savefig":
                    self.encountered_savefig = True
                elif attr.attr == "show" or attr.attr == "close":
                    if not self.encountered_savefig:
                        self.encountered_savefig = True
        return node

    def visit_Module(self, node):
        # add necessary import statements at the start of the module
        import_stmts = parse("import base64\nfrom io import BytesIO\nfrom matplotlib.figure import Figure").body
        transformed_body = [self.visit(n) for n in node.body]
        if self.encountered_savefig and not self.savefig_added:
            transformed_body.extend(parse("\nbuf = BytesIO()\nfig.savefig(buf, format='png')\ndata = base64.b64encode(buf.getbuffer()).decode('ascii')\nprint(f'')").body)
            self.savefig_added = True
        return Module(body=import_stmts + transformed_body, type_ignores=node.type_ignores)


class PrintAllCalls(NodeVisitor):
    def visit_Call(self, node):
        transformer = TransformMatplotlibAST()
        transformer.logger.info(ast.unparse(node))
        # print(ast.unparse(node))
        self.generic_visit(node)
```
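The core rewriting trick the middleware relies on can be seen in isolation in a stripped-down sketch: parse the generated source, replace `plt.*` calls with axes-method calls via an `ast.NodeTransformer`, and unparse back to source (`ast.unparse` requires Python 3.9+; `TitleRewriter` is a hypothetical name used only for illustration):

```python
import ast

class TitleRewriter(ast.NodeTransformer):
    """Rewrite bare `plt.title(...)` expressions into `ax.set_title(...)`."""

    def visit_Expr(self, node):
        call = node.value
        if (isinstance(call, ast.Call)
                and isinstance(call.func, ast.Attribute)
                and isinstance(call.func.value, ast.Name)
                and call.func.value.id == "plt"
                and call.func.attr == "title"):
            # Re-render the argument and build the replacement statement
            arg = ast.unparse(call.args[0]) if call.args else "''"
            return ast.parse(f"ax.set_title({arg})").body
        return node

code = "plt.title('Total Hours')"
new_code = ast.unparse(TitleRewriter().visit(ast.parse(code)))
# new_code is now "ax.set_title('Total Hours')"
```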
gventuri commented 1 year ago

Hey @Terrevue thanks a lot for sharing and for the great job! Would you mind opening a PR with your changes?

Terrevue commented 1 year ago

Will do. I encountered an issue with other types of charts that I'm working through now. Once I have that wrapped up, I'll open a PR.

jonbiemond commented 1 year ago

If save_charts needs to be False, maybe we can raise an exception when it isn't?

Terrevue commented 1 year ago

I updated the code to account for several conditions and opened the PR. But I'm finding the variations I'm getting from GPT are unmanageable to parse. The better approach is to update the prompt with strict requirements whose output can then be parsed reliably. When I use this prompt, I get consistent, manageable results; however, this won't work for everyone's use case.

If asked to chart the data, your response should include Python code that charts the dataframe `df` using matplotlib.

The charting code should be in the form of a function, `plot_df`, that takes in a DataFrame and a filename (optional) and returns a base64-encoded string of the plot image for use in a Flask app. If a filename is provided, the function should also save the image to that file.

The `plot_df` function should follow these steps:

1. Create a figure and axes: `fig, ax = plt.subplots()`
2. Call the plot method on the dataframe and specify the ax argument: `df.plot(ax=ax)`
3. Customize the plot as necessary, for example, set a title: `ax.set_title('My Title')`
4. If a filename is provided, save the plot to that file: `if filename: fig.savefig(filename)`
5. Convert the plot to a base64-encoded string: `buf = BytesIO(); fig.savefig(buf, format='png'); img_str = base64.b64encode(buf.getbuffer()).decode('ascii')`
6. Return the base64-encoded string: `return img_str`
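The six steps above can be sketched as a single function (a minimal sketch under the prompt's assumptions; the hardcoded title stands in for whatever customization the model generates):

```python
import base64
from io import BytesIO

import matplotlib
matplotlib.use("Agg")  # headless backend, suitable for a Flask app
import matplotlib.pyplot as plt
import pandas as pd

def plot_df(df: pd.DataFrame, filename=None) -> str:
    fig, ax = plt.subplots()        # 1. create a figure and axes
    df.plot(ax=ax)                  # 2. plot onto the axes
    ax.set_title("My Title")        # 3. customize the plot
    if filename:                    # 4. optionally save to a file
        fig.savefig(filename)
    buf = BytesIO()                 # 5. encode the plot as a base64 PNG
    fig.savefig(buf, format="png")
    img_str = base64.b64encode(buf.getbuffer()).decode("ascii")
    plt.close(fig)
    return img_str                  # 6. return the base64 string
```

A Flask view can then embed the returned string directly in an `<img src="data:image/png;base64,...">` tag.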

I believe the better solution is twofold: 1) stricter prompting and 2) allowing the user to customize the prompt. I'll keep testing and open another PR if I can find something strict but generic enough to support all but the edge cases.

jonbiemond commented 1 year ago

@Terrevue What kind of variations are you seeing?

I think if we can solve them programmatically, then we can have PandasAI behave more consistently and provide a better user experience.

gventuri commented 1 year ago

@jonbiemond agreed!

@Terrevue could you make an example of a query that you are running (i.e. Plot a chart of the users grouped by country) and what the expected output should be for you to consider it a success?

I'll dig deeper into it! Thanks a lot!

gventuri commented 1 year ago

P.S. @Terrevue I just added a couple of tests and made sure the linting passes.

Terrevue commented 1 year ago

Here's a perfect example of what occurs without stricter prompting:

I ask this question: "Chart the total hours each month for the CostCenter US for the current fiscal year"

The following code is produced.

```python
import pandas as pd

# convert Date column to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# filter for current fiscal year (July 1, 2022 - June 30, 2023) and CostCenter US
start_date = pd.to_datetime('2022-07-01')
end_date = pd.to_datetime('2023-06-30')
us_df = df[(df['CostCenter'] == 'US') & (df['Date'] >= start_date) & (df['Date'] <= end_date)]

# group by month and sum the Hours column
monthly_hours = us_df.groupby(pd.Grouper(key='Date', freq='M'))['Hours'].sum()

# plot the results
monthly_hours.plot(kind='bar', xlabel='Month', ylabel='Total Hours', title='Total Hours per Month for CostCenter US')
```

So now I adjust my middleware code to look for the plot() call and all is well.

Then I use the same dataset and just change one variable:

"Chart the total hours each month for the CostCenter CD for the current fiscal year"

And the following code is produced:

```python
import pandas as pd
import matplotlib.pyplot as plt

# convert Date column to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# filter for current fiscal year (July 1, 2022 - June 30, 2023)
start_date = pd.to_datetime('2022-07-01')
end_date = pd.to_datetime('2023-06-30')
df_fy = df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]

# filter for CostCenter CD
df_cc = df_fy[df_fy['CostCenter'] == 'CD']

# group by month and sum hours
df_cc_month = df_cc.groupby(pd.Grouper(key='Date', freq='M'))['Hours'].sum()

# plot bar chart
plt.bar(df_cc_month.index.strftime('%b'), df_cc_month.values)
plt.title('Total Hours per Month for CostCenter CD (FY 2023)')
plt.xlabel('Month')
plt.ylabel('Total Hours')
```

GPT randomly decided to use plt.bar rather than df.plot. Without tighter guardrails around the prompt, this level of variability will make consistent parsing unmanageable.

jonbiemond commented 1 year ago

@Terrevue Yes, I see what you mean. There is some related discussion in #277.

My concern with tightening up the prompt is that we limit the functionality of PandasAI; for example, what if a user would like to use Seaborn as their plotting library?

It wouldn't be entirely trivial, but I think we can pretty consistently identify when a plot is being generated by inspecting the expressions stated. Then we can inject `plt.gcf()`; from there we can be really flexible with the output: display the chart, return it as an object, convert it to base64, etc.
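The `plt.gcf()` idea can be sketched as follows: after running the generated plotting code, grab whatever figure it produced and convert it however the caller wants (the `plt.bar` line below is a stand-in for arbitrary LLM-generated code):

```python
import base64
from io import BytesIO

import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

plt.bar(["Jan", "Feb"], [10, 20])  # stand-in for LLM-generated plotting code

fig = plt.gcf()                    # grab the current figure, whatever produced it
buf = BytesIO()
fig.savefig(buf, format="png")     # from here: display, return, or encode
data = base64.b64encode(buf.getbuffer()).decode("ascii")
plt.close(fig)
```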

Terrevue commented 1 year ago

I think there is a middle ground. Adding a structure to the prompt without limiting the content is a bit trickier but definitely doable. Adding structure directives doesn't have to impact content. For example:

If asked to chart the data, ensure the generated python code has the following structure:

  1. Create separate functions for the dataframe calculations and the plot display functions
  2. The dataframe calculation function should be called df_calc and should return a single dataframe
  3. The output of the df_calc should be saved to a dataframe variable called df_product
  4. The plotting functionality should be contained within a function called plot_df
  5. plot_df should take the following parameters: df, title, x_axis, y_axis, save_plot_filename ...etc

That would ensure that there is a well defined structure around the output that doesn't restrict the content. Then it would be simple to intercept the "plot_df" function call and replace it with something else. Essentially, we should direct GPT to follow best practices rather than let it freeform spaghetti code which would, in turn, open the door to far more middleware options.
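Code that follows the structure directives above might look like the following (a hypothetical illustration: `df_calc`, `df_product`, and `plot_df` are the names the directives mandate, but the bodies and sample data are invented here):

```python
import base64
from io import BytesIO

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd

def df_calc(df: pd.DataFrame) -> pd.DataFrame:
    # dataframe calculations only; returns a single dataframe (directive 2)
    return df.groupby("Month", as_index=False)["Hours"].sum()

def plot_df(df, title, x_axis, y_axis, save_plot_filename=None):
    # plotting functionality isolated in its own function (directives 4-5)
    fig, ax = plt.subplots()
    ax.bar(df[x_axis], df[y_axis])
    ax.set_title(title)
    if save_plot_filename:
        fig.savefig(save_plot_filename)
    buf = BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return base64.b64encode(buf.getbuffer()).decode("ascii")

df = pd.DataFrame({"Month": [1, 1, 2], "Hours": [3.0, 4.0, 5.0]})
df_product = df_calc(df)  # output saved to df_product (directive 3)
img = plot_df(df_product, "Total Hours", "Month", "Hours")
```

With this shape, a middleware only has to intercept the `plot_df` call site, regardless of what charting code the model writes inside it.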

gventuri commented 1 year ago

@Terrevue the downside is: how do we make the prompt more specific, considering that the prompt is not only for charts? Do you suggest we implement a CoT to figure out which prompt to use?

Terrevue commented 1 year ago

Here is what I have been working on to address this issue:

```python
text: str = """
Date: {today_date}
You are provided with a pandas DataFrame, 'df', with the following metadata:

Number of Rows: {num_rows}
Number of Columns: {num_columns}
Data Preview:
{df_head}

{START_CODE_TAG}
Here is a placeholder Python code with a clear structure and naming convention based on the phases of data analysis. Your task is to generate the specific code within this template and ensure the requested python code is prefixed with {START_CODE_TAG} exactly and suffixed with {END_CODE_TAG} exactly.

# The Python code should be structured as follows:

class DataFrameAnalysis:
    def __init__(self, df):
        self.df = df.copy()  # To ensure the original df is not modified in place
        self.df_output = []  # An array to store multiple outputs
        self.df_output.append(dict(type = "dataframe", result = self.df))  # Add the initial dataframe as the first output
        # TODO: Add additional items to the df_output list as necessary to support other output types such as plots, etc.

    # 1. Prepare: Preprocessing and cleaning data if necessary
    def prepare_data(self):
        # TODO: Insert your generated Python code here
        self.prepare_data_hook()

    # 2. Process: Manipulating data for analysis (grouping, filtering, aggregating, etc.)
    def process_data(self):
        # TODO: Insert your generated Python code here
        self.process_data_hook()

    # 3. Analyze: Conducting the actual analysis
    def analyze_data(self):
        # TODO: Insert your generated Python code here
        # If generating a plot, create a figure and axes using plt.subplots(),
        # and generate the plot on the axes object. Do not include plt.show() in this method.
        # Any output should be added to the df_output list.

        self.analyze_data_hook()

    # 4. Output: Returning the result in a standardized format
    def output_data(self):
        # TODO: Insert your generated Python code here
        self.output_data_hook()
        # TODO: Set the result type and value in the df_output dictionary. The result could be a DataFrame, plot, etc. If returning self.df, use self.df_output[0].

        return self.df_output

    # Hook methods
    def prepare_data_hook(self):
        pass

    def process_data_hook(self):
        pass

    def analyze_data_hook(self):
        pass

    def output_data_hook(self):
        pass

    def run(self):
        self.prepare_data()
        self.process_data()
        self.analyze_data()
        return self.output_data()

# TODO: The following code should be outside the class definition and remain unchanged
analysis = DataFrameAnalysis(df)
result = analysis.run()
{END_CODE_TAG}

Using the provided DataFrame, 'df', please generate the specific Python code to be inserted into the respective methods in order to answer the following question:
"""
```

This direction is working out well in testing. In its simplest form, the user can consume the result dataframe. For more flexibility, a middleware can be included that allows for modifying hooks and extending the base DataFrameAnalysis class.
Here is a sample output:

```python
import pandas as pd
import matplotlib.pyplot as plt

class DataFrameAnalysis:
    def __init__(self, df):
        self.df = df.copy()  # To ensure the original df is not modified in place
        self.df_output = []  # An array to store multiple outputs
        self.df_output.append(dict(type = "dataframe", result = self.df))  # Add the initial dataframe as the first output
        # TODO: Add additional items to the df_output list as necessary to support other output types such as plots, etc.

    # 1. Prepare: Preprocessing and cleaning data if necessary
    def prepare_data(self):
        self.df['Date'] = pd.to_datetime(self.df['Date'])
        self.df['Month'] = self.df['Date'].dt.month
        self.df['Year'] = self.df['Date'].dt.year
        self.df = self.df[(self.df['CostCenter'] == 'AS') & (self.df['Year'] == 2023)]
        self.prepare_data_hook()

    # 2. Process: Manipulating data for analysis (grouping, filtering, aggregating, etc.)
    def process_data(self):
        self.df = self.df.groupby('Month')['Hours'].sum().reset_index()
        self.process_data_hook()

    # 3. Analyze: Conducting the actual analysis
    def analyze_data(self):
        fig, ax = plt.subplots()
        ax.bar(self.df['Month'], self.df['Hours'])
        ax.set_xlabel('Month')
        ax.set_ylabel('Total Hours')
        ax.set_title('Total Hours per Month for CostCenter AS')
        self.df_output.append(dict(type='plot', result=fig))
        self.analyze_data_hook()

    # 4. Output: Returning the result in a standardized format
    def output_data(self):
        self.output_data_hook()
        self.df_output.append(dict(type='dataframe', result=self.df))
        return self.df_output

    # Hook methods
    def prepare_data_hook(self):
        pass

    def process_data_hook(self):
        pass

    def analyze_data_hook(self):
        pass

    def output_data_hook(self):
        pass

    def run(self):
        self.prepare_data()
        self.process_data()
        self.analyze_data()
        return self.output_data()

# TODO: The following code should be outside the class definition and remain unchanged
analysis = DataFrameAnalysis(df)
result = analysis.run()
```
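A caller consumes the returned `result` list by filtering on each entry's `"type"` tag, for example in a Flask view (a hypothetical sketch; the placeholder strings stand in for real DataFrame and Figure objects):

```python
# Each entry in `result` is a dict with a "type" tag and a "result" payload,
# so callers can pick out plots and dataframes without parsing stdout.
result = [
    {"type": "dataframe", "result": "<original df>"},
    {"type": "plot", "result": "<matplotlib figure>"},
    {"type": "dataframe", "result": "<grouped df>"},
]

plots = [item["result"] for item in result if item["type"] == "plot"]
frames = [item["result"] for item in result if item["type"] == "dataframe"]
```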
gventuri commented 1 year ago

@Terrevue tried it out with a few prompts and the results are just outstanding. This makes it way more flexible and customizable.

We should definitely refactor the code to have this architecture in place, making sure some of the comments are more generic and less matplotlib-oriented!

But this is a super strong approach, I really love it and having it implemented will be one of our priorities šŸš€

Terrevue commented 1 year ago

Great! Here is the updated run_code to support the class generation. Rather than capturing standard out, it reads the result dataframe back out of the execution environment's variables. (I'll submit a PR if needed.)

```python
    def run_code(
        self,
        code: str,
        data_frame: pd.DataFrame,
        use_error_correction_framework: bool = True,
    ) -> str:
        """
        Execute the python code generated by LLMs to answer the question
        about the input dataframe. Run the code in the current context and
        return the result.

        Args:
            code (str): Python code to execute
            data_frame (pd.DataFrame): Full Pandas DataFrame
            use_error_correction_framework (bool): Turn on the error
                correction mechanism. Defaults to True.

        Returns:
            result: The result of the code execution. The type of the result
                depends on the generated code.
        """

        multiple: bool = isinstance(data_frame, list)

        # Add save chart code
        if self._save_charts:
            code = add_save_chart(code, self._prompt_id, not self._verbose)

        # Get the code to run, removing unsafe imports and df overwrites
        code_to_run = self._clean_code(code)
        self.last_code_executed = code_to_run
        self.log(
            f"""
Code running:

{code_to_run}
"""
        )

        environment: dict = self._get_environment()
        # Add the global namespace (including built-ins) to the environment
        environment.update(globals())

        if multiple:
            environment.update(
                {f"df{i}": dataframe for i, dataframe in enumerate(data_frame, start=1)}
            )
        else:
            environment["df"] = data_frame

        count = 0
        while count < self._max_retries:
            try:
                # Execute the code
                exec(code_to_run, environment)
                code = code_to_run
                break
            except Exception as e:
                if not use_error_correction_framework:
                    raise e

                count += 1
                code_to_run = self._retry_run_code(code, e, multiple)

        # Read the result out of the execution environment rather than
        # capturing stdout
        result = environment.get('result', None)

        return result
```
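The exec-and-read-back pattern at the heart of this change can be illustrated in a few lines: the generated code runs inside a dict namespace, and the `result` variable is read out of that namespace afterwards instead of being captured from stdout.

```python
import pandas as pd

# Stand-in for LLM-generated code ending in `result = ...`
code_to_run = "result = int(df['a'].sum())"

environment = {"df": pd.DataFrame({"a": [1, 2, 3]})}
exec(code_to_run, environment)          # run the code in the namespace
result = environment.get("result")      # read the answer back out -> 6
```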
Terrevue commented 1 year ago

One minor update to the prompt. Since GPT can get lazy sometimes, place this before the class instantiation to ensure the imports happen:

TODO: Import any necessary libraries

gventuri commented 1 year ago

@Terrevue sounds great, would love it if you could submit a PR with all the changes. I gave it a couple of tests and it seems to be working quite solidly! Super great job šŸš€

aaossa commented 10 months ago

Hi, is there any update on this? I'm working on a bot that needs to send the chart through an API, and I'm getting errors on the PandasAI end because the temp file is not found somehow. Something like this middleware would be great if included in the library šŸ˜…

aaryaJha1 commented 8 months ago

@Terrevue at which stage are you using the generated class exactly? From my understanding of the code so far, different pipelines are created, and steps are added to each pipeline as logic units. Users can also add their own logic units. Will you be adding this as a logic unit before the code-generation pipeline?