Closed: Terrevue closed this 4 months ago
Hey @Terrevue thanks a lot for sharing and for the great job! Would you mind opening a PR with your changes?
Will do. I encountered an issue with other types of charts that I'm working through now. Once I have that wrapped up, I'll open a PR.
If `save_charts` needs to be False, maybe we can raise an exception when it isn't?
I updated the code to account for several conditions and opened the PR. But I'm finding the variations I get from GPT ridiculous to try to parse. The better approach is to update the prompt with strict requirements whose output can then be parsed. When I use this prompt, I get consistent, manageable results; however, this won't work for everyone's use case.
If asked to chart the data, your response should include a python code that charts the dataframe `df` utilizing matplotlib.
The charting code should be in the form of a function, `plot_df`, that takes in a DataFrame and a filename (optional) and returns a base64-encoded string of the plot image for use in a Flask app. If a filename is provided, the function should also save the image to that file.
The `plot_df` function should follow these steps:
1. Create a figure and axes: `fig, ax = plt.subplots()`
2. Call the plot method on the dataframe and specify the ax argument: `df.plot(ax=ax)`
3. Customize the plot as necessary, for example, set a title: `ax.set_title('My Title')`
4. If a filename is provided, save the plot to that file: `if filename: fig.savefig(filename)`
5. Convert the plot to a base64-encoded string: `buf = BytesIO(); fig.savefig(buf, format='png'); img_str = base64.b64encode(buf.getbuffer()).decode('ascii')`
6. Return the base64-encoded string: `return img_str`
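Put together, the six steps above could look something like this (a minimal sketch; the column names and title are placeholders, and `plt.close(fig)` is added to avoid leaking figures in a long-running Flask app):

```python
import base64
from io import BytesIO

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for a server process
import matplotlib.pyplot as plt
import pandas as pd


def plot_df(df: pd.DataFrame, filename: str = None) -> str:
    """Plot `df`, optionally save to `filename`, and return a base64 PNG string."""
    fig, ax = plt.subplots()
    df.plot(ax=ax)
    ax.set_title("My Title")
    if filename:
        fig.savefig(filename)
    buf = BytesIO()
    fig.savefig(buf, format="png")
    img_str = base64.b64encode(buf.getbuffer()).decode("ascii")
    plt.close(fig)  # free the figure so repeated requests don't accumulate
    return img_str
```

The returned string can be dropped straight into an `<img src="data:image/png;base64,...">` tag in a Flask template.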
I believe the better solution is twofold: 1) stricter prompting and 2) allowing the user to customize the prompt. I'll keep testing and open another PR if I can find something strict but generic enough to support all but the edge cases.
@Terrevue What kind of variations are you seeing?
I think if we can solve them programmatically then we can have PandasAI behave more consistently and provide a better user experience.
@jonbiemond agreed!
@Terrevue could you share an example of a query that you are running (e.g. "Plot a chart of the users grouped by country") and what the expected output should be for you to consider it a success?
I'll dig deeper into it! Thanks a lot!
P.S. @Terrevue I just added a couple of tests and made sure the linting is passing.
Here's a perfect example of what occurs without stricter prompting:
I ask this question: "Chart the total hours each month for the CostCenter US for the current fiscal year"
The following code is produced.
import pandas as pd
# convert Date column to datetime format
df['Date'] = pd.to_datetime(df['Date'])
# filter for current fiscal year (July 1, 2022 - June 30, 2023) and CostCenter US
start_date = pd.to_datetime('2022-07-01')
end_date = pd.to_datetime('2023-06-30')
us_df = df[(df['CostCenter'] == 'US') & (df['Date'] >= start_date) & (df['Date'] <= end_date)]
# group by month and sum the Hours column
monthly_hours = us_df.groupby(pd.Grouper(key='Date', freq='M'))['Hours'].sum()
# plot the results
monthly_hours.plot(kind='bar', xlabel='Month', ylabel='Total Hours', title='Total Hours per Month for CostCenter US')
So now I adjust my middleware code to look for the plot() call and all is well.
Then I use the same dataset and just change one variable:
"Chart the total hours each month for the CostCenter CD for the current fiscal year"
And the following code is produced:
import pandas as pd
import matplotlib.pyplot as plt
# convert Date column to datetime format
df['Date'] = pd.to_datetime(df['Date'])
# filter for current fiscal year (July 1, 2022 - June 30, 2023)
start_date = pd.to_datetime('2022-07-01')
end_date = pd.to_datetime('2023-06-30')
df_fy = df[(df['Date'] >= start_date) & (df['Date'] <= end_date)]
# filter for CostCenter CD
df_cc = df_fy[df_fy['CostCenter'] == 'CD']
# group by month and sum hours
df_cc_month = df_cc.groupby(pd.Grouper(key='Date', freq='M'))['Hours'].sum()
# plot bar chart
plt.bar(df_cc_month.index.strftime('%b'), df_cc_month.values)
plt.title('Total Hours per Month for CostCenter CD (FY 2023)')
plt.xlabel('Month')
plt.ylabel('Total Hours')
GPT randomly decided to use plt.bar rather than df.plot. Without tighter guardrails around the prompt, this level of variability will make consistent parsing unmanageable.
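For what it's worth, even an AST-based check only catches the first shape (a hypothetical sketch; the actual middleware code isn't shown in this thread):

```python
import ast


def calls_plot_method(code: str) -> bool:
    """Return True if the code calls a `.plot(...)` method on anything,
    e.g. `df.plot(...)` or `monthly_hours.plot(...)`."""
    for node in ast.walk(ast.parse(code)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "plot"
        ):
            return True
    return False
```

This returns True for the first snippet but False for the second, because `plt.bar(...)` never calls `.plot`, which is exactly the variability being described.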
@Terrevue Yes, I see what you mean. There is some related discussion in #277.
My concern with tightening up the prompt is that we limit the functionality of PandasAI; for example, what if a user would like to use Seaborn as their plotting library?
It wouldn't be entirely trivial, but I think we can pretty consistently identify when a plot is being generated by inspecting the expressions in the code. Then we can inject plt.gcf(), and from there we can be really flexible with the output: display the chart, return it as an object, convert it to base64, etc.
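That injection idea could be sketched roughly like this (the plotting-call heuristic, the attribute list, and the `fig` variable name are all my assumptions, not code from PandasAI):

```python
import ast

# Heuristic: attribute calls that usually mean "a chart is being drawn"
PLOT_ATTRS = {"plot", "bar", "barh", "hist", "scatter", "pie", "line"}


def inject_gcf(code: str) -> str:
    """Append `fig = plt.gcf()` when the generated code appears to draw a
    plot, so the current figure can be captured however it was created."""
    draws_plot = any(
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr in PLOT_ATTRS
        for node in ast.walk(ast.parse(code))
    )
    if draws_plot:
        code += "\nfig = plt.gcf()"
    return code
```

Because `plt.gcf()` returns the current figure regardless of whether the code used `df.plot(...)` or `plt.bar(...)`, the caller can then display, save, or base64-encode `fig` uniformly.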
I think there is a middle ground. Adding structure to the prompt without limiting the content is a bit trickier but definitely doable; structure directives don't have to impact content. For example:
If asked to chart the data, ensure the generated python code has the following structure:
That would ensure there is a well-defined structure around the output without restricting the content. Then it would be simple to intercept the `plot_df` function call and replace it with something else. Essentially, we should direct GPT to follow best practices rather than letting it freeform spaghetti code, which would, in turn, open the door to far more middleware options.
@Terrevue the downside is: how do we make the prompt more specific, considering that the prompt is not only for charts? Do you suggest we implement a CoT to figure out which prompt to use?
Here is what I have been working on to address this issue:
text: str = """
Date: {today_date}

You are provided with a pandas DataFrame, 'df', with the following metadata:

Number of Rows: {num_rows}
Number of Columns: {num_columns}

Data Preview:
{df_head}

Here is a placeholder Python code with a clear structure and naming convention based on the phases of data analysis. Your task is to generate the specific code within this template, and to ensure the requested python code is prefixed with {START_CODE_TAG} exactly and suffixed with {END_CODE_TAG} exactly.

{START_CODE_TAG}
# The Python code should be structured as follows:
class DataFrameAnalysis:
    def __init__(self, df):
        self.df = df.copy()  # To ensure the original df is not modified in place
        self.df_output = []  # A list to store multiple outputs
        self.df_output.append(dict(type="dataframe", result=self.df))  # Add the initial dataframe as the first output
        # TODO: Add additional items to the df_output list as necessary to support other output types such as plots, etc.

    # 1. Prepare: Preprocessing and cleaning data if necessary
    def prepare_data(self):
        # TODO: Insert your generated Python code here
        self.prepare_data_hook()

    # 2. Process: Manipulating data for analysis (grouping, filtering, aggregating, etc.)
    def process_data(self):
        # TODO: Insert your generated Python code here
        self.process_data_hook()

    # 3. Analyze: Conducting the actual analysis
    def analyze_data(self):
        # TODO: Insert your generated Python code here
        # If generating a plot, create a figure and axes using plt.subplots(),
        # and generate the plot on the axes object. Do not include plt.show() in this method.
        # Any output should be added to the df_output list.
        self.analyze_data_hook()

    # 4. Output: Returning the result in a standardized format
    def output_data(self):
        # TODO: Insert your generated Python code here
        self.output_data_hook()
        # TODO: Set the result type and value in the df_output dictionary. The result could be a DataFrame, plot, etc. If returning self.df, use self.df_output[0].
        return self.df_output

    # Hook methods
    def prepare_data_hook(self):
        pass

    def process_data_hook(self):
        pass

    def analyze_data_hook(self):
        pass

    def output_data_hook(self):
        pass

    def run(self):
        self.prepare_data()
        self.process_data()
        self.analyze_data()
        return self.output_data()

# The following code should be outside the class definition and remain unchanged
analysis = DataFrameAnalysis(df)
result = analysis.run()
{END_CODE_TAG}

Using the provided DataFrame, 'df', please generate the specific Python code to be inserted into the respective methods in order to answer the following question:
"""
This direction is working out well in testing. In its simplest form, the user can consume the result dataframe. For more flexibility, a middleware can be included that allows for modifying hooks and extending the base DataFrameAnalysis class.
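As an illustration of the middleware idea, a consumer could extend the generated class purely through its hooks. The `AuditedAnalysis` subclass below is hypothetical, and the base class is a minimal stand-in whose real method bodies would be LLM-generated:

```python
import pandas as pd


# Minimal stand-in for a generated DataFrameAnalysis class, mirroring the
# template's structure (real prepare/process/analyze bodies come from the LLM).
class DataFrameAnalysis:
    def __init__(self, df):
        self.df = df.copy()  # ensure the original df is not modified in place
        self.df_output = [dict(type="dataframe", result=self.df)]

    def prepare_data(self):
        self.prepare_data_hook()

    def process_data(self):
        self.process_data_hook()

    def analyze_data(self):
        self.analyze_data_hook()

    def output_data(self):
        self.output_data_hook()
        return self.df_output

    # Hook methods, intentionally empty in the base class
    def prepare_data_hook(self):
        pass

    def process_data_hook(self):
        pass

    def analyze_data_hook(self):
        pass

    def output_data_hook(self):
        pass

    def run(self):
        self.prepare_data()
        self.process_data()
        self.analyze_data()
        return self.output_data()


# Hypothetical middleware: add behaviour without touching the generated code
class AuditedAnalysis(DataFrameAnalysis):
    def prepare_data_hook(self):
        self.df["Audited"] = True  # tag every row during the prepare phase
```

Running `AuditedAnalysis(df).run()` then yields the tagged dataframe as `result[0]["result"]`, while the caller's original `df` stays untouched.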
Here is a sample output :
import pandas as pd
import matplotlib.pyplot as plt

class DataFrameAnalysis:
    def __init__(self, df):
        self.df = df.copy()  # To ensure the original df is not modified in place
        self.df_output = []  # A list to store multiple outputs
        self.df_output.append(dict(type="dataframe", result=self.df))  # Add the initial dataframe as the first output

    # 1. Prepare: Preprocessing and cleaning data if necessary
    def prepare_data(self):
        self.df['Date'] = pd.to_datetime(self.df['Date'])
        self.df['Month'] = self.df['Date'].dt.month
        self.df['Year'] = self.df['Date'].dt.year
        self.df = self.df[(self.df['CostCenter'] == 'AS') & (self.df['Year'] == 2023)]
        self.prepare_data_hook()

    # 2. Process: Manipulating data for analysis (grouping, filtering, aggregating, etc.)
    def process_data(self):
        self.df = self.df.groupby('Month')['Hours'].sum().reset_index()
        self.process_data_hook()

    # 3. Analyze: Conducting the actual analysis
    def analyze_data(self):
        fig, ax = plt.subplots()
        ax.bar(self.df['Month'], self.df['Hours'])
        ax.set_xlabel('Month')
        ax.set_ylabel('Total Hours')
        ax.set_title('Total Hours per Month for CostCenter AS')
        self.df_output.append(dict(type='plot', result=fig))
        self.analyze_data_hook()

    # 4. Output: Returning the result in a standardized format
    def output_data(self):
        self.output_data_hook()
        self.df_output.append(dict(type='dataframe', result=self.df))
        return self.df_output

    # Hook methods
    def prepare_data_hook(self):
        pass

    def process_data_hook(self):
        pass

    def analyze_data_hook(self):
        pass

    def output_data_hook(self):
        pass

    def run(self):
        self.prepare_data()
        self.process_data()
        self.analyze_data()
        return self.output_data()

# The following code is outside the class definition and remains unchanged
analysis = DataFrameAnalysis(df)
result = analysis.run()
@Terrevue I tried it out with a few prompts and the results are just outstanding. This makes it way more flexible and customizable.
We should definitely refactor the code to have this architecture in place, making sure some of the comments are more generic and less matplotlib-oriented!
But this is a super strong approach, I really love it, and having it implemented will be one of our priorities!
Great! Here is the updated run_code to support the class generation. Rather than capturing standard out, it retrieves the result from the execution environment. (I'll submit a PR if needed.)
def run_code(
    self,
    code: str,
    data_frame: pd.DataFrame,
    use_error_correction_framework: bool = True,
) -> str:
    """
    Execute the python code generated by LLMs to answer the question
    about the input dataframe. Run the code in the current context and
    return the result.

    Args:
        code (str): Python code to execute
        data_frame (pd.DataFrame): Full pandas DataFrame
        use_error_correction_framework (bool): Turn on the error-correction mechanism. Defaults to True.

    Returns:
        result: The result of the code execution. The type of the result depends on the generated code.
    """
    multiple: bool = isinstance(data_frame, list)

    # Add save-chart code
    if self._save_charts:
        code = add_save_chart(code, self._prompt_id, not self._verbose)

    # Get the code to run, removing unsafe imports and df overwrites
    code_to_run = self._clean_code(code)
    self.last_code_executed = code_to_run
    self.log(
        f"""
Code running:
{code_to_run}
"""
    )

    environment: dict = self._get_environment()
    environment.update(globals())  # Add the global namespace (including built-ins) to the environment

    if multiple:
        environment.update(
            {f"df{i}": dataframe for i, dataframe in enumerate(data_frame, start=1)}
        )
    else:
        environment["df"] = data_frame

    count = 0
    while count < self._max_retries:
        try:
            # Execute the code
            exec(code_to_run, environment)
            code = code_to_run
            break
        except Exception as e:
            if not use_error_correction_framework:
                raise e
            count += 1
            code_to_run = self._retry_run_code(code, e, multiple)

    result = environment.get("result", None)
    return result
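On the consuming side, the returned list can then be dispatched by its `type` key (a sketch assuming the dataframe/plot output types from the template; `render_outputs` is a hypothetical helper, not part of PandasAI):

```python
import base64
from io import BytesIO


def render_outputs(outputs):
    """Turn each output dict into something a web app can serve:
    dataframes become HTML tables, matplotlib figures become base64 PNGs."""
    rendered = []
    for item in outputs:
        if item["type"] == "dataframe":
            rendered.append(item["result"].to_html())
        elif item["type"] == "plot":
            buf = BytesIO()
            item["result"].savefig(buf, format="png")
            rendered.append(base64.b64encode(buf.getbuffer()).decode("ascii"))
    return rendered
```

This keeps the LLM-generated class free of any presentation logic while still supporting multiple outputs per question.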
One minor update to the prompt. Since GPT can get lazy sometimes, place this before the class instantiation to ensure the imports happen:
@Terrevue sounds great, would love it if you could submit a PR with all the changes. I gave it a couple of tests and it seems to be working quite solidly! Super great job!
Hi, is there any update on this? I'm working on a bot that needs to send the chart through an API, and I'm getting errors on the PandasAI end because the temp file is somehow not found. Something like this middleware would be great to have in the library!
@Terrevue at which stage are you using the generated class, exactly? From my understanding of the code so far, different pipelines are created, and steps are added to each pipeline as logic units. Users can also add their own logic units. Will you be adding this as a logic unit before the code-generation pipeline?
Unable to display plot in Flask app. See below for the suggested fix. Feel free to use this and close this issue. Just placing it here in case someone else needs it.
Discussed in https://github.com/gventuri/pandas-ai/discussions/303