Code generated can overrides the df

ArhamCMB commented 1 year ago

Hello, I am trying to understand the output of a test run.

Here is my code:

pandas_ai = PandasAI(llm, verbose=True)
pandas_ai.run(
    df,
    "Plot a horizontal bar chart, showing GDP of each country per bar, sorted in decending order by GDP. Each bar should be a different color.",
)

Here is the verbose output:

import pandas as pd
import matplotlib.pyplot as plt

# create the dataframe
data = {'country': ['France', 'Germany', 'United States', 'Italy', 'United Kingdom'],
        'gdp': [2411255037952, 4384348979, 3917537461, 5079579099, 53509356],
        'happiness_index': [6.38, 6.38, 7.16, 6.66, 7.07]}
df = pd.DataFrame(data)

# sort the dataframe by GDP in descending order
df = df.sort_values(by='gdp', ascending=False)

# plot the horizontal bar chart
plt.barh(df['country'], df['gdp'], color=['red', 'blue', 'green', 'orange', 'purple'])
plt.xlabel('GDP')
plt.ylabel('Country')
plt.title('GDP by Country')
plt.show()

It looks like the LLM created a whole new data frame with completely different (and incorrect!) figures from the original data frame, even though I supplied it with a data frame.

Why did it create a new data frame? Is this hallucination?

Thank you!

danieleongari commented 1 year ago

Yes, it happened to me too. The example dataset from the tutorials is not good for testing because (I guess) it is a common example, and LLMs often seem to try to recreate it from scratch since they know the pattern.

Once it was even funnier: I asked "Which country..." and it returned "Qatar", which was not in the dataset!

I guess the best approach, for the moment, is to use a proprietary or generated dataset to avoid hard-to-spot hallucinations and, as you said, to check the code with verbose=True.

ArhamCMB commented 1 year ago

Thank you! Good to know that this happens to others as well. I will have to keep an eye out for this type of hallucination.

gventuri commented 1 year ago

Potential naive solution: we programmatically check whether df is overridden and somehow fix that!

But I think a more long-term solution can also be found with prompt engineering!

What do you think?

aditikhare007 commented 1 year ago

Contextual prompt engineering could be a solution for this. I will check this out.

gventuri commented 1 year ago

Great @aditikhare007 go for it! 🚀

danieleongari commented 1 year ago

@gventuri I think the solution may come from combining both of your suggestions:

If you add too many instructions to the prompt (especially in the negative form, such as "don't make a new dataset"), the LLM can lose focus on the main task.
If you only check the df and find that it is different, then you are stuck.

Using both means:

1) Use a "gentle" prompt like "Use the provided dataset to..."
2) Check whether the df has been overwritten with something else (not sure how).
3) If it has been overwritten, re-run the GPT query with a stronger statement that only the current data may be used, not similar or generated data.
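
The three steps above can be sketched roughly as follows. This is only an illustration, not PandasAI's actual implementation: `ask_llm` is a hypothetical callable that takes a prompt and returns generated Python source, the `GENTLE` and `STRICT` prefixes are made-up wording, and the check only recognises assignments of the form `df = <something>.DataFrame(...)`.

```python
import ast
from typing import Callable

# Hypothetical prompt prefixes; the exact wording is an assumption.
GENTLE = "Use the provided dataframe df to answer: "
STRICT = "Use ONLY the existing dataframe df. Do not create or invent any data. "


def overwrites_df(code: str, name: str = "df") -> bool:
    """Return True if `code` assigns a newly constructed DataFrame to `name`."""
    for node in ast.walk(ast.parse(code)):
        if not isinstance(node, ast.Assign):
            continue
        hits_name = any(
            isinstance(target, ast.Name) and target.id == name
            for target in node.targets
        )
        value = node.value
        builds_frame = (
            isinstance(value, ast.Call)
            and isinstance(value.func, ast.Attribute)
            and value.func.attr == "DataFrame"
        )
        if hits_name and builds_frame:
            return True
    return False


def generate_with_retry(
    ask_llm: Callable[[str], str], question: str, max_retries: int = 2
) -> str:
    """Step 1: gentle prompt. Step 2: check. Step 3: retry with a stricter prompt."""
    code = ask_llm(GENTLE + question)
    for _ in range(max_retries):
        if not overwrites_df(code):
            return code
        code = ask_llm(STRICT + question)
    if overwrites_df(code):
        raise ValueError("generated code still overwrites df")
    return code
```

A legitimate statement such as `df = df.sort_values(...)` is not flagged, because the called attribute is `sort_values` rather than `DataFrame`.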

ArhamCMB commented 1 year ago

I have had some success prefacing the prompt with "Using the 'gdp' column of the data frame provided". Example code and the resulting output are below. I haven't had the opportunity to test it thoroughly, but this method should fit my workflow for now. Manual steps:

Visually check that a duplicate df has not been generated.
Change to ascending=True within the sort_values method. Not sure why this happens.

Code:

import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
import matplotlib.pyplot as plt

llm = OpenAI()
pandas_ai = PandasAI(llm, verbose=True)
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})

pandas_ai.run(
    df,
    "Using the 'gdp' column of the data frame provided, plot a horizontal bar chart. \
        Show GDP of each country per bar, sorted in decending order by GDP. \
            Each bar should be a different color.",
)

Output

Running PandasAI with openai LLM...

Code generated:

df = df.sort_values(by='gdp', ascending=False)
plt.barh(df['country'], df['gdp'], color=['red', 'orange', 'yellow', 'green', 'blue', 'purple', 'pink', 'brown', 'gray', 'black'])
plt.xlabel('GDP')
plt.ylabel('Country')
plt.title('GDP by Country')
plt.show()

Conversational answer: Sure! To create a horizontal bar chart showing the GDP of each country, we can use the 'gdp' column of the provided data frame. We'll want to sort the bars in descending order by GDP and make each bar a different color.

[Image: chart_gdp]

ArhamCMB commented 1 year ago

Welp! I did another test with a small change to the prompt. It looks like prepending "Using the 'gdp' column of the data frame provided" does not guarantee that df won't be overwritten...

Code:

import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
import matplotlib.pyplot as plt

llm = OpenAI()
pandas_ai = PandasAI(llm, verbose=True)
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})

pandas_ai.run(
    df,
    "Using the 'gdp' column of the data frame provided, plot a horizontal bar chart. \
        Show GDP of each country per bar, sorted in decending order by GDP. \
            The color of bars should be blue.",
)

Output:

Running PandasAI with openai LLM...

Code generated:

import pandas as pd
import matplotlib.pyplot as plt

# create the dataframe
data = {'country': ['United States', 'Italy', 'Germany', 'United Kingdom', 'France'],
        'gdp': [1745433788416, 6599626513, 8231626589, 8906471590, 1348903926],
        'happiness_index': [7.07, 6.38, 7.16, 6.66, 6.38]}
df = pd.DataFrame(data)

# sort the dataframe by GDP in descending order
df = df.sort_values('gdp', ascending=False)

# plot the horizontal bar chart
plt.barh(df['country'], df['gdp'], color='blue')

# set the x-axis label
plt.xlabel('GDP')

# show the plot
plt.show()

Answer: Conversational answer: Conversational answer: Sure! To create a horizontal bar chart showing the GDP of each country, we can use the 'gdp' column of the provided data frame. We'll sort the bars in descending order by GDP and color them blue.

[Image: chart_gdp_2]

aditikhare007 commented 1 year ago

Apart from contextual prompt engineering, retrieval augmentation and vector embeddings could be a solution for fixing LLM hallucinations. I am working on this and will update soon.

gventuri commented 1 year ago

@aditikhare007 go for it, I'm super curious to see if a significant improvement can be obtained with prompt engineering only! Keep us updated!

victor-hugo-dc commented 1 year ago

I was able to replicate this response, but the problem went away when I changed the task instruction in the PandasAI class to the following:

class PandasAI:
    """PandasAI is a wrapper around a LLM to make dataframes conversational"""

    _task_instruction: str = """
Today is {today_date}.
You are provided with a pandas dataframe (df) with {num_rows} rows and {num_columns} columns.
This is the result of `print(df.head({rows_to_display}))`:
{df_head}.

Using the provided dataframe, df, return the python code (do not import anything) and make sure to prefix the requested python code with {START_CODE_TAG} exactly and suffix the code with {END_CODE_TAG} exactly to get the answer to the following question:
"""

Let me know if that also changes things on your end! I think removing any kind of statement that reassigns df to a dataframe object along with prompt engineering would ensure that this problem is resolved. The following could be helpful:

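import ast  # ast.unparse requires Python 3.9+

# `code` is the source string returned by the LLM.
tree = ast.parse(code)

# Drop every top-level statement of the form `df = <something>.DataFrame(...)`.
tree.body = [node for node in tree.body if not (
    isinstance(node, ast.Assign) and
    isinstance(node.targets[0], ast.Name) and
    node.targets[0].id == "df" and
    isinstance(node.value, ast.Call) and
    isinstance(node.value.func, ast.Attribute) and
    node.value.func.attr == "DataFrame"
)]

# Turn the filtered tree back into source code to execute.
code_to_run = ast.unparse(tree)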
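
To see what the filter above does to the hallucinated code reported earlier in this thread, here is a minimal sketch that wraps it in a function and applies it to an abbreviated copy of that output. The function name `strip_df_reassignment` is made up for illustration and is not part of PandasAI.

```python
import ast


def strip_df_reassignment(code: str, name: str = "df") -> str:
    """Remove top-level `name = <something>.DataFrame(...)` statements from `code`."""

    def rebuilds_frame(node: ast.stmt) -> bool:
        return (
            isinstance(node, ast.Assign)
            and isinstance(node.targets[0], ast.Name)
            and node.targets[0].id == name
            and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Attribute)
            and node.value.func.attr == "DataFrame"
        )

    tree = ast.parse(code)
    tree.body = [node for node in tree.body if not rebuilds_frame(node)]
    return ast.unparse(tree)  # requires Python 3.9+


# Abbreviated version of the generated code from the first comment.
generated = """
import pandas as pd
data = {'country': ['France', 'Germany'], 'gdp': [2411255037952, 4384348979]}
df = pd.DataFrame(data)
df = df.sort_values(by='gdp', ascending=False)
"""

cleaned = strip_df_reassignment(generated)
print(cleaned)
```

The `df = pd.DataFrame(data)` line is dropped, while `df = df.sort_values(...)` survives, so the remaining code runs against the dataframe the user supplied. The unused `data` dictionary is left in place because the filter only targets the assignment to `df`.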

gventuri commented 1 year ago

@victor-hugo-dc definitely a good fix! I suggest we implement both what you just shared and what @aditikhare007 is working on. Would you mind opening a PR for that?

amjadraza commented 1 year ago

#Loading CSV file 
import pandas as pd

from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
file_path ='https://raw.githubusercontent.com/gventuri/pandas-ai/main/examples/data/Loan%20payments%20data.csv'
df_in = pd.read_csv(file_path)

llm = OpenAI(api_token=OPENAI_API_KEY)
pandas_ai = PandasAI(llm, verbose=True)
response = pandas_ai.run(df_in, "How many loans are from men that have been paid off?")

Response

Running PandasAI with openai LLM...

Code generated:

# Import pandas library
import pandas as pd

# Create dataframe from given data
df = pd.DataFrame({
    'Loan_ID': ['xqd20160004', 'xqd20168902', 'xqd20166231', 'xqd20160005', 'xqd20160003'],
    'loan_status': ['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'],
    'Principal': [1000, 1000, 1000, 1000, 1000],
    'terms': [15, 30, 15, 30, 30],
    'effective_date': ['9/8/2016', '9/8/2016', '9/9/2016', '9/8/2016', '9/8/2016'],
    'due_date': ['10/7/2016', '10/8/2016', '9/22/2016', '10/7/2016', '10/7/2016'],
    'paid_off_time': ['9/25/2016 16:58', '9/14/2016 19:31', '9/23/2016 21:36', '9/22/2016 20:00', '10/7/2016 9:00'],
    'past_due_days': [pd.NA, pd.NA, pd.NA, pd.NA, pd.NA],
    'age': [28, 27, 50, 33, 33],
    'education': ['Bechalor', 'college', 'college', 'Bechalor', 'High School or Below'],
    'Gender': ['male', 'female', 'female', 'male', 'female']
})

# Count number of loans from men that have been paid off
num_loans = len(df[(df['Gender'] == 'male') & (df['loan_status'] == 'PAIDOFF')])

print(num_loans)


Conversational answer: Two loans have been paid off by men.

It looks like it ran the code on only df_in.head(5). The answer is correct for the displayed rows but not for the complete dataset. Since we pass df_head as an argument when generating the code, is that the cause?
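
A small sketch of why this matters: the same query gives different answers on the first five rows and on the full dataframe. The data below is invented for illustration and is not taken from the loan payments CSV.

```python
import pandas as pd

# Made-up data: eight loans, three of them paid off by men.
full = pd.DataFrame({
    "Gender": ["male", "female", "female", "male", "female", "male", "male", "female"],
    "loan_status": ["PAIDOFF", "PAIDOFF", "PAIDOFF", "PAIDOFF", "PAIDOFF",
                    "PAIDOFF", "COLLECTION", "PAIDOFF"],
})


def count_paid_off_men(frame: pd.DataFrame) -> int:
    """Count rows where the borrower is male and the loan is paid off."""
    mask = (frame["Gender"] == "male") & (frame["loan_status"] == "PAIDOFF")
    return len(frame[mask])


head_count = count_paid_off_men(full.head(5))  # what a rebuilt 5-row df would give
full_count = count_paid_off_men(full)          # what the real df gives
print(head_count, full_count)
```

If the generated code rebuilds df from the five rows shown in the prompt, the result reflects only those rows, which matches the behaviour described above.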

However, the code works well for the dataframe below.

df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [21400000, 2940000, 2830000, 3870000, 2160000, 1350000, 1780000, 1320000, 516000, 14000000],
    "happiness_index": [7.3, 7.2, 6.5, 7.0, 6.0, 6.3, 7.3, 7.3, 5.9, 5.0]
})

Can anyone try this and see if they get the same error? Below is the link to my Google Colab.

https://colab.research.google.com/drive/1PF7PxRHmIusD2kf9bqCLIrhMZB18BF1r#scrollTo=GXeX1QghiC0F

Thanks

gventuri commented 1 year ago

@amjadraza should have been fixed as of 0.2.13! Can you check it out?

Big shout out to @victor-hugo-dc for the updated prompt and the code to remove overridden df!

amjadraza commented 1 year ago

@gventuri, yes, it works well with the OpenAI API and even feels faster. Thanks @victor-hugo-dc.

Sinaptik-AI / pandas-ai

Code generated can overrides the df #114
