ArhamCMB closed this issue 1 year ago
Yes, it happened to me too. The example dataset from the tutorials is not good for testing because (I guess) it is a common example, and the LLM often seems to recreate it from scratch because it knows the pattern.
Once it was even funnier: I asked "Which country..." and it returned "Qatar", which was not in the dataset!
I guess the best approach, for the moment, is to use a proprietary or generated dataset to avoid hard-to-spot hallucinations and, as you said, to check the generated code with verbose=True.
Thank you! Good to know that this happens to others as well. I will have to keep an eye out for this type of hallucination.
Potential naive solution: we programmatically check whether df is overridden and somehow fix that!
But I think a more long-term solution can also be found with prompt engineering!
What do you think?
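One way the "check whether df is overridden" idea could look is sketched below. This is not PandasAI code: `overwrites_df` is a hypothetical helper, and the heuristic is an assumption, namely that an assignment to `df` is suspicious when its right-hand side never reads from `df`. That lets legitimate updates such as `df = df.sort_values(...)` pass while flagging a freshly built frame.

```python
import ast


def overwrites_df(code: str, name: str = "df") -> bool:
    """Return True if `code` rebinds `name` to a value not derived from `name`.

    Hypothetical helper: only plain assignments with a bare-name target
    are inspected (tuple unpacking, `for` targets, etc. are ignored).
    """
    for node in ast.walk(ast.parse(code)):
        if not isinstance(node, ast.Assign):
            continue
        # Does any assignment target bind the bare name (e.g. `df = ...`)?
        binds_name = any(
            isinstance(target, ast.Name) and target.id == name
            for target in node.targets
        )
        if not binds_name:
            continue
        # Does the right-hand side read from the existing `df` anywhere?
        reads_name = any(
            isinstance(sub, ast.Name) and sub.id == name
            for sub in ast.walk(node.value)
        )
        if not reads_name:
            return True
    return False


print(overwrites_df("df = df.sort_values(by='gdp', ascending=False)"))  # False
print(overwrites_df("df = pd.DataFrame({'country': ['Qatar']})"))       # True
```

The check is purely syntactic, so it runs before the generated code is executed, and the "fix" step (dropping the statement or asking the LLM again) can be decided separately.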
Contextual prompt engineering could be a solution for this. I will check it out.
Great @aditikhare007 go for it! 🚀
@gventuri I think the solution may come from combining both of your suggestions. By "both" I mean:
1) Use a "gentle" prompt like "Use the provided dataset to..."
2) Check whether df has been overwritten with something else (not sure how).
3) If it has been overwritten, restart the GPT query with a stronger statement that only the current data may be used, not similar or generated data.
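The three steps above can be sketched as a small retry loop. Everything here is illustrative: `generate_with_retry`, `ask_llm`, `is_overwritten`, and the two prompt prefixes are hypothetical names, not part of PandasAI, and the LLM call and the overwrite check are passed in as plain callables so the flow can be tried without an API key.

```python
from typing import Callable

# Both prefixes are assumptions made for this sketch.
GENTLE_PREFIX = "Use the provided dataset to answer: "
STRICT_PREFIX = (
    "Use ONLY the dataframe `df` that is already defined. "
    "Do not create, load, or generate any other data. "
)


def generate_with_retry(
    question: str,
    ask_llm: Callable[[str], str],
    is_overwritten: Callable[[str], bool],
    max_retries: int = 2,
) -> str:
    """Ask with the gentle prompt first; escalate if the code replaces df."""
    code = ask_llm(GENTLE_PREFIX + question)
    for _ in range(max_retries):
        if not is_overwritten(code):
            return code
        # Step 3: ask again with the stronger statement.
        code = ask_llm(STRICT_PREFIX + question)
    if is_overwritten(code):
        raise RuntimeError("generated code still replaces df")
    return code


# Stand-ins for a real LLM call and a real overwrite check.
def fake_llm(prompt: str) -> str:
    if prompt.startswith(STRICT_PREFIX):
        return "df = df.sort_values('gdp')"
    return "df = pd.DataFrame({'gdp': [1, 2]})"


def crude_check(code: str) -> bool:
    return "pd.DataFrame(" in code


print(generate_with_retry("sort by GDP", fake_llm, crude_check))
```

Capping the retries matters because a stronger prompt does not guarantee compliance; failing loudly is safer than silently running code against invented data.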
I have had some success with prefacing the prompt with "Using the 'gdp' column of the data frame provided". Example code and resulting output below. Haven't had the opportunity to thoroughly test it out, but this method should fit with my workflow for now. Manual steps:
ascending=True within the sort_values method. Not sure why this happens.
Code:
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
import matplotlib.pyplot as plt
llm = OpenAI()
pandas_ai = PandasAI(llm, verbose=True)
df = pd.DataFrame({
"country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
"gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
"happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})
pandas_ai.run(
df,
"Using the 'gdp' column of the data frame provided, plot a horizontal bar chart. \
Show GDP of each country per bar, sorted in decending order by GDP. \
Each bar should be a different color.",
)
Output:
Running PandasAI with openai LLM...
Code generated:
df = df.sort_values(by='gdp', ascending=False)
plt.barh(df['country'], df['gdp'], color=['red', 'orange', 'yellow', 'green', 'blue', 'purple', 'pink', 'brown', 'gray', 'black'])
plt.xlabel('GDP')
plt.ylabel('Country')
plt.title('GDP by Country')
plt.show()
Conversational answer: Sure! To create a horizontal bar chart showing the GDP of each country, we can use the 'gdp' column of the provided data frame. We'll want to sort the bars in descending order by GDP and make each bar a different color.
Welp! Did another test with a small change to the prompt. Looks like prepending "Using the 'gdp' column of the data frame provided" does not guarantee that df won't be overwritten...
Code:
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
import matplotlib.pyplot as plt
llm = OpenAI()
pandas_ai = PandasAI(llm, verbose=True)
df = pd.DataFrame({
"country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
"gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
"happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})
pandas_ai.run(
df,
"Using the 'gdp' column of the data frame provided, plot a horizontal bar chart. \
Show GDP of each country per bar, sorted in decending order by GDP. \
The color of bars should be blue.",
)
Output:
Running PandasAI with openai LLM...
Code generated:
import pandas as pd
import matplotlib.pyplot as plt
# create the dataframe
data = {'country': ['United States', 'Italy', 'Germany', 'United Kingdom', 'France'],
'gdp': [1745433788416, 6599626513, 8231626589, 8906471590, 1348903926],
'happiness_index': [7.07, 6.38, 7.16, 6.66, 6.38]}
df = pd.DataFrame(data)
# sort the dataframe by GDP in descending order
df = df.sort_values('gdp', ascending=False)
# plot the horizontal bar chart
plt.barh(df['country'], df['gdp'], color='blue')
# set the x-axis label
plt.xlabel('GDP')
# show the plot
plt.show()
Conversational answer: Sure! To create a horizontal bar chart showing the GDP of each country, we can use the 'gdp' column of the provided data frame. We'll sort the bars in descending order by GDP and color them blue.
Apart from contextual prompt engineering, retrieval augmentation and vector embeddings could be a solution for fixing LLM hallucinations. I am working on this and will update soon.
@aditikhare007 go for it, I'm super curious to see whether a significant improvement can be obtained with prompt engineering only! Keep us updated!
I was able to replicate this response but it fixed itself when I changed the task instruction in the PandasAI class to the following:
class PandasAI:
"""PandasAI is a wrapper around a LLM to make dataframes conversational"""
_task_instruction: str = """
Today is {today_date}.
You are provided with a pandas dataframe (df) with {num_rows} rows and {num_columns} columns.
This is the result of `print(df.head({rows_to_display}))`:
{df_head}.
Using the provided dataframe, df, return the python code (do not import anything) and make sure to prefix the requested python code with {START_CODE_TAG} exactly and suffix the code with {END_CODE_TAG} exactly to get the answer to the following question:
"""
Let me know if that also changes things on your end! I think removing any kind of statement that reassigns df to a dataframe object along with prompt engineering would ensure that this problem is resolved. The following could be helpful:
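import ast
# `code` holds the source string generated by the LLM
tree = ast.parse(code)
# Drop top-level statements of the form `df = <module>.DataFrame(...)`
tree.body = [node for node in tree.body if not (
    isinstance(node, ast.Assign) and
    isinstance(node.targets[0], ast.Name) and
    node.targets[0].id == "df" and
    isinstance(node.value, ast.Call) and
    isinstance(node.value.func, ast.Attribute) and
    node.value.func.attr == "DataFrame"
)]
# ast.unparse requires Python 3.9+
code_to_run = ast.unparse(tree)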
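To see what the filter above does in practice, here is a sketch that wraps it in a hypothetical helper, `strip_df_reassignment`, and applies it to a shortened version of the hallucinated code from earlier in this thread. The helper name and the sample string are illustrative only.

```python
import ast


def strip_df_reassignment(code: str) -> str:
    """Remove top-level `df = <module>.DataFrame(...)` statements from code."""
    tree = ast.parse(code)
    tree.body = [node for node in tree.body if not (
        isinstance(node, ast.Assign) and
        isinstance(node.targets[0], ast.Name) and
        node.targets[0].id == "df" and
        isinstance(node.value, ast.Call) and
        isinstance(node.value.func, ast.Attribute) and
        node.value.func.attr == "DataFrame"
    )]
    return ast.unparse(tree)  # requires Python 3.9+


generated = """\
import pandas as pd
data = {'country': ['United States', 'Italy'], 'gdp': [1745433788416, 6599626513]}
df = pd.DataFrame(data)
df = df.sort_values('gdp', ascending=False)
"""

print(strip_df_reassignment(generated))
```

The invented `data` dictionary is still defined after filtering, but nothing uses it, so the remaining statements operate on the df that was actually supplied. Note that the match is narrow: `df = DataFrame(...)` (imported by name) or `df = pd.read_csv(...)` would pass through unchanged, which is one reason to pair the filter with the prompt change.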
@victor-hugo-dc definitely a good fix! I suggest we implement both what you just shared and what @aditikhare007 is working on. Would you mind opening a PR for that?
#Loading CSV file
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
file_path ='https://raw.githubusercontent.com/gventuri/pandas-ai/main/examples/data/Loan%20payments%20data.csv'
df_in = pd.read_csv(file_path)
llm = OpenAI(api_token=OPENAI_API_KEY)
pandas_ai = PandasAI(llm, verbose=True)
response = pandas_ai.run(df_in, "How many loans are from men that have been paid off?")
Running PandasAI with openai LLM...
Code generated:
# Import pandas library
import pandas as pd
# Create dataframe from given data
df = pd.DataFrame({
'Loan_ID': ['xqd20160004', 'xqd20168902', 'xqd20166231', 'xqd20160005', 'xqd20160003'],
'loan_status': ['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'],
'Principal': [1000, 1000, 1000, 1000, 1000],
'terms': [15, 30, 15, 30, 30],
'effective_date': ['9/8/2016', '9/8/2016', '9/9/2016', '9/8/2016', '9/8/2016'],
'due_date': ['10/7/2016', '10/8/2016', '9/22/2016', '10/7/2016', '10/7/2016'],
'paid_off_time': ['9/25/2016 16:58', '9/14/2016 19:31', '9/23/2016 21:36', '9/22/2016 20:00', '10/7/2016 9:00'],
'past_due_days': [pd.NA, pd.NA, pd.NA, pd.NA, pd.NA],
'age': [28, 27, 50, 33, 33],
'education': ['Bechalor', 'college', 'college', 'Bechalor', 'High School or Below'],
'Gender': ['male', 'female', 'female', 'male', 'female']
})
# Count number of loans from men that have been paid off
num_loans = len(df[(df['Gender'] == 'male') & (df['loan_status'] == 'PAIDOFF')])
print(num_loans)
Conversational answer: Two loans have been paid off by men.
It looks like the generated code ran only on df_in.head(5). The answer is correct for the displayed rows but not for the complete dataset. When generating the code we pass df_head as an argument; is that the cause?
However, the code works well for the dataframe below.
df = pd.DataFrame({
"country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
"gdp": [21400000, 2940000, 2830000, 3870000, 2160000, 1350000, 1780000, 1320000, 516000, 14000000],
"happiness_index": [7.3, 7.2, 6.5, 7.0, 6.0, 6.3, 7.3, 7.3, 5.9, 5.0]
})
Can anyone try this and see whether they get the same error? Below is the link to my Google Colab.
https://colab.research.google.com/drive/1PF7PxRHmIusD2kf9bqCLIrhMZB18BF1r#scrollTo=GXeX1QghiC0F
Thanks
@amjadraza should have been fixed as of 0.2.13! Can you check it out?
Big shout out to @victor-hugo-dc for the updated prompt and the code to remove overridden df!
@gventuri, yes, it does work well with the OpenAI API and even feels faster. Thanks @victor-hugo-dc.
Hello, I am trying to understand the output of a test run.
Here is my code:
Here is the verbose output:
It looks like the LLM created a whole new data frame with completely different (and incorrect!) figures from the original data frame, even though I supplied it with a data frame.
Why did it create a new data frame? Is this hallucination?
Thank you!