Sinaptik-AI / pandas-ai

Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
https://pandas-ai.com

Pandas AI not working with Azure Open AI GPT-4 version #458

Closed rucha80 closed 3 months ago

rucha80 commented 1 year ago

🐛 Describe the bug

Code

    import pandas as pd
    import os

    from pandasai import SmartDataframe
    from pandasai.llm import AzureOpenAI

    df = pd.read_csv('search_data_v3.csv')

    os.environ['OPENAI_API_KEY'] = ""
    os.environ['OPENAI_API_BASE'] = ""

    deployment_name = ""

    llm = AzureOpenAI(
        deployment_name=deployment_name,
        api_version="",
        is_chat_model=True,  # Comment in if you deployed a chat model
    )

    df = SmartDataframe(df, config={"llm": llm})
    response = df.chat("How many rows in data?")
    print(response)

When I run the above code, I get the error below.

Output

Unfortunately, I was not able to answer your question, because of the following error:

The completion operation does not work with the specified model, gpt-4. Please choose different model and try again. You can learn more about which models can be used with each operation here: https://go.microsoft.com/fwlink/?linkid=2197993.

gventuri commented 1 year ago

Hi @rucha80, thanks a lot for reporting. I don't have access to the Azure OpenAI APIs, but I think the api_version is mandatory. Let me know if that fixes it!
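
For reference, a minimal sketch of passing api_version explicitly when instantiating the LLM (all values here are placeholders, not real credentials):

```python
from pandasai.llm import AzureOpenAI

# Placeholder values; substitute your own Azure OpenAI deployment details
llm = AzureOpenAI(
    api_token="<your-azure-openai-key>",
    api_base="https://<your-resource>.openai.azure.com/",
    api_version="2023-07-01-preview",        # explicit API version
    deployment_name="<your-gpt-4-deployment>",
    is_chat_model=True,                      # GPT-4 deployments are chat models
)
```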

mspronesti commented 1 year ago

@gventuri @rucha80 I'll have a look after returning from my vacations :)

mspronesti commented 1 year ago

@rucha80 This issue does not seem to be related to PandasAI but to Azure OpenAI itself. Here's an MRE completely unrelated to PandasAI:

import openai

openai.api_type = 'azure'
openai.api_version = '2023-07-01-preview' # I also tried with older versions, to no avail
openai.api_key = ... # your API key here 
openai.api_base = ... # your API endpoint base here
deployment_name = 'gpt-4-8192' # this is the name I used for my deployment

prompt = "What is 2+2?"
openai.Completion.create(engine=deployment_name, prompt=prompt)

which yields the following InvalidRequestError:

InvalidRequestError: The completion operation does not work with the specified model, gpt-4-8192. Please choose different model and try again. You can learn more about which models can be used with each operation here: https://go.microsoft.com/fwlink/?linkid=2197993.

Nevertheless, I'd like to point out that the models from the gpt-4 and gpt-3.5 families perform much better with the ChatCompletion API, so I encourage you to always set is_chat_model to True when using them.
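
For comparison, a minimal sketch of the same request going through the ChatCompletion operation of the pre-1.0 openai SDK used above (the deployment name is a placeholder for a chat-model deployment):

```python
import openai

openai.api_type = 'azure'
openai.api_version = '2023-07-01-preview'
openai.api_key = ...   # your API key here
openai.api_base = ...  # your API endpoint base here
deployment_name = 'gpt-4-8192'  # placeholder: your chat-model deployment name

# Chat models must be called through ChatCompletion, not Completion,
# which is what the InvalidRequestError above complains about.
response = openai.ChatCompletion.create(
    engine=deployment_name,
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(response["choices"][0]["message"]["content"])
```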

rucha80 commented 1 year ago

    llm = AzureOpenAI(
        api_token=os.environ['OPENAI_API_KEY'],
        api_base=os.environ['OPENAI_API_BASE'],
        api_version=os.environ["OPENAI_API_VERSION"],
        deployment_name='',
        temperature=0,
        model='gpt-4',
        api_type=os.environ['OPENAI_API_TYPE'],
        is_chat_model=True,
    )

It works when I pass is_chat_model=True.

However, now I am not able to run PandasAI at all; it gets stuck. I restarted the kernel and deleted the pandasai log. Are there any files we need to update?

Below is the screenshot where it is getting stuck.

(screenshot)

mspronesti commented 1 year ago

Can you share a minimal reproducible example? I tried Azure OpenAI's GPT-4 again with one of the examples available in the examples folder, and it works for me.

rucha80 commented 1 year ago

It is working for me as well; maybe it was some mistake on my side. Also, I need the Python code that pandasai is generating. In an earlier pandasai version, if I passed show_code=True, it returned the code. Is it possible to get the code as output as well, so that it can be stored in a variable?

mspronesti commented 1 year ago

I don't think we have ever had a show_code parameter. It was verbose, and you can still use it as a configuration parameter as follows:

df = SmartDataframe(df, config={"llm": llm, "verbose": True})

Alternatively, you can use a callback to export the code (e.g. to a file):

from pandasai.callbacks import FileCallback

df = SmartDataframe(df, {"callback": FileCallback("output.py")})
rucha80 commented 1 year ago

Thank you for the reply. When I ask for the number of rows in the data, it just prints the number. I want the answer to be conversational. In an earlier version there was an option to set is_conversational = True. Do you still have an option like that?

mspronesti commented 1 year ago

I believe we don't, as explained here.

rucha80 commented 1 year ago

Sometimes the pandas_ai log file and output.py are not updated right away; there is a lag before the output.py file is refreshed. I want to get the Python code from the output.py file as soon as I get the response from the LLM. Could you help me with that?

kchusam commented 1 year ago

I was using pandasai v0.8.4 for many weeks, and there used to be a show_code=True argument in the run method of the PandasAI class, as well as a verbose=True argument. show_code=True really showed the lines of code generated, and verbose showed more things (I don't quite remember anymore). Now I have updated to the latest version of pandasai, and the run method of the PandasAI class still has the show_code=True argument, but it generates exactly the same output as verbose=True in SmartDataframe. Is this behaviour desired? What I really miss is show_code=True from the older version. The screenshot below is from pandasai 1.1.3. (screenshot)

rucha80 commented 1 year ago

@gventuri Can I know the exact response time PandasAI takes, excluding the LLM response time? I want to know the response time of the LLM and the response time of PandasAI separately.

ronaldsholt commented 1 year ago

Adding more to the Azure problems: AzureOpenAI is inconsistent with SDL (SmartDatalake) and SDF (SmartDataframe). It also fails intermittently when adjusting for custom prompting.

I've tested AzureOpenAI with GPT-35-turbo and GPT-4, and here are the responses/errors:

  1. "Unfortunately, I was not able to answer your question, because of the following error:

single positional indexer is out-of-bounds"

  1. "Unfortunately, I was not able to answer your question, because of the following error:

No code in the log file.

Interestingly, there are no issues using OpenAI (from pandasai) with both GPT-35-turbo and GPT-4, whether using SDL or SDF, with or without custom prompts.

AzureOpenAI, on the other hand, will definitely not use custom prompts, nor will it reliably handle calling .chat().

I'm using v1.2.1 - please help!

mspronesti commented 1 year ago

@ronaldsholt can you please provide a meaningful reproducible example? If you use a custom file, please provide that as well.

ronaldsholt commented 1 year ago

@mspronesti Sure. -- TBH, today, I'm getting "single positional indexer is out-of-bounds" for everything I try : /

  1. SDF + AzureOpenAI
    
    import pandas as pd
    import pandasai
    print(pandasai.__version__)
    from pandasai import PandasAI
    from pandasai.llm.azure_openai import AzureOpenAI
    from pandasai import SmartDataframe, SmartDatalake

    # sample data
    df = pd.DataFrame({
        "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
        "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
        "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
    })

    df2 = pd.DataFrame({
        "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
        "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
        "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
    })

    AZURE_OAI_KEY = ""
    AZURE_OAI_MODEL = "gpt-35-turbo"
    AZURE_OPENAI_ENDPOINT = ""
    AZURE_OPENAI_VERSION = ""

    # Instantiate a LLM
    llm = AzureOpenAI(
        api_token=AZURE_OAI_KEY,
        deployment_name=AZURE_OAI_MODEL,
        api_base=AZURE_OPENAI_ENDPOINT,
        api_version=AZURE_OPENAI_VERSION
    )

    sdf = SmartDataframe(df, name="data", config={
        "llm": llm,
        "is_chat_model": True,
        "enable_cache": False,
        "max_retries": 10,
        "use_error_correction_framework": True,
        "verbose": True,
        "enforce_privacy": True}
    )

    pandas_ai = PandasAI(llm)

    pandas_ai(sdf, prompt='Which are the 5 happiest countries?')

    response = sdf.chat("Which are the 5 happiest countries?")
    print(response)

Response 1:

1.2.2
2023-09-15 12:34:54 [INFO] Question: Which are the 5 happiest countries?
2023-09-15 12:34:54 [INFO] Running PandasAI with azure-openai LLM...
2023-09-15 12:34:54 [INFO] Prompt ID: 79232624-cf74-4f5f-b46f-2c0f85e8e525
Unfortunately, I was not able to answer your question, because of the following error:

single positional indexer is out-of-bounds


2. SDL + AzureOpenAI:
```python
# Instantiate a LLM
llm = AzureOpenAI(
                    api_token=AZURE_OAI_KEY,
                    deployment_name=AZURE_OAI_MODEL,
                    api_base=AZURE_OPENAI_ENDPOINT,
                    api_version=AZURE_OPENAI_VERSION
)

sdf = SmartDatalake([df,df2], config={
                    "llm": llm,
                    "is_chat_model": True,
                    "enable_cache": False,
                    "max_retries": 10,
                    "use_error_correction_framework": True,
                    "verbose": True,
                    "enforce_privacy": True}
)

# pandas_ai = PandasAI(llm)
# pandas_ai(sdf, prompt='Which are the 5 happiest countries?')

response = sdf.chat("Which are the 2 happiest countries?") #changed incase silent caching
print(response)
```

Response 2:

1.2.2
2023-09-15 12:40:19 [INFO] Question: Which are the 5 happiest countries?
2023-09-15 12:40:19 [INFO] Running PandasAI with azure-openai LLM...
2023-09-15 12:40:19 [INFO] Prompt ID: 799d0383-1dda-4ffe-92c8-1f8a7f5da820
Unfortunately, I was not able to answer your question, because of the following error:

single positional indexer is out-of-bounds

3. SDF + AzureOpenAI + Custom Prompting:

    # Custom Prompts Integration
    class CustomAnalysisPrompt(Prompt):
        text = """
        You are an expert python coder that specializes in writing pandas code. You are also an expert data analyst. Here's the metadata to analyze for the given pandas DataFrames:

        {dataframes}

        ```python
        # TODO import all the dependencies required
        import pandas as pd

        # Given this data, please follow these steps:
        # 0. CAPITALIZE all indexible data for filtering! NOTE: "Q1" = quarter 1, "Q2" = quarter 2, "Q3" = quarter 3, and "Q4" = quarter 4, and this can be found in the dfs["quarter"] column. EXAMPLE = (dfs[0]['quarter'] == 'Q2')];
        # 1. **Data Analysis**: Dive deep into the data to identify patterns in order to answer the question with supporting context.
        # 2. **Opportunity Identification**: If asked, explain and provide specific metrics in a sentence if available.
        # 3. **Reasoning**: Study the data and explain why these results are happening.
        # 4. **Recommendations**: If asked, suggest actionable steps.
        # 5. **Output**: Return a dictionary with:
        #    - type (possible values: "text", "number", "dataframe")
        #    - value (can be a string, a dataframe, or the path of the plot, NOT a dictionary)
        # Example: {{ "type": "text", "value": "In US, the average sales are $4000, indicating a potential market X." }}

        def analyze_data(dfs: list[pd.DataFrame]) -> dict:
            # Code goes here (do not add comments)

        # Declare a result variable
        result = analyze_data(dfs)

        Using the provided dataframes (`dfs`), update the Python code based on the user's query:
        {conversation}

        # Updated code:
        # """

sdf = SmartDatalake([df,df2], config={ "llm": llm, "is_chat_model": True, "enable_cache": False, "max_retries": 10, "use_error_correction_framework": True, "custom_prompts": {"generate_python_code": CustomAnalysisPrompt()}, "verbose": True, "enforce_privacy": True} )

pandas_ai = PandasAI(llm)

pandas_ai(sdf, prompt='Which are the 5 happiest countries?')

response = sdf.chat("Which are the 2 happiest countries? Explain.") print(response)


Response 3: Same response. 

4. Running the exact same code (with or without the custom prompt) on OpenAI gives really good-quality responses, with either 3.5-turbo or 4.

Notes: I also experimented with:

```python
sdf = SmartDatalake([df, df2], config={"llm": llm})
```

for both SDF and SDL, and oddly, it would sometimes work once or twice and then go back to "single positional indexer is out-of-bounds".

Hope this helps. Really hoping to be able to use Azure for this!

mspronesti commented 1 year ago

@ronaldsholt Thanks for the MRE. It looks like this has nothing to do with AzureOpenAI but with enforce_privacy. I've just tried with both OpenAI and AzureOpenAI using your MRE and I get the same error. Can you try again with the "standard" OpenAI LLM and enforce_privacy set to True? Also, can you try again with AzureOpenAI without enforcing privacy?

On a different note, please notice that is_chat_model is not a config parameter but an LLM parameter, and as such you need to pass it to AzureOpenAI. Since OpenAI has now marked the Completion API as "legacy", I believe I will open a PR to default it to True.
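
In other words, a minimal sketch (reusing the placeholder credential variables from the snippets above):

```python
# is_chat_model belongs on the LLM, not in the SmartDataframe/SmartDatalake config
llm = AzureOpenAI(
    api_token=AZURE_OAI_KEY,
    api_base=AZURE_OPENAI_ENDPOINT,
    api_version=AZURE_OPENAI_VERSION,
    deployment_name=AZURE_OAI_MODEL,
    is_chat_model=True,
)

sdf = SmartDatalake([df, df2], config={"llm": llm, "verbose": True})
```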

ronaldsholt commented 1 year ago

@mspronesti Sure here ya go:


# Same df's above 

llm = OpenAI(api_token=OPENAI_API_KEY)

sdf = SmartDatalake([df,df2], config={
                    "llm": llm,
                    "is_chat_model": True,
                    "enable_cache": False,
                    "max_retries": 10,
                    "use_error_correction_framework": True,
                    "custom_prompts": {"generate_python_code": CustomAnalysisPrompt()},
                    "verbose": True,
                    "enforce_privacy": True}
)

# pandas_ai = PandasAI(llm)
# pandas_ai(sdf, prompt='Which are the 5 happiest countries?')

response = sdf.chat("Which are the 2 happiest countries? Explain.")

Sure enough, with enforce_privacy and is_chat_model set, this reproduces the same output: "single positional indexer is out-of-bounds".

And when I remove them, it seems to function well (using OpenAI).

llm = OpenAI(api_token=OPENAI_API_KEY)

sdf = SmartDatalake([df,df2], config={
                    "llm": llm,
                    "enable_cache": False,
                    "max_retries": 10,
                    "use_error_correction_framework": True,
                    "custom_prompts": {"generate_python_code": CustomAnalysisPrompt()},
                    "verbose": True}
)

# pandas_ai = PandasAI(llm)
# pandas_ai(sdf, prompt='Which are the 5 happiest countries?')

response = sdf.chat("Which are the 2 happiest countries? Explain.")
print(response)

Response: "The two happiest countries are Canada and Canada. These countries have high happiness indexes, indicating that their citizens are generally satisfied with their lives."

Which is cool because the custom prompt works nicely! Alas, I need this with Azure!!

ronaldsholt commented 1 year ago

@mspronesti removing enforce_privacy and is_chat_model for AzureOpenAI, on v1.2.1, still results in "Unfortunately, I was not able to answer your question, because of the following error:

No code found in the response" -- FYI

mspronesti commented 1 year ago

@ronaldsholt I'm really not managing to reproduce your error. Let me paste my snippet here (it's actually a slightly modified version of yours):

import pandas as pd
import pandasai
print(pandasai.__version__)
from pandasai.llm import AzureOpenAI, OpenAI
from pandasai import SmartDataframe

# sample data
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})

llm = AzureOpenAI(
    api_version="2023-07-01-preview",
    api_token=AZURE_OPENAI_TOKEN,
    api_base=AZURE_OPENAI_BASE,
    deployment_name=DEPLOYMENT_NAME,
    is_chat_model=True # <--- NOTICE!
)

sdf = SmartDataframe(df, config={
                    "llm": llm,
                    "enable_cache": False,
                    "verbose": True,
                    "enforce_privacy": False # <--- NOTICE!
                    }
)

response = sdf.chat("Which are the 5 happiest countries?")
print(response)

I've run this with the following models:

They all produce meaningful results for me. Which of these models are you using?

ronaldsholt commented 1 year ago

@mspronesti gpt-35-0515, gpt-35-turbo-16k-0613, and gpt-4 (gpt-4 only on OpenAI). I was previously using 1.0.11 and am now on 1.2.1.

Ok, so using your example with gpt-35-0515 (which, to be fair, was the most finicky), this seems to work. I'm not sure why moving those params around makes it work better, but it could be worth adding to the docs. Do you know if it is model-dependent?

# Instantiate a LLM
llm = AzureOpenAI(
                    api_token=AZURE_OAI_KEY,
                    deployment_name=AZURE_OAI_MODEL,
                    api_base=AZURE_OPENAI_ENDPOINT,
                    api_version=AZURE_OPENAI_VERSION,
                    is_chat_model=True # <--- NOTICE!
)

sdf = SmartDataframe(df, config={
                    "llm": llm,
                    "enable_cache": False,
                    "verbose": True,
                    "enforce_privacy": False # <--- NOTICE!
                    }
)

BUT @mspronesti the biggest kicker here is that the custom prompt fails with "Unfortunately, I was not able to answer your question, because of the following error: No code found in the response".

sdf = SmartDataframe(df, config={
                    "llm": llm,
                    "enable_cache": False,
                    "verbose": True,
                    "custom_prompts": {"generate_python_code": CustomAnalysisPrompt()},
                    "enforce_privacy": False # <--- NOTICE!
                    }
)

Is it a function of Azure's ChatCompletion? Or can you trace it to something else?

rucha80 commented 1 year ago

@gventuri @mspronesti I am working on an application where I am sending a very large dataset, so the response time and token size are increasing. To handle that, I was thinking of sending only a sample of about 100 rows, getting the code generated by the LLM, and then running that code locally on my full data. To do this, I would like to understand the PandasAI architecture, specifically where you call the LLM and get the code back, so that I can work around it to execute my logic. Could you help me with that?

gventuri commented 1 year ago

@rucha80 that's exactly what PandasAI does under the hood: it never sends the whole dataframe, but finds 5 meaningful (and anonymized) sample rows and sends them to the LLM.

You can even pass a custom sample as you instantiate the smart dataframe, like this:

sample_df = pd.DataFrame(...)
sdf = SmartDataframe(your_df, sample_head=sample_df)
sdf.chat("your query")
devnaelson commented 6 months ago

I ran into this problem too; I did not understand why Azure is not as good at answering prompts. Switch to the OpenAI API and you will see the result.