Sinaptik-AI / pandas-ai

Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
https://pandas-ai.com
Other
11.7k stars 1.08k forks source link

Empty dataframe is generated in code execution stage, which result in empty chart #1257

Open xuyuanme opened 3 days ago

xuyuanme commented 3 days ago

System Info

The current version of pandasai being used: 2.2.7

🐛 Describe the bug

In some case LLM will generate code to retrieve different subset of a same dataframe. In the example below it's generating the "first_purchases" dataframe as well as "repurchases" dataframe. Then these two dataframes are merged for further analysis, which makes perfect sense:

# Assuming dfs[0] is the DataFrame we need to work with
df = dfs[0]

# Filter out first-time purchases and repurchases
first_purchases = df[df['buyer_purchase_sequence'] == 1][['buyer_uid', 'product_type']]
repurchases = df[df['buyer_purchase_sequence'] > 1][['buyer_uid', 'product_type']]

# Merge first purchases with repurchases to track the product type of the first purchase against repurchases
merged_df = pd.merge(repurchases, first_purchases, on='buyer_uid', suffixes=('_repurchase', '_first_purchase'))

In the code_execution phase, two filters are extracted: "buyer_purchase_sequence == 1" (for "first_purchases" dataframe) and "buyer_purchase_sequence > 1" (for "repurchases" dataframe).

However in pandasai/connectors/sql.py, this will concatenate a where clause: where buyer_purchase_sequence == 1 AND buyer_purchase_sequence > 1. This will result an empty dataframe in dfs, and eventually an empty chart.

I understand the design of filtering dataframe is for better system performance. But considering all possible cases, the "where clause" needs to use OR instead of AND to ensure all data gets loaded in dfs.

dosubot[bot] commented 3 days ago

Hey @xuyuanme! 👋 I'm here to help you with any bugs, questions, or contributions you have in mind. Let's tackle this together!

To address the issue where the where clause in pandasai/connectors/sql.py uses AND instead of OR, you need to modify the _build_query method. Specifically, replace the " AND ".join(condition_strings) with " OR ".join(condition_strings).

Here is the modified _build_query method:

def _build_query(self, limit=None, order=None):
    base_query = select("*").select_from(text(self.cs_table_name))
    if self.config.where or self._additional_filters:
        # conditions is the list of where + additional filters
        conditions = []
        if self.config.where:
            conditions += self.config.where
        if self._additional_filters:
            conditions += self._additional_filters

        query_params = {}
        condition_strings = []
        valid_operators = ["=", ">", "<", ">=", "<=", "LIKE", "!=", "IN", "NOT IN"]

        for i, condition in enumerate(conditions):
            if len(condition) == 3:
                column_name, operator, value = condition
                if operator in valid_operators:
                    self._validate_column_name(column_name)

                    condition_strings.append(f"{column_name} {operator} :value_{i}")
                    query_params[f"value_{i}"] = value

        if condition_strings:
            where_clause = " OR ".join(condition_strings)  # Changed from AND to OR
            base_query = base_query.where(
                text(where_clause).bindparams(**query_params)
            )

    if order:
        base_query = base_query.order_by(asc(text(order)))

    if limit:
        base_query = base_query.limit(limit)

    return base_query

This change ensures that the conditions in the where clause are combined using OR instead of AND, which will load all relevant data into dfs [1].

To continue talking to Dosu, mention @dosu.