Sinaptik-AI / pandas-ai

Chat with your database (SQL, CSV, pandas, polars, MongoDB, NoSQL, etc.). PandasAI makes data analysis conversational using LLMs (GPT-3.5 / 4, Anthropic, VertexAI) and RAG.
https://pandas-ai.com

pandas-ai fails with polars.scan_parquet input #1344

Open cboettig opened 2 months ago

cboettig commented 2 months ago

System Info

python 3.11, pandas-ai 2.2.14, ubuntu 22.04

🐛 Describe the bug

I'm looking for a way to work with large parquet files without loading them entirely into RAM. This is well supported by duckdb and polars, both of which this project already uses, but I can't see how to do it through the pandas-ai interface. For instance, the straightforward approach using polars.scan_parquet():

# All necessary imports at the beginning
import pandas as pd
import polars as pl
from pandasai import Agent
from pandasai.llm import OpenAI

# Sample DataFrame, written to parquet so it can be scanned lazily
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})
df.to_parquet("test.parquet")

# Lazily scan the parquet file instead of reading it into RAM
lazy_df = pl.scan_parquet("test.parquet")

# Instantiate an LLM
llm = OpenAI(api_token="YOUR_API_TOKEN")

agent = Agent([lazy_df], config={"llm": llm})
agent.chat('Which are the 5 happiest countries?')

Gives the error:

   196         connectors.append(PolarsConnector({"original_df": df}))
    198     else:
--> 199         raise ValueError(
    200             "Invalid input data. We cannot convert it to a dataframe."
    201         )
    202 except ImportError as e:
    203     raise ValueError(
    204         "Invalid input data. We cannot convert it to a dataframe."
    205     ) from e

dosubot[bot] commented 2 months ago

To work with large parquet files via polars.scan_parquet without loading them entirely into RAM, and to avoid the error "Invalid input data. We cannot convert it to a dataframe.", you can modify the _load_df method in the PolarsConnector class to handle lazy loading with polars.scan_parquet. Here is how:

  1. Import polars at the beginning of the file.
  2. Modify the _load_df method to use polars.scan_parquet for parquet files.

Here is the updated code:

import polars as pl

class PolarsConnector(BaseConnector):
    # ... other methods ...

    def _load_df(self, df: Union[pl.LazyFrame, pl.DataFrame, pl.Series, str, dict]):
        """
        Load the dataframe from a file or a polars (lazy) frame.

        Args:
            df (Union[pl.LazyFrame, pl.DataFrame, pl.Series, str, dict]): The dataframe to load.
        """
        if isinstance(df, pl.Series):
            polars_df = df.to_frame()
        elif isinstance(df, (pl.DataFrame, pl.LazyFrame)):
            polars_df = df
        elif isinstance(df, str):
            if df.endswith(".parquet"):
                # Lazily scan parquet files instead of reading them eagerly
                polars_df = pl.scan_parquet(df)
            else:
                polars_df = FileImporter.import_from_file(df)
        elif isinstance(df, dict):
            try:
                polars_df = pl.DataFrame(df)
            except Exception as e:
                raise ValueError(
                    "Invalid input data. We cannot convert it to a dataframe."
                ) from e
        else:
            raise ValueError("Invalid input data. We cannot convert it to a dataframe.")

        if isinstance(polars_df, pl.LazyFrame):
            # Note: collect() materializes the full frame in memory
            polars_df = polars_df.collect()
        self.pandas_df = polars_df.to_pandas()

This modification ensures that when a parquet file or a LazyFrame is provided, it is handled via polars.scan_parquet, which defers reading until collect() is called. The collect() method then materializes the lazy frame so it can be converted into a pandas DataFrame.

To continue talking to Dosu, mention @dosu.

cboettig commented 2 months ago

The Dosubot is just annoying here: brute-force coercing a lazy polars scan of a parquet file into a pandas DataFrame defeats the purpose of using scan_parquet in the first place.