Open cboettig opened 2 months ago
To work with large parquet files using polars.scan_parquet without loading them entirely into RAM, and to avoid the error "Invalid input data. We cannot convert it to a dataframe," you need to modify the _load_df method in the PolarsConnector class to handle lazy loading with polars.scan_parquet. Here is how you can do it:
1. Import polars at the beginning of the file.
2. Modify the _load_df method to use polars.scan_parquet for parquet files.

Here is the updated code:
from typing import Union

import polars as pl

# BaseConnector and FileImporter are assumed to be imported as in the existing file.
class PolarsConnector(BaseConnector):
    # ... other methods ...

    def _load_df(self, df: Union[pl.DataFrame, pl.Series, str, dict]):
        """
        Load the dataframe from a file or polars dataframe.

        Args:
            df (Union[pl.DataFrame, pl.Series, str, dict]): The dataframe to load.
        """
        polars_df = None
        if isinstance(df, pl.Series):
            polars_df = df.to_frame()
        elif isinstance(df, pl.DataFrame):
            polars_df = df
        elif isinstance(df, str):
            if df.endswith('.parquet'):
                # Lazily scan parquet files instead of reading them eagerly.
                polars_df = pl.scan_parquet(df)
            else:
                polars_df = FileImporter.import_from_file(df)
        elif isinstance(df, dict):
            try:
                polars_df = pl.DataFrame(df)
            except Exception as e:
                raise ValueError(
                    "Invalid input data. We cannot convert it to a dataframe."
                ) from e
        else:
            raise ValueError("Invalid input data. We cannot convert it to a dataframe.")

        # scan_parquet returns a LazyFrame, which must be collected before to_pandas().
        if isinstance(polars_df, pl.LazyFrame):
            polars_df = polars_df.collect()
        self.pandas_df = polars_df.to_pandas()
This modification ensures that when a parquet file is provided, it is lazily loaded using polars.scan_parquet, which avoids loading the entire file into RAM. The collect() method is then used to materialize the lazy frame, which is converted to a pandas DataFrame when needed [1].
The Dosubot is just annoying here. Brute-force coercing a lazy polars connection to a parquet file into a pandas dataframe defeats the purpose of using scan_parquet in the first place.
System Info
python 3.11, pandas-ai 2.2.14, ubuntu 22.04
🐛 Describe the bug
I'm looking for a way to work with large parquet files without loading them entirely into RAM. This is incredibly well supported in packages like duckdb and polars, which are already used in this project, but I can't see how to get it to work in the interface. For instance, the simple approach using polars.scan_parquet():
Gives the error: