Sinaptik-AI / pandas-ai

Chat with your database (SQL, CSV, pandas, polars, mongodb, noSQL, etc). PandasAI makes data analysis conversational using LLMs (GPT 3.5 / 4, Anthropic, VertexAI) and RAG.
https://pandas-ai.com
Other
11.69k stars 1.08k forks

Support for Pyspark dataframe #1103

Open rishabh-dream11 opened 2 months ago

rishabh-dream11 commented 2 months ago

🚀 The feature

PySpark is widely used in the community for ETL work on large datasets. Supporting it would increase adoption of the product.

Motivation, pitch

My org uses PySpark as its only framework for ETL. EDA is done by visualising various cuts of the same PySpark DataFrame.

Alternatives

No response

Additional context

No response

gventuri commented 2 months ago

This would be an interesting addition. I'm not sure how easy it would be to add PySpark support in the current setup, but it's definitely worth exploring. If I understand correctly, you would like to use PySpark as the engine? Or do you just want to be able to provide a Spark DataFrame as input?

rishabh-dream11 commented 2 months ago

A PySpark engine, and it has to support a Spark DataFrame as input.

rishabh-dream11 commented 1 month ago

@gventuri Is there any progress/discussion on this issue? Will this be considered for future releases?

ssling0817 commented 4 weeks ago

@gventuri I am also wondering whether it can execute PySpark code. Querying a large table took too long. Alternatively, is there a workaround to swap the code for PySpark code inside the pipeline?
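
Until something like this is supported, one possible stopgap (a sketch, not a PandasAI feature) is to shrink the Spark DataFrame on the cluster and hand only a small pandas copy to PandasAI. The helper name `spark_to_pandas_subset` and its parameters are made up for illustration; the only PySpark calls it relies on are `DataFrame.select`, `DataFrame.limit`, and `DataFrame.toPandas`. Note that `limit` takes an arbitrary subset of rows, not a random sample, so this only suits exploratory questions.

```python
def spark_to_pandas_subset(spark_df, max_rows=10_000, columns=None):
    """Shrink a Spark DataFrame on the cluster, then collect it as pandas.

    `spark_df` is assumed to be a pyspark.sql.DataFrame. Only its
    `select`, `limit`, and `toPandas` methods are used.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    if columns:
        # Column pruning happens in Spark, before anything reaches the driver.
        spark_df = spark_df.select(*columns)
    # limit() caps how many rows toPandas() pulls into driver memory.
    return spark_df.limit(max_rows).toPandas()


# Hypothetical usage (needs pyspark and a SparkSession named `spark`):
#   sdf = spark.read.parquet("events.parquet")  # hypothetical path
#   pdf = spark_to_pandas_subset(sdf, max_rows=5_000,
#                                columns=["user_id", "amount"])
# `pdf` is then a plain pandas DataFrame.
```

This does not make PandasAI run on Spark. It only keeps the collected data small enough to query quickly, so answers reflect the subset rather than the full table.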