PySpark DataFrame Support

Kanaries / pygwalker

PyGWalker: Turn your pandas dataframe into an interactive UI for visual analysis

https://kanaries.net/pygwalker

Apache License 2.0

PySpark DataFrame Support #189

Closed rishabmps closed 1 year ago

rishabmps commented 1 year ago

Native support for rendering visualizations of PySpark dataframes in Jupyter notebooks. It is OK to introduce some constraints if the sheer size of the dataframe makes it difficult to load.

longxiaofei commented 1 year ago

Hi, how large is your data (row count, column count, data size)?

pygwalker will support larger datasets (up to the size of your computer's memory) in the next version.

rishabmps commented 1 year ago

1M rows, approximately 200 columns (a mix of float and categorical values). The ideal user experience would be to pass a PySpark df directly to the walk function.

longxiaofei commented 1 year ago

Pygwalker will support pyspark within the next 4 versions.

longxiaofei commented 1 year ago

Version 0.3.3 already supports pyspark.dataframe, but Spark's computation process is not well suited to fetching the data needed to render charts.

If the Spark dataframe you end up needing to analyze doesn't exceed three times your machine's RAM, you can convert it to a pandas dataframe: df = spark_df.toPandas(),

then use DuckDB to do the calculations in pygwalker: pyg.walk(df, use_kernel_calc=True)
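The workaround above could be wrapped up roughly as follows. This is only a sketch, not pygwalker's own API: the helper names estimate_bytes, to_pandas_capped and walk_spark are hypothetical, the 8-bytes-per-value figure assumes float64 columns (categorical or string columns will differ), and the row cap is an assumed constraint rather than anything the library enforces. Only limit(), toPandas() and the pyg.walk(df, use_kernel_calc=True) call quoted in this thread are taken as given.

```python
def estimate_bytes(rows: int, cols: int, bytes_per_value: int = 8) -> int:
    """Rough in-memory size, assuming every value takes `bytes_per_value` bytes."""
    return rows * cols * bytes_per_value


def to_pandas_capped(spark_df, max_rows: int):
    """Collect at most `max_rows` rows of a Spark DataFrame into pandas."""
    # DataFrame.limit() and DataFrame.toPandas() are standard PySpark calls;
    # toPandas() pulls the whole (limited) result onto the driver.
    return spark_df.limit(max_rows).toPandas()


def walk_spark(spark_df, max_rows: int = 1_000_000):
    """Convert to pandas, then open pygwalker with kernel-side calculation."""
    # Imported lazily so the helpers above can be used without pygwalker.
    import pygwalker as pyg

    df = to_pandas_capped(spark_df, max_rows)
    # Call exactly as suggested in this thread.
    return pyg.walk(df, use_kernel_calc=True)


# The dataset described in this thread: 1M rows x 200 columns.
size_gb = estimate_bytes(1_000_000, 200) / 1e9
print(f"about {size_gb} GB if every value were a float64")
```

In a notebook this would be used as walk_spark(spark_df). Note that limit() keeps the first rows rather than a random sample, so it is a size guard, not a statistically representative subset.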

rishabh-dream11 commented 1 year ago

"Pygwalker will support pyspark within the next 4 versions." @longxiaofei Is this still planned?

longxiaofei commented 1 year ago

@rishabh-dream11 PySpark dataframes are currently supported, but PySpark is not well suited to this kind of interactive calculation.