databrickslabs / dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for testing, POCs, and other uses in Databricks environments, including Delta Live Tables pipelines.
https://databrickslabs.github.io/dbldatagen

Compatibility Issue: dbldatagen DataAnalyzer Not Accepting Spark Connect DataFrame #260

Open npiesco opened 3 months ago

npiesco commented 3 months ago

Expected Behavior

Current Behavior

Steps to Reproduce (for bugs)

import dbldatagen as dg
import pandas as pd
from pyspark.sql import DataFrame

dfSource = spark.sql("SELECT * FROM db.schema.table LIMIT 10000")
df = dfSource

analyzer = dg.DataAnalyzer(sparkSession=spark, df=dfSource)
generatedCode = analyzer.scriptDataGeneratorFromData()
# The analyzer code above fails with:
# AssertionError: sourceDf must be a valid Pyspark dataframe

print(generatedCode)  # not reached

print(isinstance(df, DataFrame))  # Output: False
print(type(df))  # Output: <class 'pyspark.sql.connect.dataframe.DataFrame'> not pyspark.sql.dataframe.DataFrame

# Attempted workaround: round-trip through pandas to obtain a classic DataFrame
pdf = df.toPandas()
df_traditional = spark.createDataFrame(pdf)

print(isinstance(df_traditional, DataFrame))  # Output: False
print(type(df_traditional))  # Output: <class 'pyspark.sql.connect.dataframe.DataFrame'> not pyspark.sql.dataframe.DataFrame
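
The type checks above point at the cause: the session returns `pyspark.sql.connect.dataframe.DataFrame` objects, which fail the `isinstance` check against `pyspark.sql.DataFrame` in this environment. As a rough illustration only, a check that accepts either class might look like the sketch below. The helper name `is_spark_dataframe` and the constant `SPARK_DATAFRAME_MODULES` are hypothetical and not part of dbldatagen; the sketch matches on module and class name so that it runs without importing PySpark.

```python
# Module paths of the two DataFrame classes that appear in this issue.
SPARK_DATAFRAME_MODULES = {
    "pyspark.sql.dataframe",          # classic DataFrame
    "pyspark.sql.connect.dataframe",  # Spark Connect DataFrame
}


def is_spark_dataframe(obj) -> bool:
    """Hypothetical helper: True if obj is a classic or Spark Connect DataFrame."""
    # Walk the MRO so that subclasses of either DataFrame class are accepted too.
    return any(
        cls.__name__ == "DataFrame" and cls.__module__ in SPARK_DATAFRAME_MODULES
        for cls in type(obj).__mro__
    )
```

With the `df` from the reproduction, `is_spark_dataframe(df)` would be expected to return `True`, while a pandas DataFrame (module `pandas.core.frame`) would be rejected.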

Context

Your Environment

chris2shehu commented 2 months ago

Having the same issue. We're unable to find a workaround.

ronanstokes-db commented 1 month ago

Thanks for raising this. We are working on preparing a new release with a number of feature updates and will look to incorporate a fix for this into the new release.

As a short-term workaround, we'll relax this check to a warning.
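
For readers unfamiliar with the pattern, relaxing a hard type check to a warning could look roughly like the sketch below. This is not the actual dbldatagen change: the function `check_source_df` and its `expected_type` and `strict` parameters are hypothetical, and the expected class is passed in by the caller so the sketch runs without PySpark installed.

```python
import warnings


def check_source_df(df, expected_type, strict=False):
    """Hypothetical check: warn, or raise when strict=True, on an unexpected DataFrame type."""
    if not isinstance(df, expected_type):
        message = (
            f"sourceDf is of type {type(df).__module__}.{type(df).__name__}, "
            f"expected {expected_type.__name__}"
        )
        if strict:
            # Old behaviour: reject anything that is not the expected class.
            raise TypeError(message)
        # Relaxed behaviour: report the mismatch but let the caller continue.
        warnings.warn(message)
    return df
```

In a notebook this might be called as `check_source_df(dfSource, pyspark.sql.DataFrame)`, in which case a Spark Connect DataFrame would trigger a warning rather than stopping the analysis.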

ronanstokes-db commented 1 month ago

The current hotfix relaxes this check to a warning. The hotfix was released today.

chris2shehu commented 1 month ago

Thanks guys! Can confirm it's working for our team now. Keep up the great work!