databrickslabs / dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for tests, POCs, and other uses in Databricks environments, including in Delta Live Tables pipelines.
https://databrickslabs.github.io/dbldatagen

Upgrade this lib to be compatible with Spark Connect / DB Connect #255

Open MrPowers opened 7 months ago

MrPowers commented 7 months ago

Expected Behavior

This library works the same with Spark Connect.

Current Behavior

This library uses `sparkSession.sparkContext`, which doesn't work with Spark Connect. Here is an example: https://github.com/databrickslabs/dbldatagen/blob/debb29fc5d9da88b88fbcf12ba22ce24390ab062/dbldatagen/data_generator.py#L251. This particular call might actually work because the exception would be caught, but you get the idea.

Steps to Reproduce (for bugs)

Run the test suite with Spark Connect enabled and fix all issues.

ronanstokes-db commented 6 months ago

We recently released an update to handle situations where the Spark context is not available for querying things like default parallelism. This should address the issue.

In general, the way to safeguard against this is to explicitly specify the number of partitions requested when generating the specification for your dataset. This will avoid the query against the sparkContext.
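As a minimal sketch of that advice (not the library's actual internals), the snippet below passes `partitions` explicitly to `dg.DataGenerator` so the default parallelism never has to be looked up. The helper `resolve_partitions`, its `fallback` value, the `build_spec` wrapper, and the `example_data` name and `code` column are all hypothetical and exist only for illustration.

```python
def resolve_partitions(spark, requested=None, fallback=4):
    """Pick a partition count, preferring an explicit request (hypothetical helper)."""
    if requested is not None:
        # An explicit value means sparkContext is never touched.
        return requested
    try:
        # May raise on sessions that do not expose a sparkContext.
        return spark.sparkContext.defaultParallelism
    except Exception:
        # Assumed default for this sketch only.
        return fallback


def build_spec(spark, rows=1000, partitions=4):
    """Build a small data spec with an explicit partition count."""
    # Imported lazily so the helper above can be used without dbldatagen installed.
    import dbldatagen as dg

    return (
        dg.DataGenerator(spark, name="example_data", rows=rows, partitions=partitions)
        .withColumn("code", "int", minValue=1, maxValue=100, random=True)
    )
```

With a live session, `build_spec(spark, partitions=resolve_partitions(spark, requested=8)).build()` would then produce the DataFrame. This has not been verified against Spark Connect.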

While we have not tested against Spark Connect, we have tested against other environments where no sparkContext is available.