Open MrPowers opened 7 months ago
We recently released an update to deal with situations where the Spark context is not available to query things like the default parallelism. This should address the issue.
In general, the way to safeguard against this is to explicitly specify the number of partitions when generating the specification for your dataset. This avoids the query against the sparkContext entirely.
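A minimal sketch of why the explicit partition count sidesteps the problem. The classes and function names below are hypothetical stand-ins, not dbldatagen's actual API: the stand-in session mimics Spark Connect, where accessing `sparkContext` raises, and the resolver only touches the context when no explicit value was given.

```python
# Hypothetical sketch: when a partition count is supplied explicitly, the
# generator never needs sparkSession.sparkContext, so it also works in
# environments (like Spark Connect) where that attribute is unavailable.
# These are illustrative stand-ins, not dbldatagen classes.

class ConnectLikeSession:
    """Mimics a Spark Connect session: accessing sparkContext fails."""
    @property
    def sparkContext(self):
        raise AttributeError("sparkContext is not available over Spark Connect")

def resolve_partitions(spark, partitions=None, fallback=4):
    # Prefer the explicit value; only consult the context when we must.
    if partitions is not None:
        return partitions
    try:
        return spark.sparkContext.defaultParallelism
    except AttributeError:
        # No driver-side context (e.g. Spark Connect): use a fallback.
        return fallback

spark = ConnectLikeSession()
print(resolve_partitions(spark, partitions=8))  # explicit value wins -> 8
print(resolve_partitions(spark))                # no context -> fallback 4
```

The same shape applies to any driver-side setting: make it an explicit parameter with a sane default, and treat the sparkContext lookup as a best-effort optimization rather than a requirement.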
While we have not tested against Spark Connect, we have tested against other environments where no sparkContext is available.
Expected Behavior
This library works the same with Spark Connect.
Current Behavior
This library uses
sparkSession.sparkContext
which doesn't work with Spark Connect. Here is an example: https://github.com/databrickslabs/dbldatagen/blob/debb29fc5d9da88b88fbcf12ba22ce24390ab062/dbldatagen/data_generator.py#L251. This might actually work because the exception would be caught, but you get the idea.
Steps to Reproduce (for bugs)
Run the test suite with Spark Connect enabled and fix all issues.
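One way to enable Spark Connect locally for a test run is to point PySpark at a Connect endpoint via the `SPARK_REMOTE` environment variable; a session built without an explicit master then goes through Connect. This is an environment-setup sketch, and the server address, Spark version, and test path are assumptions, not values from this issue.

```shell
# Sketch: run the suite against a local Spark Connect server (paths/versions assumed).
# Start a Connect server from a Spark distribution (Spark 3.4+):
#   ./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.0

# Point PySpark sessions at the Connect endpoint instead of a local JVM driver:
export SPARK_REMOTE="sc://localhost:15002"

# Run the test suite; anything touching sparkSession.sparkContext should now surface.
pytest tests/
```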