Open wchau opened 6 months ago
Thanks for filing the issue!
There is already an experimental Spark DataFrame
support:
We're planning to make it as official API in the next release in ~2 months (including adding documentation for this API).
Any feedback on the this API is welcome.
Hey Vadym,
Is there a reason why you use RDD instead of DataFrame
directly? I think the QueryBuilder tries to convert to RDD, which Spark Connect does not support.
Thanks!
Ah, I see, Spark Connect doesn't have RDD at all.
The main reason why RDD
is used is that the direct handling of Spark DataFrame
is not yet impelemented. The PipelineDP main logic is pretty agnostic to the input colllection type. So it should be reasonable simple to extend to DataFrames
. I'll check how it can be done.
Feature Description
Spark Connect Support
Is your feature request related to a problem?
In Spark Connect, RDD is not supported, so PipelineDP does not work. See https://github.com/apache/spark/blob/master/python/pyspark/sql/connect/dataframe.py
What alternatives have you considered?
N/A
Additional Context
Add any other context or screenshots about the feature request here.