MrPowers opened this issue 3 years ago
I was thinking about this, and also about #9 and #10, to give the project some structure.
The idea I had is to split the Scala part into two parts:
This will make the Python bebe depend only on the bebe functions. As for the build tool, I believe this can be solved with CI plus some tooling to make local development easier. In the end, the Python version is only a facade over the Scala one; they don't need to be mixed.
@alfonsorr - that organization sounds good to me. Sounds similar to what we have: `mrpowers.bebe` is for the Bebe typed API and `org.apache.spark.sql` is for the Bebe functions. Unless you have something else in mind?
I'll work on getting some basic CI set up (to just run the tests) and then we can tweak it and make it better. Sounds like you have a good CI vision that I don't completely understand yet, but I'm sure I'll love your vision when I fully grasp it!
@zero323 - we're building a project to expose the Spark functions that are in the SQL API but not in the Scala API. We'd also like to expose the "missing" PySpark functions.
We could even use this repo for stuff like the `CalendarIntervalType` that's missing from the PySpark API (if it doesn't get merged for whatever reason).
All the hard work to get these functions working is done already. We just need to expose them and make them easily accessible for end users. Let me know if you have any suggestions / comments on the best way to go about this. If not, then I can just try to study the source code and figure it out 🤓
@nchammas - we've added the functions that are in the SQL API, but missing in the Scala API, to this project. weekday is an example of a function that was added to Scala via bebe_weekday.
The next challenge is exposing these functions via PySpark. Perhaps the `bebe_weekday` PySpark function can be created as follows, like the other PySpark SQL functions.
```python
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def bebe_weekday(col):
    # Delegate to the JVM-side bebe function and wrap the result as a Python Column.
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.bebe_weekday(_to_java_column(col)))
```
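Once the bebe JAR is on the cluster and that wrapper is importable, usage would presumably look like ordinary PySpark; a minimal sketch, with the DataFrame and column names purely illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame with a date column; assumes the bebe JAR is already on the classpath.
df = spark.createDataFrame([("2021-03-15",)], ["some_date"])
df = df.withColumn("some_date", F.col("some_date").cast("date"))

# bebe_weekday here is the Python wrapper sketched above.
df.withColumn("weekday", bebe_weekday(F.col("some_date"))).show()
```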
Some specific questions:
Want to let PySpark users run `pip install bebe` and get easy access to all these functions.
Thanks for the help! Hopefully this will be relatively easy, I just have no idea how to go about this.
If you're publishing a Spark library, I would follow the lead of GraphFrames. Both Scala and Python code live in the same repo, and people load the library for use via `--packages`, which is the same regardless of language. No `pip`.
I don't know if structuring things so that people can `pip install bebe` would work better. It's certainly more natural for Python users, but I don't know if it would work as well as `--packages`.
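For reference, the `--packages` style can also be configured from Python code rather than the CLI; a minimal sketch, assuming a hypothetical Maven coordinate for bebe:

```python
from pyspark.sql import SparkSession

# The coordinate below is a placeholder - the real groupId/artifactId/version
# would come from wherever the Scala JAR gets published.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "com.github.mrpowers:bebe_2.12:0.0.1")
    .getOrCreate()
)
```

The command-line equivalent is passing the same coordinate to `spark-submit --packages`, which is how GraphFrames is typically loaded.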
@nchammas - Think we'll need to get this on PyPI so users can add it as a regular project dependency, right? pyspark and spark-testing-base are on PyPI, so it must be possible.
Would the --packages loading approach work in Databricks or is that a solution that'd only work on the command line?
`--packages` must work with Databricks since it's one of the oldest ways of loading additional libraries for Spark, but I haven't checked myself.
Publishing to PyPI should be fine. Perhaps spark-testing-base is a good model to follow then. I wonder how its setup differs from GraphFrames.
At least for GraphFrames, we know it's used as a runtime dependency (vs. just a testing dependency), so that might affect what publishing method you want to use. I'm not sure.
Will want to expose the "missing API functions" (e.g. regexp_extract_all) for the PySpark folks too.
Think we'll be able to follow this example and expose these relatively easily.
Something like this:
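A rough sketch, assuming the Scala side exposes `bebe_regexp_extract_all` on the same `functions` object and accepts column arguments for the string, the regexp, and the group index:

```python
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def bebe_regexp_extract_all(col, regexp, group_index):
    # Mirrors the bebe_weekday wrapper above; the JVM-side name and
    # argument types are assumptions until the Scala API is pinned down.
    sc = SparkContext._active_spark_context
    jc = sc._jvm.functions.bebe_regexp_extract_all(
        _to_java_column(col), _to_java_column(regexp), _to_java_column(group_index)
    )
    return Column(jc)
```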
Not sure how to build the wheel file or publish to PyPI from an SBT repo. My other PySpark projects, like quinn, are built with Poetry, which has built-in packaging / publishing commands.
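In case it helps, one low-tech option is to keep a tiny setuptools project for the Python facade next to the SBT build; a minimal sketch, with the name, version, and layout all assumed:

```python
# setup.py - minimal packaging sketch for the Python facade (hypothetical layout).
from setuptools import setup, find_packages

setup(
    name="bebe",
    version="0.0.1",            # placeholder version
    description="Python facade for the bebe Spark functions",
    packages=find_packages(),   # assumes the wrappers live in a bebe/ package
    install_requires=[],        # pyspark is typically treated as provided, not pinned
)
```

From there, `python setup.py bdist_wheel` builds the wheel and `twine upload dist/*` pushes it to PyPI, and CI could run those steps independently of the SBT publish.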