MrPowers / bebe

Filling in the Spark function gaps across APIs

Create PySpark interface #11

Open MrPowers opened 3 years ago

MrPowers commented 3 years ago

Will want to expose the "missing API functions" (e.g. regexp_extract_all) for the PySpark folks too.

Think we'll be able to follow this example and expose these relatively easily.

Something like this:

from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def regexp_extract_all(col, regexp, group_index):
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.regexp_extract_all(_to_java_column(col), _to_java_column(regexp), _to_java_column(group_index)))
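
For context, a rough usage sketch of what that would give PySpark users (assuming the wrapper above is in scope and the bebe jar is on the classpath; the sample data is made up):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("abc 123 xyz 456",)], ["text"])

# the pattern and group index are passed as literal Columns so _to_java_column accepts them
df.withColumn("numbers", regexp_extract_all(F.col("text"), F.lit(r"\d+"), F.lit(0))).show()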

Not sure how to build the wheel file or publish to PyPI from an SBT repo. My other PySpark projects, like quinn, are built with Poetry, which has built-in packaging / publishing commands.
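
One option (just a sketch, nothing decided): keep the Python facade in its own subdirectory with a plain setuptools build, so SBT keeps owning the jar and the wheel is built separately. Assuming a hypothetical python/ folder containing a bebe package, a minimal setup.py could look like:

# python/setup.py -- hypothetical layout, independent of the SBT build
from setuptools import setup, find_packages

setup(
    name="bebe",
    version="0.0.1",
    description="PySpark facade for the bebe Spark functions",
    packages=find_packages(),
    python_requires=">=3.6",
    # pyspark is deliberately not listed in install_requires so the package
    # also works on clusters that already ship their own Spark
)

From there, python -m pip wheel . (or Poetry, if we'd rather mirror the quinn setup) could produce the wheel, and twine could handle the PyPI upload.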

alfonsorr commented 3 years ago

I was thinking about this, and also #9 and #10, to give the project some structure.

The idea I had is to split the Scala part into two parts:

This will make the py bebe depend only on the bebe functions. As for the build tool, I believe this can be solved with CI, plus some tooling to make local development easier. But in the end, the Python version is only a facade over the Scala one; they don't need to be mixed.

MrPowers commented 3 years ago

@alfonsorr - that organization sounds good to me. Sounds similar to what we have: mrpowers.bebe is for the Bebe typed API and org.apache.spark.sql is for the Bebe functions. Unless you have something else in mind?

I'll work on getting some basic CI set up (just to run the tests) and then we can tweak it and make it better. Sounds like you have a good CI vision that I don't completely understand yet, but I'm sure I'll love it once I fully grasp it!

MrPowers commented 3 years ago

@zero323 - we're building a project to expose the Spark functions that are in the SQL API but not in the Scala API. We'd also like to expose the "missing" PySpark functions.

We could even use this repo for stuff like the CalendarIntervalType that's missing from the PySpark API (if it doesn't get merged for whatever reason).

All the hard work to get these functions working is already done. We just need to expose them and make them easily accessible for end users. Let me know if you have any suggestions / comments on the best way to go about this. If not, then I can just try to study the source code and figure it out 🤓

MrPowers commented 3 years ago

@nchammas - we've added the functions that are in the SQL API, but missing in the Scala API, to this project. weekday is an example of a function that was added to Scala via bebe_weekday.

The next challenge is exposing these functions via PySpark. Perhaps the bebe_weekday PySpark function can be created as follows, like the other PySpark SQL functions.

from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def bebe_weekday(col):
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.bebe_weekday(_to_java_column(col)))
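
As a sanity check, this wrapper should give the same result as today's workaround of going through the SQL parser with expr. A rough comparison sketch (assuming the bebe jar is attached and the bebe_weekday wrapper above is in scope):

import datetime
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(datetime.date(2021, 3, 1),)], ["some_date"])

# current workaround: the SQL function is reachable only through expr
df.withColumn("day_of_week", F.expr("weekday(some_date)")).show()

# with the wrapper, it becomes a normal Python function call taking a Column
df.withColumn("day_of_week", bebe_weekday(F.col("some_date"))).show()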

Some specific questions:

  1. Do you think we should put the Python code directly in this repo?
  2. How should we publish the wheel files to PyPI?

Want to let PySpark users run pip install bebe and get easy access to all these functions.

Thanks for the help! Hopefully this will be relatively easy, I just have no idea how to go about this.

nchammas commented 3 years ago

If you're publishing a Spark library, I would follow the lead of GraphFrames. Both Scala and Python code live in the same repo, and people load the library for use via --packages, which is the same regardless of language. No pip.
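
For reference, the --packages route doesn't even require the command line; from Python it's just a config on the session builder (the Maven coordinate below is a made-up placeholder, since nothing is published yet):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # pulls the jar (and its dependencies) from Maven at session startup; no pip involved
    .config("spark.jars.packages", "com.github.mrpowers:bebe_2.12:0.0.1")
    .getOrCreate()
)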

I don't know if structuring things so that people can pip install bebe would work better. It's certainly more natural for Python users, but I don't know if it would work as well as --packages.

MrPowers commented 3 years ago

@nchammas - Think we'll need to get this on PyPI so users can add it as a regular project dependency, right? pyspark and spark-testing-base are on PyPI, so it must be possible.

Would the --packages loading approach work in Databricks, or is that a solution that'd only work from the command line?

nchammas commented 3 years ago

--packages must work with Databricks since it's one of the oldest ways of loading additional libraries for Spark, but I haven't checked myself.

Publishing to PyPI should be fine. Perhaps spark-testing-base is a good model to follow then. I wonder how its setup differs from GraphFrames.

At least for GraphFrames, we know it's used as a runtime dependency (vs. just a testing dependency), so that might affect what publishing method you want to use. I'm not sure.