fyi I haven't made a test suite yet, but I am testing with this repo: https://github.com/SimpleDataLabsInc/rj-test-spark-submit, which contains both project-level and pipeline-level Maven dependencies. Will add one soon... but maybe that's better done after this PR.
The pom.xml is valid except for the {{SPARK_VERSION}} placeholder, which remains unresolved if SPARK_VERSION is not set at runtime.
Hi @jainabhinav, can you please approve when you have the chance? I have moved the option behind a flag.
If users are deploying WHL artifacts on a non-standard platform (not Databricks or Airflow), it becomes very difficult to track any associated Maven dependencies in PySpark.
This PR aims to gather that dependency information into a more useful format: a pom.xml and a list of Maven coordinates.
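For illustration, the coordinates list might look like the following (these specific artifacts and the filename `maven_coordinates.txt` are hypothetical, not the PR's actual output; note the {{SPARK_VERSION}} placeholder mentioned above):

```
org.apache.spark:spark-avro_2.12:{{SPARK_VERSION}}
io.delta:delta-core_2.12:2.4.0
```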
In simple situations the list of coordinates can be passed directly to the `--packages` option of a spark-submit command. Otherwise, the pom.xml will carry additional information, including any repositories or exclusions, so users can download the jars (`mvn dependency:copy-dependencies -DoutputDirectory=./`) and pass them to `--jars`.
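Here is a minimal sketch of those two consumption paths (the filename `maven_coordinates.txt`, the output directory `./jars`, and the script name `my_pipeline.py` are assumptions for illustration):

```bash
# 1) simple case: feed the coordinates list straight into --packages
#    (--packages expects a comma-separated list of Maven coordinates)
spark-submit \
  --packages "$(paste -sd, maven_coordinates.txt)" \
  my_pipeline.py

# 2) otherwise: resolve the jars from the generated pom.xml, then pass them via --jars
mvn -f pom.xml dependency:copy-dependencies -DoutputDirectory=./jars
spark-submit \
  --jars "$(echo ./jars/*.jar | tr ' ' ',')" \
  my_pipeline.py
```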
The only thing the user would need to do is either set SPARK_VERSION as an environment variable to choose which Spark version of PLIBS to install (3.3.0, 3.4.0, 3.5.0, etc.), or grep for {{SPARK_VERSION}} in the output files and replace it with a version string.
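For example, either of these would do (the version string is just an example, and `sed -i` syntax differs slightly on macOS):

```bash
# option A: set the environment variable so the placeholder resolves at runtime
export SPARK_VERSION=3.5.0

# option B: substitute the placeholder in the generated files afterwards
sed -i 's/{{SPARK_VERSION}}/3.5.0/g' pom.xml maven_coordinates.txt
```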