SimpleDataLabsInc / prophecy-build-tool

Prophecy Build Tool (PBT) allows you to quickly build projects generated by Prophecy (your standard Spark Scala and PySpark pipelines) and integrate them with your own CI/CD (e.g. GitHub Actions), build system (e.g. Jenkins), and orchestration (e.g. Databricks Workflows).
https://pypi.org/project/prophecy-build-tool
Apache License 2.0

added functionality to add pom.xml and list of coordinates to pyspark WHLs for dependency tracking #102

Closed · neontty closed this 3 months ago

neontty commented 3 months ago

If users are deploying WHL artifacts on a non-standard platform (not Databricks or Airflow), it becomes very difficult to track the associated Maven dependencies in PySpark.

This PR aims to gather that dependency information into a more useful format: a pom.xml and a list of Maven coordinates.

In simple situations, the list of coordinates can be passed directly to the --packages option of a spark-submit command. Otherwise, the pom.xml carries additional information, including any repositories or exclusions, so users can download the jars (mvn dependency:copy-dependencies -DoutputDirectory=./) and pass them to --jars.
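For illustration, here is a minimal sketch of both paths. The file names `coordinates.txt` and `my_pipeline.py` are assumptions, not outputs this PR guarantees; the coordinates format is the standard `groupId:artifactId:version`, one per line.

```bash
# Path 1: feed the coordinates list straight to --packages
# (--packages expects a comma-separated list of Maven coordinates)
spark-submit \
  --packages "$(paste -sd, - < coordinates.txt)" \
  my_pipeline.py

# Path 2: resolve jars via the generated pom.xml, then pass them to --jars
# (this route honors any custom repositories and exclusions declared in the pom)
mvn -f pom.xml dependency:copy-dependencies -DoutputDirectory=./jars
spark-submit \
  --jars "$(ls ./jars/*.jar | paste -sd, -)" \
  my_pipeline.py
```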

The only thing the user needs to do is either set SPARK_VERSION as an environment variable to choose which Spark version of PLIBS gets installed (3.3.0, 3.4.0, 3.5.0, etc.), or grep for {{SPARK_VERSION}} in the output files and replace it with a version string.
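A sketch of both options, assuming the generated pom.xml lands in the current directory (sed is just one way to do the substitution, not a mechanism this PR provides):

```bash
# Option 1: set the Spark version as an environment variable before the build
export SPARK_VERSION=3.5.0

# Option 2: substitute the {{SPARK_VERSION}} placeholder in the output files afterwards
# (the pom.xml path is an assumption about where the generated file ends up)
sed -i 's/{{SPARK_VERSION}}/3.5.0/g' pom.xml
```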

neontty commented 3 months ago

FYI, I haven't made a test suite yet, but I am testing with this repo: https://github.com/SimpleDataLabsInc/rj-test-spark-submit, which contains both project-level and pipeline-level Maven dependencies. I will add tests soon, but they may be best added after this PR.

The pom.xml is valid except for the {{SPARK_VERSION}} placeholder if it is not set at runtime. [screenshot]

neontty commented 3 months ago

Hi @jainabhinav, can you please approve when you have the chance? I have moved the option behind a flag.