datarootsio / skeleton-pyspark

A best-practices-first project template that lets you get started on a new PySpark project
MIT License

build: add the build command to the MAKE -- Wheels #25 #26

Closed · fbraza closed 2 years ago

fbraza commented 2 years ago

This is a pull request to address the issue raised by @sdebruyn.

I quickly tested the .whl on Databricks, and from there you can use the skeleton functions.

Cheers

vikramaditya91 commented 2 years ago

Hey @fbraza. Quick question: how did you test that the .whl works on Databricks? Did it not complain about missing dependencies, e.g. typer?

fbraza commented 2 years ago

Hello @vikramaditya91

No, it did not complain about that. The .whl is essentially a zip file that carries the requirements declared in your pyproject.toml, so everything needed should be referenced in the .whl. However, I did not use the package as if I were running it from a terminal with Typer; instead I imported the three ETL template functions you defined, and they were imported without any issue.

I don't think it is relevant to use the CLI command in the Databricks environment; moreover, a Spark session is already instantiated there. That is why I only tested whether the functions you defined could be used there, and it worked. ^_^
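
For reference, the notebook test was essentially something like the sketch below; the module path and function names are illustrative placeholders, not necessarily the skeleton's exact layout:

```python
# Hypothetical Databricks notebook cell: the module path and the
# extract/transform/load names are illustrative, not the skeleton's exact API.
from skeleton.jobs import extract, transform, load

# Databricks already provides a SparkSession as `spark`, so there is no
# session bootstrapping and no Typer CLI involved here.
raw_df = extract(spark, "dbfs:/tmp/input.csv")
transformed_df = transform(raw_df)
load(transformed_df, "dbfs:/tmp/output")
```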

vikramaditya91 commented 2 years ago

@fbraza It is not just about using it as a CLI. For example, if the pyproject.toml says that it needs numpy or boto3, and I am using numpy/boto3 for some reason in the ETL job, then it would presumably complain that they are missing from the environment. To ship the list of dependencies from the pyproject.toml, I had this make pack_req target, which packages the dependencies into a package.zip file using the Docker image.

The question is for @sdebruyn, I think. The poetry build command creates a .whl and a .tar.gz (sdist), but they only contain the source code that lives in here. The .whl also contains a METADATA file that declares the dependencies. If Databricks/Synapse is smart enough to infer and install these dependencies based on METADATA, then great. If not, I believe the dependencies packaged by make pack_req should be sent to Databricks/Synapse.
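
For reference, the declared dependencies can be read straight out of the wheel's METADATA without installing anything; a minimal sketch, assuming a wheel filename like the one poetry build would produce:

```python
import zipfile

# Illustrative wheel name; replace with the file that `poetry build` puts in dist/.
wheel_path = "dist/skeleton_pyspark-0.1.0-py3-none-any.whl"

with zipfile.ZipFile(wheel_path) as whl:
    # The METADATA file lives in the *.dist-info/ directory inside the wheel.
    metadata_name = next(
        name for name in whl.namelist() if name.endswith(".dist-info/METADATA")
    )
    metadata = whl.read(metadata_name).decode("utf-8")

# Requires-Dist lines are what pip (and Databricks) resolve and install;
# the dependency code itself is not bundled inside the wheel.
for line in metadata.splitlines():
    if line.startswith("Requires-Dist:"):
        print(line)
```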

fbraza commented 2 years ago

Thank you for your feedback @vikramaditya91. Concerning your point: if you install the package from the wheel, it also installs the dependencies. You can test this locally: add boto3 and numpy, build the .whl, then pip install your_wheel.whl and you will see that it installs all the dependencies. So this should not be a problem once you are on Databricks.
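
A quick way to double-check that local install, assuming the distribution is named skeleton-pyspark in pyproject.toml (adjust the names as needed):

```python
from importlib.metadata import PackageNotFoundError, requires, version

# "skeleton-pyspark" is an assumed distribution name; use the one from pyproject.toml.
print(requires("skeleton-pyspark"))  # dependencies declared in the wheel's METADATA

# Spot-check that pip actually installed them alongside the package.
for dep in ("numpy", "boto3", "typer"):
    try:
        print(dep, version(dep))
    except PackageNotFoundError:
        print(dep, "is missing")
```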

On Databricks you can use dbutils.library.install("dbfs:/path/to/your/library.whl") or, for .whl files only, the pip magic command %pip install /dbfs/path/to/your/library.whl.

vikramaditya91 commented 2 years ago

@fbraza Indeed, I just confirmed this with a Databricks notebook. When a .whl package is uploaded, it installs all the dependencies based on the METADATA file (pip install uses the METADATA file too). So this PR is good for me.

Confirming it locally would not have been sufficient, because you want to emulate how Databricks installs the package. If it had simply unzipped the .whl file, that would have been insufficient.

fbraza commented 2 years ago

Good! Yes, I was trying it on Databricks Community Edition but could not get a cluster to spin up ^_^.

Thanks for running the test yourself!