MrPowers / mack

Delta Lake helper methods in PySpark
https://mrpowers.github.io/mack/
MIT License

Brainstorm correct ways to include PySpark & Delta dependencies in pyproject.toml file #48

Open MrPowers opened 1 year ago

MrPowers commented 1 year ago

Users have to supply a correct combination of Spark & Delta Lake versions for their setup to work, see the compatibility matrix.

Mack depends on PySpark & Delta Lake. We want Mack to work with a variety of Spark & Delta Lake combinations.

Here's how the dependencies are currently specified in the pyproject.toml file:

[tool.poetry.dependencies]
python = "^3.9"

[tool.poetry.dev-dependencies]
pre-commit = "^2.20.0"
pyspark = "3.3.1"
delta-spark = "2.1.1"
pytest = "7.2.0"
chispa = "0.9.2"
pytest-describe = "^1.0.0"

I'm not sure the best way to specify dependencies using Poetry to give our users the best Mack download experience. Thoughts?

squerez commented 1 year ago

I am not sure what would be the correct way, but maybe we could apply something like:

[tool.poetry.extras]
delta2.2-spark3.3 = ["delta-spark^2.2.0", "pyspark^3.3.1"]
delta2.1-spark3.3 = ["delta-spark^2.1.1", "pyspark^3.3.1"]

and then the user would just need to run `poetry update; poetry install -E delta2.1-spark3.3` to install the desired dependencies.

One of the pitfalls would be making sure that Python 3.9 works for all dependency combinations, and we'd need to create tests for each combination to make sure it works with the application code. What do you think?

I haven't given this idea a try yet, but looking at this issue gave me the impression that this could work.
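For reference, Poetry's extras mechanism works a bit differently from the sketch above: `[tool.poetry.extras]` lists package *names* that are declared as optional dependencies, and the version constraints live on those declarations, not inside the extras table. A minimal sketch (the version ranges here are illustrative):

```toml
[tool.poetry.dependencies]
python = "^3.9"
pyspark = { version = ">=3.2,<3.4", optional = true }
delta-spark = { version = ">=2.0,<2.3", optional = true }

[tool.poetry.extras]
spark = ["pyspark", "delta-spark"]
```

One caveat: since each package can only carry a single version constraint (absent environment markers), separate extras like `delta2.1-spark3.3` and `delta2.2-spark3.3` that pin *different* versions of the same packages don't appear to be directly expressible in Poetry this way.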

MrPowers commented 1 year ago

@joao-fm-santos - this blog post has more context on the issue from a usability perspective.

Mack will typically be included as a dependency in other projects' dependency files. I'm not sure how we set up a Python project to correctly install a specific Delta Lake version based on the PySpark version that the user specified.

@danielbeach - FYI, we're looking into this issue.

@alexott - feel free to provide suggestions.

squerez commented 1 year ago

@MrPowers thanks for the blog post, really helpful! Unless I've misunderstood the problem, I believe adding extras would be a good way to solve this issue: it lets users install dependencies with standard Poetry syntax and choose whichever versions they prefer.

For example, a user could:

[tool.poetry.dependencies]
mack = { version = "*", extras = ["delta2.2-spark3.3"] }

For the pip installation, I believe we would need to change setup.cfg to include extras like so.

I have not tried this, but let me know if I am missing the point here!
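For the setuptools side, extras declared in `setup.cfg` via `extras_require` *can* carry full version specifiers, so the per-combination naming scheme does seem expressible there. A rough sketch (the extra names and exact version ranges are illustrative):

```ini
[options.extras_require]
delta2.2-spark3.3 =
    delta-spark>=2.2.0,<2.3
    pyspark>=3.3.0,<3.4
delta2.1-spark3.3 =
    delta-spark>=2.1.0,<2.2
    pyspark>=3.3.0,<3.4
```

A user would then install with something like `pip install "mack[delta2.1-spark3.3]"`.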

MrPowers commented 1 year ago

@joao-fm-santos - yea, extras could be the right way to solve this. I don't know.

We need a solution that will work in a variety of execution contexts.

One of my other projects uses a library called findspark. Is it possible we need a library like finddelta?
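A finddelta-style helper could, as one sketch, read the installed `delta-spark` and `pyspark` versions at import time and fail fast on a known-bad pairing. Everything below is hypothetical: `COMPATIBLE` is only an illustrative subset of the compatibility matrix, and `is_compatible` / `check_installed` are made-up names, not an existing API.

```python
# Hypothetical "finddelta"-style compatibility check (sketch only).
from importlib.metadata import PackageNotFoundError, version

# Delta Lake minor version -> required PySpark minor version.
# Illustrative subset of the compatibility matrix, not authoritative.
COMPATIBLE = {
    "2.1": "3.3",
    "2.2": "3.3",
}


def _minor(v: str) -> str:
    """Reduce a full version string like '3.3.1' to its minor series '3.3'."""
    return ".".join(v.split(".")[:2])


def is_compatible(delta_version: str, spark_version: str) -> bool:
    """Check a delta-spark / pyspark version pair against the matrix."""
    return COMPATIBLE.get(_minor(delta_version)) == _minor(spark_version)


def check_installed() -> None:
    """Fail fast if the installed delta-spark / pyspark pair is mismatched."""
    try:
        delta, spark = version("delta-spark"), version("pyspark")
    except PackageNotFoundError as exc:
        raise RuntimeError(f"missing dependency: {exc}") from exc
    if not is_compatible(delta, spark):
        raise RuntimeError(
            f"delta-spark {delta} is not known to work with pyspark {spark}"
        )
```

This only moves the failure earlier (a clear error instead of a cryptic runtime one); it doesn't solve the install-time resolution problem that extras aim at.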