databrickslabs / dbx

🧱 Databricks CLI eXtensions - aka dbx is a CLI tool for development and advanced Databricks workflows management.
https://dbx.readthedocs.io

DBX example (Python quickstart) coverage test won't run due to dependency issues #835

Open seboktamas opened 11 months ago

seboktamas commented 11 months ago

Expected Behavior

Executing the code from https://dbx.readthedocs.io/en/latest/guides/python/python_quickstart/ works.

Current Behavior

Running pytest tests/unit --cov raises an exception: AttributeError: 'DataFrame' object has no attribute 'iteritems'
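A quick way to check whether the installed pandas is one of the releases without `iteritems` is sketched below. It relies only on the fact that `DataFrame.iteritems` was removed in pandas 2.0; the helper names `iteritems_removed` and `installed_pandas_version` are made up for this illustration and are not part of dbx, pandas, or pyspark.

```python
from importlib import metadata
from typing import Optional


def iteritems_removed(pandas_version: str) -> bool:
    """Return True for pandas 2.0 and later, where DataFrame.iteritems no longer exists."""
    major = int(pandas_version.split(".")[0])
    return major >= 2


def installed_pandas_version() -> Optional[str]:
    """Return the installed pandas version string, or None if pandas is not installed."""
    try:
        return metadata.version("pandas")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_pandas_version()
    if version is None:
        print("pandas is not installed in this environment")
    elif iteritems_removed(version):
        print(f"pandas {version}: DataFrame.iteritems has been removed")
    else:
        print(f"pandas {version}: DataFrame.iteritems is still available")
```

If this reports that `iteritems` has been removed, a pyspark release that still calls it will fail with the AttributeError above.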

Steps to Reproduce (for bugs)

Follow the instructions and execute the code from https://dbx.readthedocs.io/en/latest/guides/python/python_quickstart/

Context

This happens because the pyspark version is pinned while the pandas version is not. In pandas, 'iteritems' was deprecated and later removed (in pandas 2.0). Upgrading pyspark (and delta-spark) to the latest version fixes the issue, but I first had to fix another problem: the Python version in the example is pinned to 3.9 and my environment has Python 3.11, so I got the following error: _Python in worker has different version 3.11 than that in driver 3.9, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set._

After setting the worker version to python3.9, it worked. (The documentation should note that this version needs to be taken care of as well.)
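For anyone hitting the same version mismatch, here is a minimal sketch of one way to keep the worker and driver interpreters aligned from Python. It only sets the two environment variables named in the error message. The helper name `align_pyspark_python` is hypothetical, and placing it in something like `tests/conftest.py` is an assumption rather than what the quickstart prescribes. It needs to run before the Spark session is created.

```python
import os
import sys


def align_pyspark_python(executable: str = sys.executable) -> None:
    """Point both PySpark interpreter variables at the same Python executable.

    Hypothetical helper: by default it uses the interpreter running the tests,
    so the driver and the workers share the same minor version.
    """
    os.environ["PYSPARK_PYTHON"] = executable
    os.environ["PYSPARK_DRIVER_PYTHON"] = executable


# Call this before any SparkSession is built, e.g. at import time of the test setup.
align_pyspark_python()
```

Exporting the same two variables in the shell before running pytest should have the same effect; the Python version is just easier to keep next to the test fixtures.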

Your Environment

platform darwin -- Python 3.9.17