AlexIoannides / pyspark-example-project

Implementing best practices for PySpark ETL jobs and applications.

Prevent untested dependencies from being packaged #9

Closed · oliverw1 closed this 5 years ago

oliverw1 commented 5 years ago

Upon reviewing your best practices for PySpark applications, I noticed that

  1. the unit tests wouldn’t run on Spark 2.3.1
  2. the zip file of dependencies gets generated from Pipfile instead of Pipfile.lock.

It is my understanding that using Pipfile.lock is the only way to get consistent builds. As I mentioned in one of the commit messages:

If Pipfile is altered manually after the last test run, the generated requirements.txt no longer matches the environment the tests were actually run against. The module pipenv-to-requirements solves exactly that problem.

Note that I have no affiliation with pipenv-to-requirements. It does, however, seem to fill a gap in packaging applications, such as Spark ETL jobs, that were developed with Pipenv.
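To make the idea concrete, here is a minimal sketch of pinning dependencies from Pipfile.lock (rather than Pipfile) and bundling them into a zip for spark-submit. It is illustrative only: the thread's actual suggestion is the pipenv-to-requirements package, whereas this snippet hand-rolls the same principle; the file names, the simple lock-parsing logic, and the `etl_job.py` entry point are assumptions, not this repository's real build script.

```python
# Sketch: pin dependencies from Pipfile.lock (not Pipfile) and bundle them
# into packages.zip for use with `spark-submit --py-files packages.zip`.
import json
import shutil
import subprocess
from pathlib import Path

# Read the resolved versions from Pipfile.lock, so the shipped zip matches
# exactly the environment the tests ran against.
lock = json.loads(Path("Pipfile.lock").read_text())
requirements = [
    name + spec.get("version", "")  # e.g. "requests==2.22.0"
    for name, spec in lock.get("default", {}).items()
]
Path("requirements.txt").write_text("\n".join(requirements) + "\n")

# Install the pinned packages into a local folder and zip up its contents.
subprocess.run(
    ["pip", "install", "-r", "requirements.txt", "--target", "packages"],
    check=True,
)
shutil.make_archive("packages", "zip", "packages")

# The resulting packages.zip can then be shipped alongside the job, e.g.:
#   spark-submit --py-files packages.zip etl_job.py   (etl_job.py is hypothetical)
```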

Other than that, I improved some spelling and used some more Pythonic constructs.

On an unrelated note, I really liked the read. I'm using a nearly identical approach for my Spark jobs at work, which is why it resonated with me and why I wanted to contribute back to your “best practices”.

AlexIoannides commented 5 years ago

Thanks Oliver - this looks great. I'll take a proper look on the weekend.

AlexIoannides commented 5 years ago

Cherry-picked in #10.