Upon reviewing your best practices for PySpark applications, I noticed that:

- the unit tests wouldn't run on Spark 2.3.1;
- the zipfile gets generated from Pipfile instead of Pipfile.lock.
It is my understanding that using Pipfile.lock is the only way to get reproducible builds. As I mentioned in one of the commit messages:
> If Pipfile is altered manually after the last series of tests was run, the generated requirements.txt no longer matches the environment against which the tests were executed. The module pipenv-to-requirements solves exactly that problem.
Note that I have no affiliation with pipenv-to-requirements. It does, however, seem to fill a gap in packaging applications such as Spark ETL jobs that were developed with Pipenv.
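To make the packaging flow concrete, here is a minimal sketch of the build step as I picture it. It assumes pipenv-to-requirements is installed in the project's virtualenv and that its `-f`/`--freeze` flag pins versions from Pipfile.lock, as its README describes; the file names (`packages.zip`, `etl_job.py`) are illustrative placeholders, not your project's actual layout:

```bash
#!/usr/bin/env bash
# Sketch of a dependency-packaging step driven by Pipfile.lock.
# Assumes pipenv and pipenv-to-requirements are available; the file
# names below are placeholders, not the project's real layout.
set -euo pipefail

# Generate requirements.txt from the pinned versions in Pipfile.lock,
# rather than from the (possibly hand-edited) Pipfile.
pipenv run pipenv_to_requirements --freeze

# Install the pinned dependencies into a throwaway directory.
pip install -r requirements.txt --target ./packages

# Zip them up so spark-submit can ship them to the executors.
cd packages
zip -9 -r ../packages.zip .
cd ..

# Submit the job with the dependencies attached.
spark-submit --py-files packages.zip etl_job.py
```

The only essential change is the first command: everything downstream consumes a requirements.txt that now mirrors Pipfile.lock instead of a possibly stale Pipfile.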
Other than that, I fixed some spelling and swapped in a few more Pythonic constructs.
On an unrelated note, I really liked the read. I'm using a nearly identical approach for my Spark jobs at work, which is why it resonated with me and I wanted to contribute back to your "best practices".