AlexIoannides / pyspark-example-project

Implementing best practices for PySpark ETL jobs and applications.

Hard changes in structure of the project and documentation. #7

Closed nkittsteiner closed 5 years ago

nkittsteiner commented 5 years ago

Why did you change this template so drastically? The project structure and documentation were great. Now it's confusing (pipenv).

AlexIoannides commented 5 years ago

Hi,

Thanks for the feedback - it's all good and interesting to hear.

Basically, the update reflects how I'm running ETL jobs now (at Perfect Channel). I'm a big fan of Pipenv as it makes day-to-day development much easier (in my opinion) and makes the bash script 'less hacky'. There are also improvements to the start_spark() function that make it easier to use when debugging or working from within IPython, etc.
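
To give a feel for what I mean, here is a rough sketch of how the helper gets used - not verbatim from the repo, and the file name is just illustrative:

```python
# Rough sketch (not verbatim from the repo): the start_spark() helper is meant
# to be usable both from spark-submit and interactively. When launched with
# spark-submit it picks up the session and any config file shipped via --files;
# when called from IPython it falls back to creating a local SparkSession, so
# the same ETL code can be stepped through interactively.
from dependencies.spark import start_spark  # module path assumed from the project layout

spark, log, config = start_spark(
    app_name='my_etl_job',
    files=['configs/etl_config.json'])  # file name is illustrative

log.warn('etl_job is up and running')
```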

BTW - there's still a section on Structure of an ETL Job in the README (is this what you were referring to?), and the older version can be found as v0.1 under Releases (or Branches -> Tags -> v0.1).

I'm going to re-open this for a bit to encourage other opinions.

Alex

nkittsteiner commented 5 years ago

Sorry, Alex, for being so critical about this, but it was a surprise for me. About the Structure of an ETL Job, it's OK, I was confused. About passing parameters to the ETL job: in my case I use STDIN params because they are more flexible and can be integrated with tools like Airflow or NiFi, but that's just my way of doing it. BTW, I want to thank you for this super cool project.
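
Just to make "STDIN params" concrete, here is a purely illustrative sketch (the parameter names are made up, and I'm assuming the driver process can read its STDIN, e.g. when submitted in client mode):

```python
# Purely illustrative: read job parameters as a JSON document written to STDIN
# by an orchestrator task (e.g. Airflow or NiFi); the parameter names are made up.
import json
import sys

params = json.load(sys.stdin)
input_path = params['input_path']    # hypothetical parameter
output_path = params['output_path']  # hypothetical parameter
```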

mehd-io commented 5 years ago

Hi @AlexIoannides , nice work on the template. I was actually doing the same on my side and just found your project. I quickly checked the changes, and I have a few comments:

Otherwise, the shell scripts for building the dependencies and the logging wrapper are great :)

AlexIoannides commented 5 years ago

> I would rename the "dependencies" folder to "helpers" because it really makes things confusing with other scripts/README sections that talk about Python dependencies or JAR dependencies.

That's a very good point.

> "how to pass configuration parameters to a PySpark job" - I understood it as STDIN params. Of course, having configuration files is always important for path configurations and so on. I can provide a pull request for this.

I have probably been indoctrinated by the 'proper' software developers that I sit with, but configuration parameters should be source-controlled, which passing them via a dedicated file enforces. Plus, some of the configs we use at Perfect Channel are so large that passing them in via STDIN isn't viable. But you're right - in an ideal world there would be support for both, because the bash scripts that pass the STDIN parameters can themselves be source-controlled, etc.
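
To make the comparison concrete, this is roughly what the dedicated-file approach looks like - a sketch only, with an illustrative file name and config key:

```python
# Sketch only: read a source-controlled JSON config that was shipped to the
# cluster with `spark-submit --files configs/etl_config.json`; the file name
# and the config key are illustrative.
import json

from pyspark import SparkFiles

with open(SparkFiles.get('etl_config.json')) as config_file:
    config = json.load(config_file)

steps_per_floor = config['steps_per_floor']  # illustrative config key
```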