Closed nkittsteiner closed 5 years ago
Hi,
Thanks for the feedback - it's all good and interesting to hear.
Basically, the update reflects how I'm running ETL jobs now (at Perfect Channel). I'm a big fan of Pipenv as it makes day-to-day development much easier (in my opinion) as well as making the bash script 'less hacky'. There are also improvements to the start_spark() function that make it easier to use when debugging or using from within IPython, etc.
BTW - there's still a section on Structure of an ETL Job in the README (is this what you were referring to?), and the older version can be found as v0.1 under Releases (or Branches -> Tags -> v0.1).
I'm going to re-open this for a bit to encourage other opinions.
Alex
Sorry Alex for being so critical about this, but it was a surprise for me. About the Structure of an ETL Job: it's OK, I was confused.
About passing parameters to the ETL Job: in my case I use STDIN params because they are more flexible and can be integrated with tools like Airflow or NiFi, but that's just my way of doing it. BTW I want to thank you for this super cool project.
Hi @AlexIoannides, nice work on the template. I was actually doing the same on my side and just found your project. I quickly checked the changes, and I have a few comments:
Otherwise, the sh scripts to build dep and logging wrapper are great :)
I would rename the "dependencies" folder to "helpers" because the current name is really confusing next to the other scripts/README sections that talk about Python dependencies or JAR dependencies.
That's a very good point.
"how to pass configuration parameters to a PySpark job" - I understood it as STDIN params. Of course, having configuration files is always important for path configurations and so on. I can provide a pull request on this.
I have probably been indoctrinated by the 'proper' software developers that I sit with, but the configuration parameters should be source controlled, which passing them via a dedicated file enforces. Plus, some of the configs we use at Perfect Channel are so large that passing them into STDIN isn't viable. But you're right, in an ideal world there'd be support for both, because bash scripts that contain the STDIN can themselves be source controlled, etc.
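For illustration only, here is a minimal sketch of what "support for both" could look like, assuming the configuration is JSON. The function name load_config and its arguments are hypothetical and are not part of the template; the idea is simply to read from a dedicated file when a path is given and fall back to STDIN otherwise.

```python
import io
import json
import sys


def load_config(path=None, stream=None):
    """Return job configuration as a dict (hypothetical helper, not the template's API).

    If `path` is given, JSON is read from that file, which keeps large
    configs in source control. Otherwise JSON is read from `stream`,
    which defaults to STDIN so tools such as Airflow or NiFi can pipe it in.
    """
    if path is not None:
        with open(path, "r", encoding="utf-8") as config_file:
            return json.load(config_file)

    stream = sys.stdin if stream is None else stream
    text = stream.read().strip()
    # Empty input is treated as "no configuration supplied".
    return json.loads(text) if text else {}


# Simulate piped STDIN input with an in-memory stream.
piped = io.StringIO('{"steps_per_floor": 21}')
config = load_config(stream=piped)
```

A job's entry point could then pass a file path when one is supplied on the command line and leave it as None to read whatever was piped in.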
Why did you change this template so drastically? The project structure and documentation were great. Now it's confusing (pipenv).