Closed tommylees112 closed 6 years ago
Hi Tommy, it's good to raise questions like this because they suggest ways in which the documentation can be sharpened.
- Where do I put this? Is it inside the make_dataset.py file?
Yes, you can put it in this or any other file, but just reading in the data is sort of boring; most scripts will then output it somewhere, which is where the real action happens. On a concrete level, you should check out the documentation for Click, which is a neat way of making Python scripts more CLI-friendly and is used in our example make_dataset.py.
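As a rough sketch of what such a script might look like (this is not the template's actual make_dataset.py): it reads a raw CSV, applies a cleaning step, and writes the result somewhere else. It uses argparse from the standard library rather than Click so it runs without extra installs; Click expresses the same kind of interface with decorators. The function names and the cleaning rule are made up for illustration.

```python
import argparse
import csv
import sys


def clean_rows(rows):
    """Hypothetical cleaning step: keep only rows where every field is filled in."""
    return [row for row in rows if row and all(field.strip() for field in row)]


def make_dataset(input_filepath, output_filepath):
    """Read a raw CSV, clean it, write the result; return the number of rows written."""
    with open(input_filepath, newline="") as src:
        rows = list(csv.reader(src))
    cleaned = clean_rows(rows)
    with open(output_filepath, "w", newline="") as dst:
        csv.writer(dst).writerows(cleaned)
    return len(cleaned)


def main(argv=None):
    """Command-line entry point: one input path, one output path."""
    argv = sys.argv[1:] if argv is None else argv
    parser = argparse.ArgumentParser(description="Turn a raw data file into a cleaned one.")
    parser.add_argument("input_filepath", help="path to the raw CSV")
    parser.add_argument("output_filepath", help="where to write the cleaned CSV")
    if not argv:
        # Called with no arguments: show usage instead of failing.
        parser.print_help()
        return 0
    args = parser.parse_args(argv)
    n_rows = make_dataset(args.input_filepath, args.output_filepath)
    print(f"wrote {n_rows} rows to {args.output_filepath}")
    return 0


if __name__ == "__main__":
    main()
```

You would then run it from the project root with something like python src/data/make_dataset.py data/raw/input.csv data/processed/clean.csv (both paths are placeholders). Keeping make_dataset() separate from main() means the same function can also be imported from a notebook or a test.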
- How do I run this? Where is the 'controller' that runs all of my functions in the correct order? Where is my interface to interact with the code? I'm assuming it's meant for the command line?
To answer your question briefly, there's a way of orchestrating data flows that treats a script like make_dataset.py as a building block that does one thing well (see also: the Unix philosophy). Specifically, it takes one or more inputs (e.g. a raw data file) and spits out one or more outputs (e.g. a cleaned data file).
But you need a way to tie together all these building blocks into a DAG (directed acyclic graph), which is where Make (or another tool like it) comes in. We write a little about that in the docs, but I like how Mike Bostock (of D3 fame) puts it: https://bost.ocks.org/mike/make/
The dream is being able to go into your project directory, type make report or what have you, and watch the well-oiled machine spin into action: downloading new data, running analyses, training models, and outputting finished products.
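To make the "controller" idea a bit more concrete, here is a toy Python sketch of the core rule Make applies: a target is rebuilt only if it is missing or older than one of its inputs, and its inputs are brought up to date first. This is an illustration of the idea, not how Make is implemented, and Make does far more (pattern rules, parallel builds, phony targets). The names is_stale, build, and rules are invented for this example.

```python
import os


def is_stale(target, sources):
    """True if the target file is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def build(target, rules, log=None):
    """Bring `target` up to date, handling its dependencies first.

    `rules` maps a target path to (list of source paths, recipe function).
    A path without a rule is treated as a raw input that must already exist.
    Returns the list of targets that were actually rebuilt, in order.
    """
    if log is None:
        log = []
    if target not in rules:
        if not os.path.exists(target):
            raise FileNotFoundError(f"no rule to make {target}")
        return log
    sources, recipe = rules[target]
    for src in sources:
        # Walk the dependency graph depth-first so inputs are fresh.
        build(src, rules, log)
    if is_stale(target, sources):
        recipe()  # the recipe is expected to create or update the target file
        log.append(target)
    return log
```

With a rules dictionary that says the cleaned data depends on the raw data and the report depends on the cleaned data, calling build() on the report runs both steps the first time and nothing the second time, which is roughly what typing make report gives you.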
(Looks like you're a PhD student, so it may be interesting to hear that some have even automated their PhD dissertation building: data -> code -> figures -> LaTeX -> pdf!)
This is a deep topic, so I'm not sure a GitHub issue is the best place to discuss it further, but feel free to email me; my contact info is easy to find.
Edit: adding more
Hi isms, this has been incredibly helpful, thank you! I have managed to get the project up and running and I have been really impressed with the template. The best thing is that I am working with collaborators and none of us has a software engineering background. This means that I have been somewhat responsible for structuring the code, but I can point people to your page to explain why I have chosen to do things in a particular way.
I do think that for absolute beginners to software engineering (which many PhD students are) it would be great to have some more information about where the code is written and executed. Maybe that's not fair, because I understood it eventually, but it might be worth thinking about. I read the Mike Bostock page about the Makefile and the other references in the docs that explain the R projects that are looking to do something similar.
Yesterday I also came across an article on how to use Docker containers to recreate your environment. I had an issue with your script because it wasn't playing nicely with my conda environment, and I wanted to use the conda export function to write an environment.yml file to share the packages required for my project.
These are all minor issues! Thank you for introducing me to structuring my projects! As I move forwards I hope that I can keep using your template and maybe even contribute once I have something useful to add!
Tommy
@tommylees112 Great! Closing the ticket for now but will plan to include this clarification when working on the docs.
I am new to building packages / data pipelines. I have mostly been writing my analysis in notebooks. I have myself asked the questions that you pose in your introductions to your template.
But how do I write my functions?
Say if I have a simple function:
- Where do I put this? Is it inside the make_dataset.py file?
- How do I run this? Where is the 'controller' that runs all of my functions in the correct order? Where is my interface to interact with the code? I'm assuming it's meant for the command line?
So do I enter each file manually from the command line?
python src/data/make_dataset.py
Sorry if these are all very basic questions but I have been searching and can't find the help I need. I'm guessing this is more a problem with not knowing the correct question to ask.