drivendataorg / cookiecutter-data-science

A logical, reasonably standardized, but flexible project structure for doing and sharing data science work.
https://cookiecutter-data-science.drivendata.org/
MIT License

How do I write my own code in `src/data/make_dataset.py`? #130

Closed tommylees112 closed 6 years ago

tommylees112 commented 6 years ago

I am new to building packages / data pipelines. I have mostly been writing my analysis in notebooks, and I have asked myself the very questions that you pose in the introduction to your template.

  1. Are we supposed to go in and join the column X to the data before we get started or did that come from one of the notebooks?
  2. Come to think of it, which notebook do we have to run first before running the plotting code: was it "process data" or "clean data"?
  3. Where did the shapefiles get downloaded from for the geographic plots?

But how do I write my functions?

Say I have a simple function:

```python
import pandas as pd

data_dir = "data/raw/"  # directory holding the raw data

def read_in_data():
    df = pd.read_csv(data_dir + "example.csv")
    return df
```

  1. Where do I put this? Is it inside the `make_dataset.py` file?

  2. How do I run this? Where is the 'controller' that runs all of my functions in the correct order? Where is my interface to interact with the code? I'm assuming it's meant for the command line?

So do I run each file manually from the command line, e.g. `python src/data/make_dataset.py`?

Sorry if these are all very basic questions but I have been searching and can't find the help I need. I'm guessing this is more a problem with not knowing the correct question to ask.

isms commented 6 years ago

Hi Tommy, it's good to raise questions like this because they suggest ways in which the documentation can be sharpened.

  1. Where do I put this? Is it inside the make_dataset.py file?

Yes, you can put it in this or any other file, but just reading in the data is sort of boring; most scripts will then output it somewhere, which is where the real action happens. On a concrete level, you should check out the documentation for Click, which is a neat way of making Python scripts more CLI-friendly and is used in our example `make_dataset.py`.
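For illustration, here is a minimal sketch of what a Click-style script could look like (the file paths and the cleaning step are hypothetical placeholders, not the template's actual code):

```python
# A minimal sketch of a Click-based processing script. Paths and the
# cleaning step are made up for illustration; see the template's
# make_dataset.py for the real thing.
import click
import pandas as pd


@click.command()
@click.argument("input_filepath", type=click.Path(exists=True))
@click.argument("output_filepath", type=click.Path())
def main(input_filepath, output_filepath):
    """Read raw data from INPUT_FILEPATH, clean it, write it to OUTPUT_FILEPATH."""
    df = pd.read_csv(input_filepath)
    df = df.dropna()  # stand-in for whatever cleaning your project needs
    df.to_csv(output_filepath, index=False)


if __name__ == "__main__":
    main()
```

You would then run it with something like `python src/data/make_dataset.py data/raw/example.csv data/processed/example.csv`.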

  2. How do I run this? Where is the 'controller' that runs all of my functions in the correct order? Where is my interface to interact with the code? I'm assuming it's meant for the command line?

To answer briefly: there's a way of orchestrating data flows that treats a script like `make_dataset.py` as a building block that does one thing well (see also: the Unix philosophy). Specifically, it takes one or more inputs (e.g. a raw data file) and spits out one or more outputs (e.g. a cleaned data file).

But you need a way to tie all these building blocks together into a DAG, which is where Make (or another tool like it) comes in. We write a little about that here in the docs, but I like how Mike Bostock (of D3 fame) puts it: https://bost.ocks.org/mike/make/

The dream is being able to go into your project directory, type `make report` or what have you, and watch the well-oiled machine spin into action: downloading new data, running analyses, training models, and outputting finished products.
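To make that concrete, here is a toy Makefile along those lines (the targets, paths, and script names are made up for illustration, not taken from the template):

```makefile
# A toy sketch, not the template's actual Makefile: file and script names
# are made up for illustration.
data/processed/example.csv: data/raw/example.csv src/data/make_dataset.py
	python src/data/make_dataset.py data/raw/example.csv data/processed/example.csv

reports/figures/summary.png: data/processed/example.csv src/visualization/visualize.py
	python src/visualization/visualize.py data/processed/example.csv reports/figures/summary.png

.PHONY: report
report: reports/figures/summary.png
```

Typing `make report` then rebuilds only the steps whose inputs have changed, which is exactly the "well-oiled machine" above.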

(Looks like you're a PhD student, so it may be interesting to hear that some have even automated their PhD dissertation building: data -> code -> figures -> LaTeX -> pdf!)

This is a deep topic, so I'm not sure a GitHub issue is the best place to discuss it further, but feel free to email me; my contact info is easy to find.


tommylees112 commented 6 years ago

Hi @isms, this has been incredibly helpful, thank you! I have managed to get the project up and running and I have been really impressed with the template. The best thing is that I am working with collaborators, none of whom have a software engineering background. This means that I have been somewhat responsible for structuring the code, but I can point people to your page to explain why I have chosen to do things in a particular way.

I do think that for absolute beginners to software engineering (which many PhD students are) it would be great to have some more information about where the code is written and executed. Maybe that's not fair because I understood it eventually, but it might be worth thinking about. I read the Mike Bostock page about Makefiles and the other references in the docs that describe R projects trying to do something similar.

Yesterday I also came across an article that looks at how to use Docker containers to recreate your environment. I had an issue with your script because it wasn't playing nicely with my conda environment, and I wanted to use `conda env export` to write an `environment.yml` file to share the packages required for my project.
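Concretely, the workflow I had in mind was something like this (a sketch, assuming conda is installed and the project's environment is currently active):

```bash
# Export the active conda environment so collaborators can recreate it.
conda env export > environment.yml

# A collaborator then rebuilds the same environment with:
conda env create -f environment.yml
```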

These are all minor issues! Thank you for introducing me to structuring my projects! As I move forwards I hope that I can keep using your template and maybe even contribute once I have something useful to add!

Tommy

isms commented 6 years ago

@tommylees112 Great! Closing the ticket for now but will plan to include this clarification when working on the docs.