data-mississippi / data-ms-app

core app for data platform

Home Page proof of concept and ETL POC #23

Closed smcalilly closed 3 years ago

smcalilly commented 3 years ago

Overview

Context

First let me give a quick overview of this application.

The backend is a Django app. It serves JSON to the React frontend.

I have several goals for the current iteration-in-progress:

PR

Some things are wrong with this application's code that aren't included in this PR, especially since this is where I've learned Django, Docker, and how to do stuff with maps (and don't even get me started on that messy README). Feel free to pick apart anything you'd like, but it might give you more work than you want. With that said, this PR needs the most help with the data and ETL work.

ETL

I'm trying to download demographic data for Mississippi counties. I'm getting surveys from multiple years so that I can compare the years and make some visualizations with the data. This is all happening in this processing script. (I'm probably doing this all wrong.)
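For readers following along, here is a minimal sketch of the kind of year-over-year comparison described above. It is not the processing script from this PR: the row shape (`county`, `year`, `value`), the function names, and the sample numbers are all placeholders invented for illustration.

```python
from collections import defaultdict


def pivot_by_county(rows):
    """Group flat (county, year, value) rows into {county: {year: value}}."""
    table = defaultdict(dict)
    for row in rows:
        table[row["county"]][row["year"]] = row["value"]
    return dict(table)


def percent_change(table, start_year, end_year):
    """Return {county: percent change} for counties that have both years."""
    changes = {}
    for county, by_year in table.items():
        if start_year not in by_year or end_year not in by_year:
            continue  # skip counties missing one of the two surveys
        start = by_year[start_year]
        if start == 0:
            continue  # percent change is undefined for a zero baseline
        changes[county] = (by_year[end_year] - start) / start * 100
    return changes


# Placeholder rows, not real survey figures.
SAMPLE_ROWS = [
    {"county": "county-a", "year": 2015, "value": 200},
    {"county": "county-a", "year": 2019, "value": 250},
    {"county": "county-b", "year": 2015, "value": 400},
    {"county": "county-b", "year": 2019, "value": 300},
    {"county": "county-c", "year": 2019, "value": 120},
]

table = pivot_by_county(SAMPLE_ROWS)
changes = percent_change(table, 2015, 2019)
```

Keeping the pivot and the comparison as separate steps means the same pivoted table can feed several visualizations without re-reading the downloaded files.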

I'm also fetching some geojson. (Note that I'm storing the geojson for a general map of Mississippi in the React app, so that it's served with that code in order to load the home page map the most efficiently -- at least this is how it was working like two months ago). The rest of the geojson will be stored in a database and served via a JSON API.
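As a rough illustration of serving stored geometries through a JSON API, the sketch below wraps records in a GeoJSON FeatureCollection using only the standard library. The record shape, the names `to_feature` and `to_feature_collection`, and the coordinates are assumptions for illustration. The app's actual models and API are not shown in this thread.

```python
import json


def to_feature(record):
    """Wrap one stored record as a GeoJSON Feature."""
    return {
        "type": "Feature",
        "geometry": record["geometry"],
        "properties": {"name": record["name"]},
    }


def to_feature_collection(records):
    """Wrap many records as a GeoJSON FeatureCollection."""
    return {
        "type": "FeatureCollection",
        "features": [to_feature(record) for record in records],
    }


# Placeholder ring in [longitude, latitude] order; not a real county boundary.
# The first and last positions match, as GeoJSON requires for a polygon ring.
PLACEHOLDER_RING = [
    [-90.0, 32.0],
    [-89.0, 32.0],
    [-89.0, 33.0],
    [-90.0, 33.0],
    [-90.0, 32.0],
]

RECORDS = [
    {
        "name": "county-a",
        "geometry": {"type": "Polygon", "coordinates": [PLACEHOLDER_RING]},
    },
]

# The string a JSON endpoint could return to the frontend.
payload = json.dumps(to_feature_collection(RECORDS))
```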

I have yet to advance to the level of Make where I can do these things (but it makes more sense after reading the doc again today):

Testing

smcalilly commented 3 years ago

@hancush I can't express enough how much I appreciate this code review. I'm confused by what I was trying to do with this pipeline (why am I executing the Python script twice?), so it must've been difficult for you to understand, too... I've made some improvements to the Python and am now figuring out the Make stuff. I'll re-request a review once it's ready.

smcalilly commented 3 years ago

@hancush This is ready for review. You should be able to do make all and it will create output data, but you will first need to cd etl so you're in the correct directory (changing that is on my to-do). The data is currently going to the backend and frontend directories and this confuses me, so I need to rethink that strategy before I make any big changes (especially the ~/backend/raw dir because it has some data that I'm already using).

Some questions I have:

smcalilly commented 3 years ago

@hancush It's getting better. This branch is in a place where you can pull down the code and run docker-compose run --rm backend make all.

Currently, the pipeline is writing to one place, an output directory. It doesn't yet load the data into the application's DB, but that's my next step. This might entail some design changes, so I need to think more about how I want to do this.

I also ran into some Docker gotchas. This gave me a chance to think more about how to design Docker images, and I have some ideas for improvement. (For example, the ETL stuff lives in the backend directory, because this directory has its own Docker image due to this app's Docker design. For now, it's easiest to package the ETL code into that image.)

In your notes, you said:

Don’t forget to use the phony target for recipes that don’t generate a file

Can you explain where to do this, again? I don't remember the specifics about this note.
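For context on the note quoted above: Make treats every target as a file name unless told otherwise, so a target like all or clean, whose recipe never creates a file with that name, is usually declared with `.PHONY: all clean`. The sketch below is a simplified Python model of that decision rule, not Make's real implementation. The function name `needs_run` is hypothetical, and it assumes every prerequisite already exists on disk.

```python
import os


def needs_run(target, prerequisites=(), phony=False):
    """Simplified model of how Make decides whether to run a recipe.

    Assumes all prerequisites already exist as files.
    """
    if phony:
        # Phony targets are never treated as files, so the recipe always runs.
        return True
    if not os.path.exists(target):
        # No file with the target's name: the recipe must run to create it.
        return True
    # The file exists: run only if some prerequisite is newer than it.
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

In this model, if a stray file named all ever appears in the directory, `needs_run("all")` returns False and the recipe is silently skipped, while `needs_run("all", phony=True)` still returns True. That is the failure the phony declaration guards against.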

And please let me know if there are any other ways I can improve this pipeline, or if anything about it looks weird.

smcalilly commented 3 years ago

@hancush I've reached the goal where you can run docker-compose run --rm backend make all. This single command will load the data into the database, using a custom management command (I copied one of the patterns in an ETL example you showed me). Let me know if you have any problems or see areas for improvement.

There are a few issues that came out of this work, so I need to focus on improving those. I especially need to learn more about geojson and postgis, because I'm doing some naive stuff with that. For now, I think this goal has met the objective of "Learn ETL". (At least, I'm familiar with it, but don't really know, you know?) I've identified some ways to improve the data I need, which will require some non-ETL decisions/changes (as well as some ETL changes). So, I think this PR is ready to come in -- this code is a solid basis for some next steps, and merging the code will allow me to regroup/refocus.

smcalilly commented 3 years ago

Thanks @hancush! I learned a lot from your review, especially about Make and Django. Got a lot of things in the backlog to improve now.