maxachis closed this issue 3 years ago
The best way to do this is to switch as many source datasets as possible from CSVs to APIs.
The good news is our major datasets can pull from a regularly updating API!
The Just Harvest sites (Food Bucks, Fresh Access, Fresh Corners) are currently a Google Sheet that we convert to a .xlsx. Possibly we could connect directly to the Google Sheet for this info?
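As a hedged sketch of the direct-connection idea: if the sheet is shared publicly, it can be read straight from its CSV export URL instead of going through a manual .xlsx download. The sheet ID below is a placeholder, not the real document ID.

```python
# Placeholder ID -- substitute the real Just Harvest sheet's document ID.
SHEET_ID = "PLACEHOLDER_SHEET_ID"

# Public Google Sheets expose a CSV export endpoint at this URL pattern.
csv_url = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/export?format=csv"

# Then e.g. pandas.read_csv(csv_url) would load it as a DataFrame.
```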
Lastly, I just realized the WIC sites appear to update weekly, but it actually looks like stale data from 2015. I'm trying to find a good alternative. The state's "Find WIC Stores" website would have to be scraped to get a table of sites: https://www.pawic.com/FindWICStores.aspx
Another thought: since our data workflow now exports a timestamped CSV, does that create a problem for our Heroku workflow? Like, is Heroku going to be looking for the plain old merged_data.csv file?
Conor + Ellie will work on the SNAP site API.
The FindWICStores page sends an AJAX request that returns JSON data. See view-source:https://www.pawic.com/FindWICStores.aspx, "function getStores() {" on line 491.
@cgmoreno has my code for hitting the SNAP API
I've got Matt's API call code...will try to whip it into shape for next meeting! I'll be doing this in R using the jsonlite package.
I looked into the WIC request a little; the data you send should probably be: `{county: 2, zip: 15212, miles: 20}`. Here 2 means Allegheny (probably because it's second in the list), and 15212 is pretty central. 20 miles is the max the drop-down goes to, which might just barely cover the whole county if you started from a point dead center. If the API lets you use 30 miles, that might be better.
In Chrome at least, you can go to the dev console, open the Network tab, and filter by "method:POST" to see the outgoing POST request when you hit the button. As far as I know, you probably want to copy some/most of the request header values, but I don't have a lot of experience with this.
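To sketch what that POST might look like in Python with the standard library (untested against the live endpoint; the headers and payload here are just the values visible in the Network tab, and the server may require more of them):

```python
import json
import urllib.request

# Request body as discussed above; values are strings, matching the form's POST.
payload = {"county": "2", "zip": "", "miles": "5"}
body = json.dumps(payload).encode("utf-8")

req = urllib.request.Request(
    "https://www.pawic.com/FindWICStores.aspx/FindStores",
    data=body,
    headers={
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Content-Type": "application/json; charset=UTF-8",
    },
    method="POST",
)

# Uncomment to actually hit the endpoint:
# with urllib.request.urlopen(req) as resp:
#     stores = json.loads(json.loads(resp.read())["d"])["Result"]
```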
Here is some (working) Python code:

```python
import json
import shlex
from subprocess import run

wgetCommand = """wget -q -O - \
    --method POST \
    --header 'Accept: application/json, text/javascript, */*; q=0.01' \
    --header 'Content-Type: application/json; charset=UTF-8' \
    --body-data '{"county":"2","zip":"","miles":"5"}' \
    'https://www.pawic.com/FindWICStores.aspx/FindStores'"""

args = shlex.split(wgetCommand)
results = json.loads(json.loads(run(args, capture_output=True).stdout)['d'])['Result']
```
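For anyone puzzled by the double `json.loads`: ASP.NET WebMethods wrap the response in an outer object whose `"d"` value is itself a JSON-encoded string. A toy example with a made-up payload shaped like the FindStores response:

```python
import json

# Made-up sample mimicking the response shape: the outer object's "d" key
# holds a JSON *string*, not a JSON object, so it must be decoded twice.
sample = '{"d": "{\\"Result\\": [{\\"Name\\": \\"Example Store\\"}]}"}'

outer = json.loads(sample)       # first decode  -> {'d': '{"Result": ...}'}
inner = json.loads(outer["d"])   # second decode -> {'Result': [...]}
print(inner["Result"][0]["Name"])  # -> Example Store
```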
It's actually a shell script, and I'm ashamed that it isn't working in requests, but this should work well in the meantime:
```shell
wget -q -O - --method POST \
  --header 'Accept: application/json, text/javascript, */*; q=0.01' \
  --header 'Content-Type: application/json; charset=UTF-8' \
  --body-data '{"county":"2","zip":"","miles":"5"}' \
  'https://www.pawic.com/FindWICStores.aspx/FindStores' | jq -r '.d' | jq > wicresults.json
```
Run that in the terminal (no need for Python), or let me know if you want me to wrap it up in an actual shell script to run as a command. This will output the results to a JSON file. If we turn it into a shell script for real for real, we can supply some params like county, zip, miles, and output filename.
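A rough sketch of what that parameterized script could look like (the function and argument names here are my own invention; the defaults match the command above):

```shell
#!/bin/sh
# Hypothetical wrapper around the wget command above.
# Usage: fetch_wic_stores [county] [zip] [miles] [outfile]

build_body() {
  # Build the JSON request body from county ($1), zip ($2), and miles ($3).
  printf '{"county":"%s","zip":"%s","miles":"%s"}' "$1" "$2" "$3"
}

fetch_wic_stores() {
  county="${1:-2}"
  zip="${2:-}"
  miles="${3:-5}"
  outfile="${4:-wicresults.json}"
  wget -q -O - --method POST \
    --header 'Accept: application/json, text/javascript, */*; q=0.01' \
    --header 'Content-Type: application/json; charset=UTF-8' \
    --body-data "$(build_body "$county" "$zip" "$miles")" \
    'https://www.pawic.com/FindWICStores.aspx/FindStores' \
    | jq -r '.d' | jq > "$outfile"
}
```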
@mishugana ok I was able to run this but am not getting any results? It creates a wicresults.json file with nothing in it.
I feel like I am holding up this whole thing with my inexperience! Would you be able to save a json or csv to our repo for me to at least get the data processing work done? Then we can figure out how to better integrate actually running the wget request and pulling the most recent data.
If @mishugana is not able to get into that before the meeting on Wednesday, I propose we look into that as part of the agenda of our meeting. I may also want to learn more about this!
WE ARE DONE!
Come up with a workflow for how to keep things updated so that if two years from now the original people aren't around, we can resolve that stuff.