CityOfLosAngeles / public-health-prototype

prototype metrics + dashboards around public health
Apache License 2.0

transferring ETL to Civis #45

Closed sherryshenker closed 4 years ago

sherryshenker commented 4 years ago

Thread to discuss which (if any) ETL processes should be transferred to Civis. I think in order for us to make recommendations, we'd first like to get a better understanding of what each of the ETL processes is composed of (sources, transformations), current pain points, complexity, and any anticipated issues with transferring / rewriting. @hunterowens

hunterowens commented 4 years ago

The main dataset that is ETL'd in this process is sections of the 311 dataset.

I think the best way forward is to figure out how to go from the full Socrata 311 dataset to Redshift, and then add views as needed?
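Adding per-year views on top of a single full table in Redshift could look something like the sketch below. This is only an illustration: the schema name, table name, and `createddate` column are assumptions, not taken from the actual 311 schema.

```python
# Sketch: generate per-year view DDL over a full 311 table in Redshift.
# Schema/table "la_311.service_requests" and column "createddate" are assumptions.

def year_view_ddl(year: int, table: str = "la_311.service_requests") -> str:
    """Return a CREATE VIEW statement exposing one calendar year of 311 requests."""
    return (
        f"CREATE OR REPLACE VIEW la_311.service_requests_{year} AS\n"
        f"SELECT * FROM {table}\n"
        f"WHERE EXTRACT(year FROM createddate) = {year};"
    )

if __name__ == "__main__":
    # Emit DDL for a few recent years; run the output against Redshift separately.
    for year in range(2016, 2020):
        print(year_view_ddl(year))
```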

Otherwise, all DAGs should be in the public-health folder on aqueduct.

sherryshenker commented 4 years ago

Yes I think that makes sense. I have some code for pulling from Socrata that currently works with public data sets. Is this data set private?
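For reference, the authenticated CSV export can be scripted with only the Python standard library. The dataset ID is the one that appears later in this thread; the environment variable names are assumptions, not an established convention in this repo.

```python
# Sketch: authenticated CSV export of a (possibly private) Socrata dataset.
# Dataset ID 69nn-wkid is from this thread; env var names are assumptions.
import base64
import os
import urllib.request

EXPORT_URL = "https://data.lacity.org/api/views/{dataset_id}/rows.csv?accessType=DOWNLOAD"

def export_request(dataset_id: str, username: str, password: str) -> urllib.request.Request:
    """Build a request for a dataset's full CSV export using HTTP basic auth."""
    credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(EXPORT_URL.format(dataset_id=dataset_id))
    request.add_header("Authorization", f"Basic {credentials}")
    return request

if __name__ == "__main__":
    # Network call kept out of import time; credentials come from the environment.
    req = export_request(
        "69nn-wkid",
        os.environ["SOCRATA_USERNAME"],
        os.environ["SOCRATA_PASSWORD"],
    )
    with urllib.request.urlopen(req) as resp, open("/tmp/service-requests.csv", "wb") as out:
        out.write(resp.read())
```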

For other processes outside of the dashboard, are there some high priority connections you'd want to start with?

hunterowens commented 4 years ago

The "main" 311 dataset is private, with each year being a view. You can see a WIP DAG in Cityoflosangeles/aqueduct#130 that uses the Socrata API plus some keys to download it.

sherryshenker commented 4 years ago

Got it, looks like that's happening here. I think we'll need to add the username, password, and API token as credentials in Platform. We have a general account (cityoflosangelesrobot@civisanalytics.com) that will be used for all production code. Would it be possible for you to share a set of credentials with me, so that I can add them to the account and test the API pull?

hunterowens commented 4 years ago

@sherryshenker yes, but that code is due for a refactor.

I can share the Socrata username/password by adding credentials to Civis, right?

hunterowens commented 4 years ago

i.e., we should probably consider the existing code busted. I think the best approach is to use the shell script in the PR I linked above, but push into Redshift first, then ESRI / PostGIS / models / whatever.
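Pushing into Redshift first would typically mean staging the downloaded CSV on S3 and issuing a COPY. A minimal sketch of generating that load statement, where the table, bucket path, and IAM role are all placeholder assumptions:

```python
# Sketch: build a Redshift COPY statement to load a CSV staged on S3.
# Table, S3 path, and IAM role in the usage example are placeholders.

def copy_statement(table: str, s3_path: str, iam_role: str) -> str:
    """Return a Redshift COPY statement for a headered CSV on S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "CSV IGNOREHEADER 1;"
    )

if __name__ == "__main__":
    print(copy_statement(
        "la_311.service_requests",
        "s3://example-bucket/service-requests.csv",
        "arn:aws:iam::123456789012:role/redshift-load",
    ))
```

ESRI / PostGIS layers and models could then read from Redshift (or from the same staged files) rather than hitting Socrata directly.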

hunterowens commented 4 years ago

the big one-liner here is


# double quotes so the shell expands the credential variables
curl "https://data.lacity.org/api/views/69nn-wkid/rows.csv?accessType=DOWNLOAD" \
     -u "$SOCRATA_USERNAME:$SOCRATA_PASSWORD" \
     -o /tmp/service-requests.csv

which downloads all the 311 requests to date as a CSV

hunterowens commented 4 years ago

this is complete.