GreenInfo-Network / nyc-crash-mapper-etl-script

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.

Create monthly database backup task #19

Open danrademacher opened 5 years ago

danrademacher commented 5 years ago

Python script to do the following:

Then


danrademacher commented 5 years ago

@fahadkirmani for this task, take a look at the CARTO SQL API: https://carto.com/developers/sql-api/

You can see how we're doing the incremental updates in https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/blob/master/main.py and another set of tasks over at https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/tree/master/fixtallies

So for this we'd end up with a monthlybackup.py that would run using Heroku Scheduler. The main challenge here I think is going to be dealing with timeouts or limits on the SQL API given the size of the table. We're not sure yet if duplicating the table within CARTO will run into those or be exempt. Part of the work is to see if that's an issue and then figure out workarounds.
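As a starting point, a monthlybackup.py could look roughly like the sketch below. This is just one way it might work; the backup table naming scheme and the single `CREATE TABLE ... AS` copy are my assumptions, not settled design, and the timeout question flagged above applies directly to that copy statement.

```python
# Hypothetical sketch of monthlybackup.py: copy the production table on
# CARTO via the SQL API. The env variable name and backup-table naming
# scheme are assumptions for illustration.
import json
import os
import urllib.parse
import urllib.request
from datetime import date

CARTO_USER = "chekpeds"
SQL_API = "https://{0}.carto.com/api/v2/sql".format(CARTO_USER)

def backup_table_name(today):
    """Name the monthly backup after the year and month, e.g. ..._2019_09."""
    return "crashes_all_prod_backup_{0:%Y_%m}".format(today)

def run_sql(query, api_key):
    """POST one statement to the CARTO SQL API and return the parsed JSON."""
    body = urllib.parse.urlencode({"q": query, "api_key": api_key}).encode()
    with urllib.request.urlopen(SQL_API, data=body) as resp:
        return json.load(resp)

def monthly_backup():
    api_key = os.environ["CARTO_API_KEY"]  # Heroku config var
    backup = backup_table_name(date.today())
    run_sql('DROP TABLE IF EXISTS "{0}"'.format(backup), api_key)
    # A single CREATE TABLE ... AS on a table this size may hit the SQL API
    # timeout -- that is exactly the open question in this ticket.
    run_sql('CREATE TABLE "{0}" AS SELECT * FROM crashes_all_prod'.format(backup), api_key)

if __name__ == "__main__":
    monthly_backup()
```

If the one-shot copy times out, the fallback would presumably be chunked INSERT ... SELECT statements, which is worth testing early.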

The CARTO credentials are in the env variables that you should be able to access in Heroku.

Let's use this issue for discussion and findings, and then we can pull in my lead dev once you've gotten to the point of having questions.

fahadkirmani commented 5 years ago

@danrademacher I have gone over the links you mentioned and configured the environment on my Ubuntu 16.04 (Xenial) machine. When I try running any query against CARTO (https://carto.com/developers/sql-api/) I get an error, which I think is because of the CARTO_API_KEY; I need that key to fetch the CARTO data so I can compare it against the SODA source. I have also installed the Heroku CLI on the machine and am logged in to Heroku, and I have cloned the nyc-crash-mapper-etl-script Python code. I need some help moving forward with running these scripts in my environment.

danrademacher commented 5 years ago

You can find the CARTO API key under "config vars" here: https://dashboard.heroku.com/apps/nyc-crash-mapper-etl/settings

fahadkirmani commented 5 years ago

I am trying the request https://{username}.carto.com/api/v2/sql?q={SQL statement}&api_key={api_key} to get the results, but I am getting the message {"error":["Unauthorized"]}. I am using the CARTO API key you mentioned above, with the username and table name from the 2-update_carto.py script. Are these credentials okay?

danrademacher commented 5 years ago

Those credentials should work, since they are the ones running the daily script (which I just confirmed in the Heroku logs is running fine), so let's have @gregallensworth take a look at this and see if he can get you going. Gregor, I have a new job to tie this time to: CHEKPEDS:New Dev Training

gregallensworth commented 5 years ago

Working for me. I wrote a quick shell script which builds a URL to run SELECT COUNT(*) FROM crashes_all_prod, and confirmed that it returns the expected JSON, and that if I supply an incorrect APIKEY I do not get the expected JSON.

#!/bin/sh
# Quick check of the CARTO SQL API. The APIKEY below is a placeholder;
# substitute the real value from the Heroku config vars.

APIKEY="ABCXYZ"
USERNAME="chekpeds"
SQL="SELECT COUNT(*) FROM crashes_all_prod"

URL="https://$USERNAME.carto.com/api/v2/sql?api_key=$APIKEY&q=$SQL"

echo "Connecting..."
wget --quiet -O - "$URL"
echo ""
echo "Done"

The expected output is a row count (which will increase as more crashes are added each day):

{"rows":[{"count":1548595}],"time":0.597,"fields":{"count":{"type":"number"}},"total_rows":1}

@fahadkirmani If you run a simple SELECT query as I did, does it work or not? If it does, then the issue would be permissions for your given SQL statement. If it does not, then you may have copied the API key incorrectly.
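The same check can be sketched in Python, which sidesteps any URL-escaping questions by letting the standard library encode the query string; the username and key below are placeholders:

```python
# Build a CARTO SQL API GET URL with a properly encoded query string.
# urlencode handles the spaces and punctuation in the SQL.
import urllib.parse

def carto_sql_url(username, api_key, sql):
    """Return the full SQL API URL for the given account, key, and query."""
    qs = urllib.parse.urlencode({"q": sql, "api_key": api_key})
    return "https://{0}.carto.com/api/v2/sql?{1}".format(username, qs)

# Example:
# carto_sql_url("chekpeds", "ABCXYZ", "SELECT COUNT(*) FROM crashes_all_prod")
```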

fahadkirmani commented 5 years ago

@gregallensworth Yes, I got a successful response for the request https://$USERNAME.carto.com/api/v2/sql?api_key=$APIKEY&q=$SQL

{"rows":[{"count":1548595}],"time":0.499,"fields":{"count":{"type":"number"}},"total_rows":1}

Thanks for the help.

danrademacher commented 5 years ago

@fahadkirmani have you been making progress here? Based on the last comment, it seemed like we resolved the API access issue. Have you been working on this since?

fahadkirmani commented 5 years ago

@danrademacher Yes, I got it working. I completed the https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/tree/master/fixtallies task and successfully updated the CARTO data for the period 2019-01-01 to 2019-03-28. I also wrapped the process in a shell script so it can all be run in one step, for my ease. Sample output:

python 2-update_carto.py
Load crash_diffs.csv
4099198 {"rows":[],"time":0.791,"fields":{},"total_rows":1}
4098296 {"rows":[],"time":0.636,"fields":{},"total_rows":1}
4059929 {"rows":[],"time":0.626,"fields":{},"total_rows":1}
4097867 {"rows":[],"time":0.624,"fields":{},"total_rows":1}
4087063 {"rows":[],"time":0.62,"fields":{},"total_rows":1}
4070120 {"rows":[],"time":0.631,"fields":{},"total_rows":1}
4096730 {"rows":[],"time":0.631,"fields":{},"total_rows":1}
4096994 {"rows":[],"time":0.609,"fields":{},"total_rows":1}
4092290 {"rows":[],"time":0.61,"fields":{},"total_rows":1}
4071106 {"rows":[],"time":0.634,"fields":{},"total_rows":1}
4093707 {"rows":[],"time":0.624,"fields":{},"total_rows":1}
4096017 {"rows":[],"time":0.62,"fields":{},"total_rows":1}
4097447 {"rows":[],"time":0.652,"fields":{},"total_rows":1}
4098281

fahadkirmani commented 5 years ago

Hi @danrademacher ,

These two socrata_id values (4181883, 4025853) are not present in CARTO, but they are there in SODA as unique_key (4181883, 4025853). I have updated the records, but I think the scripts I have only update existing CARTO data from SODA and do not insert new records into CARTO.

09/10/2018 --> 4181883
04/18/2018 --> 4025853

(screenshot: https://postimg.cc/PL85vkmq)

Can you check why some IDs are getting skipped? I don't see any missing information for those two records in SODA, and the lon/lat values are present for both rows.

gregallensworth commented 5 years ago

The leading reason that a crash would not have been loaded into CARTO is that some crashes are not logged into Socrata for weeks or even months after the fact.

Original versions of the ETL script ran daily and looked for crashes dated the prior day (e.g. today, it would query date >= '2019-08-26'). We didn't notice for quite some time that this was missing crashes. The ETL script these days looks back two months, as a balance between catching slow data entry and trying to digest so much data that the script crashes.
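The look-back filter described above can be sketched as a small helper; the exact window length and the Socrata field name ("date") are assumptions for illustration:

```python
# Sketch of the ETL's look-back window: build the SoQL-style $where filter
# for crashes dated within the last N days (default ~2 months).
from datetime import date, timedelta

def lookback_where(today, days=60):
    """Return a filter like "date >= '2019-06-28'" for the given window."""
    cutoff = today - timedelta(days=days)
    return "date >= '{0:%Y-%m-%d}'".format(cutoff)
```

With days=1 this reproduces the original prior-day behavior; with the default it covers the two-month window.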

Crash 4181883 definitely fits into that profile. They waited 11 months to enter it, and we look back only 2 months.

"date": "2018-09-10T00:00:00.000",
":created_at": "2019-08-26T23:02:03.529Z"
":updated_at": "2019-08-26T23:02:44.636Z"

Crash 4025853 they waited 16 months:

"date": "2018-04-18T00:00:00.000",
":created_at": "2019-08-26T23:05:17.894Z",
":updated_at": "2019-08-26T23:05:58.222Z",

In order to address these, I wrote the backlog/ script, which you'll find in this same repository. It accepts a given year and month (e.g. "2018-09") and specifically scans Socrata for records in that month which are not in CARTO.
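The shape of that scan, as described, is roughly the following; the helper names and month-range handling here are illustrative, not the repo's actual code:

```python
# Illustrative sketch of the backlog/ scan: given "YYYY-MM", work out the
# month's date range, fetch the unique_keys Socrata has for that range and
# the socrata_ids CARTO has, then diff the two sets.
import calendar

def month_range(yearmonth):
    """'2018-09' -> ('2018-09-01', '2018-09-30')"""
    year, month = (int(part) for part in yearmonth.split("-"))
    last_day = calendar.monthrange(year, month)[1]
    return ("{0:04d}-{1:02d}-01".format(year, month),
            "{0:04d}-{1:02d}-{2:02d}".format(year, month, last_day))

def missing_from_carto(socrata_keys, carto_ids):
    """unique_keys present in Socrata for the month but absent from CARTO."""
    return sorted(set(socrata_keys) - set(carto_ids))
```

Running that diff for 2018-09 and 2018-04 would surface exactly the two late-arriving crashes discussed above.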