CharlotteJackson / DC_Crash_Bot


Write script to connect Google sheets and Postgres db #61

Open CharlotteJackson opened 3 years ago

CharlotteJackson commented 3 years ago

What is the Task

Write a script that refreshes a Google sheet at a regular interval with the latest data from the crashes table on our PostGIS db.

Why do we want to do this

So Michael's Tableau visualization will automatically update.

How can I get started?

Check Stack Overflow for examples :) (a rough sketch of one possible approach is at the end of this comment)

Definition of Done

When no manual intervention is needed to refresh the Tableau visualization with the latest data.
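
A minimal sketch of how such a refresh script could work, assuming psycopg2 for the PostGIS connection and gspread with a service-account credential for the sheet. The connection string, sheet key, credential path, and column list below are placeholders, not the repo's actual configuration:

```python
# Hedged sketch only: DSN, sheet key, credential path, and columns are placeholders.
import psycopg2
import gspread

DB_DSN = "postgresql://user:password@localhost:5432/dc_crash_bot"  # placeholder
SHEET_KEY = "your-sheet-key-here"  # placeholder
QUERY = """
    select reportdate, blockkey, smd_id
    from analysis_data.dc_crashes_w_details
    where reportdate between date '2015-01-01 00:00:01' and current_date
    order by reportdate desc
"""

def refresh_sheet():
    # Pull the latest rows from the crashes table on the PostGIS db.
    with psycopg2.connect(DB_DSN) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        header = [col.name for col in cur.description]
        rows = [["" if v is None else str(v) for v in row] for row in cur.fetchall()]

    # Overwrite the sheet with the fresh data so Tableau picks it up.
    gc = gspread.service_account(filename="service_account.json")  # placeholder path
    worksheet = gc.open_by_key(SHEET_KEY).sheet1
    worksheet.clear()
    worksheet.update([header] + rows)

if __name__ == "__main__":
    refresh_sheet()  # run on a cron schedule for regular refreshes
```

Running a script like this on a cron schedule (or similar scheduler) would satisfy the definition of done above.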

banjtheman commented 3 years ago

@eichler4dc where is the Google sheet?

eichler4dc commented 3 years ago

https://docs.google.com/spreadsheets/d/1IAEwjnRUvcGfhbzk_2_5Uso2SdhLbhht_zSdyXXBS0I/edit?usp=sharing

eichler4dc commented 3 years ago

It might be easier for you to simply create your own Google sheet using this query, and I can update my viz to point to it. I'd hate to break what is up and running now.

```sql
select
    reportdate,
    bicycle_fatalities as fatal_bicyclist,
    vehicle_fatalities as fatal_driver,
    pedestrian_fatalities as fatal_pedestrian,
    total_bicyclists as total_bicycles,
    total_pedestrians,
    drivers_impaired as driversimipared,
    drivers_speeding as speeding_involved,
    blockkey,
    0 fatalpassenger,
    left(smd_id, 2) anc_id,
    smd_id,
    st_X(geography) x_coord,
    st_Y(geography) y_coord,
    intersectionid
from analysis_data.dc_crashes_w_details
where reportdate between date '2015-01-01 00:00:01' and current_date
order by reportdate desc
```

banjtheman commented 3 years ago

Was able to create a script that populates this sheet:

https://docs.google.com/spreadsheets/d/1_xDChirhrewq58RAlbKa02RhLQRyJ447zIpqFSGDG7A/edit#gid=1717793491

banjtheman commented 3 years ago

Hmm, Google Drive is converting smd_id values with an E (e.g. "5E07") to a scientific-notation number ("5.00E+7"). To combat that, I added quotes around the value so it shows up as "5E07" in the spreadsheet. Would that break anything in the viz, @eichler4dc?
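
For illustration only, a tiny sketch of what that quoting workaround might look like when building a row for upload; the helper name and row structure are hypothetical, not the actual code in the repo:

```python
def protect_id(value: str) -> str:
    """Wrap IDs like '5E07' in double quotes so Sheets treats them as text
    instead of parsing them as scientific notation (5.00E+7)."""
    return f'"{value}"'

# Hypothetical usage while assembling a row for the sheet:
row = {"smd_id": "5E07", "blockkey": "12300 BLOCK"}
row["smd_id"] = protect_id(row["smd_id"])  # -> '"5E07"'
```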

eichler4dc commented 3 years ago

Let me try this out and let you know!


eichler4dc commented 3 years ago

I was trying this one and received the error I mentioned. I just added my comment to the wrong issue. Same error message: "The Google Sheets service reported an unrecognized error when processing this request. This file is too large to be exported."

As I noted in the other issue, this sheet works fine in Tableau. It was created by uploading a CSV. You have to turn off the option to automatically determine field datatypes to fix the scientific notation problem. https://docs.google.com/spreadsheets/d/1IAEwjnRUvcGfhbzk_2_5Uso2SdhLbhht_zSdyXXBS0I/edit#gid=688603079

I tried manually downloading (exporting) the "dc_crash_bot" data from your gsheet and it worked fine. Not sure what the error might be.

banjtheman commented 3 years ago

My guess is that, because of the extra data (columns and rows) in the dataset, the CSV is larger than the allowed size, which is somewhere between 10 and 12 MB according to https://kb.tableau.com/articles/issue/error-this-file-is-too-large-to-be-exported-when-extracting-data-from-google-sheet:

> Cause
> The ability to connect to Google Sheets greater than 10 MB is currently not built into the product.

Here are the current sizes of the csv files:

19M gdrive_auto_upload.csv               <-- New sheet
12M crash_data_for_gsheet_20210403.csv   <-- Old sheet

Solutions would be to either

  1. remove the extra columns from the query
  2. change the time range to collect less data

eichler4dc commented 3 years ago

Let me experiment a bit. I was thinking that too but something else might be up.

eichler4dc commented 3 years ago

Good call on exceeding the file size. I must have been right up to the limit and am only now exceeding it. Let's try updating the query to this one, which groups by blockkey, the date rounded to the nearest month, and the single-member district ID. It brings the total dataset down a bit, not a lot, but it should be enough to buy us some time. Thanks!!!

```sql
select
    date_trunc('month', reportdate) as reportdate,
    sum(bicycle_fatalities) as fatal_bicyclist,
    sum(vehicle_fatalities) as fatal_driver,
    sum(pedestrian_fatalities) as fatal_pedestrian,
    sum(total_bicyclists) as total_bicycles,
    sum(total_pedestrians) as total_pedestrians,
    blockkey,
    smd_id,
    count(*) crash_count,
    avg(st_X(geography)) x_coord,
    avg(st_Y(geography)) y_coord
from analysis_data.dc_crashes_w_details
where reportdate between date '2015-01-01 00:00:01' and current_date
group by date_trunc('month', reportdate), blockkey, smd_id
order by reportdate desc
```

banjtheman commented 3 years ago

With that query, the sheet is still too big:

15M gdrive_auto_upload.csv

Six years of data is too much; we will need to lower the time range (where reportdate between date '2015-01-01 00:00:01').

Using 2019 as the start date, I was able to get a 5 MB csv:

4.9M    gdrive_auto_upload.csv

banjtheman commented 3 years ago

I have made a WDC (Web Data Connector) to use the data from the query, since Google Drive won't work: http://35.170.77.227:8888/dc_crash_bot.html (a rough sketch of the data-serving side is below the screenshot).

[screenshot]
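
The WDC page itself is HTML/JavaScript; purely as an illustration of the data side, here is a hypothetical Flask endpoint that could serve the query results as JSON for such a connector to fetch. The route name, DSN, and column list are assumptions, not the actual implementation:

```python
# Hypothetical data endpoint for a Tableau Web Data Connector to fetch;
# the route, DSN, and column list are illustrative assumptions only.
from flask import Flask, jsonify
import psycopg2

app = Flask(__name__)
DB_DSN = "postgresql://user:password@localhost:5432/dc_crash_bot"  # placeholder

@app.route("/crash_data")
def crash_data():
    with psycopg2.connect(DB_DSN) as conn, conn.cursor() as cur:
        cur.execute(
            "select reportdate, blockkey, smd_id "
            "from analysis_data.dc_crashes_w_details "
            "order by reportdate desc"
        )
        cols = [c.name for c in cur.description]
        data = [dict(zip(cols, (str(v) for v in row))) for row in cur.fetchall()]
    return jsonify(data)

if __name__ == "__main__":
    app.run(port=8888)  # the WDC page would fetch from this endpoint
```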

eichler4dc commented 3 years ago

Neat. Okay, I was able to load this into Tableau without issue. I was going to upload a test to the server and see if it auto-refreshes like we want, but Tableau froze on me. Might be an incompatibility between my new Mac (Apple chip) and Tableau not being native to that processor yet. I'm thinking that if we don't have the file size constraints of using Google Sheets, we could flip back to the previous version of the query.

```sql
select --*,
    reportdate,
    bicycle_fatalities as fatal_bicyclist,
    vehicle_fatalities as fatal_driver,
    pedestrian_fatalities as fatal_pedestrian,
    total_bicyclists as total_bicycles,
    total_pedestrians,
    drivers_impaired as driversimipared,
    drivers_speeding as speeding_involved,
    blockkey,
    0 fatalpassenger,
    left(smd_id, 2) anc_id,
    smd_id,
    st_X(geography) x_coord,
    st_Y(geography) y_coord,
    intersectionid
from analysis_data.dc_crashes_w_details
--where reportdate between date '2021-04-15 03:35:31' and current_date
where reportdate between date '2015-01-01 00:00:01' and current_date
order by reportdate desc
```

banjtheman commented 3 years ago

Updated

[screenshot]

eichler4dc commented 3 years ago

Oh, great! This is mostly working. Two issues:

1) When I can see the dates properly, there are some outside my specified date range. Are you keeping the date where clause I passed along? Between 1/1/2015 and the system date?

2) Can you verify that ReportDate is being formatted as such: 2017-04-18 22:20:52+00. When I pull this data in directly, my existing viz converts all the dates to 1899-12-30 00:00:00 for some reason. I think it's because I had the dates formatted one way in the past and perhaps you are passing them along in a different format? Update: Oh, I think your dates have an extra ":00" at the end. Can you trim that off?

Thanks!!!

CharlotteJackson commented 3 years ago

  1. The crash data analysis table should actually already be limited to dates after 2015-01-01; can you send me some examples of ones that are outside the range?
  2. The ReportDate field (and all timestamp fields) includes a time zone. If that format is messing up your viz, we can update the query to cast it as a date and also convert it to US Eastern time: `cast(reportdate at time zone 'America/New_York' as date) as reportdate`

banjtheman commented 3 years ago

I just did a push to format reportdate for the viz.

[screenshot]

eichler4dc commented 3 years ago

Thanks for this. I tried dropping this new data into the viz and encountered a set of errors. I might need to recreate the viz at this point. It would be easier and faster than continuing to debug why this new dataset from the web data connector is incompatible with my old one. Should only take me an hour or two.

eichler4dc commented 3 years ago

Before I decide to rebuild entirely, could we try adding the report date back in as another field, this time in text format? Here's the updated query:

```sql
select --*,
    reportdate,
    bicycle_fatalities as fatal_bicyclist,
    vehicle_fatalities as fatal_driver,
    pedestrian_fatalities as fatal_pedestrian,
    total_bicyclists as total_bicycles,
    total_pedestrians,
    drivers_impaired as driversimipared,
    drivers_speeding as speeding_involved,
    blockkey,
    0 fatalpassenger,
    left(smd_id, 2) anc_id,
    smd_id,
    st_X(geography) x_coord,
    st_Y(geography) y_coord,
    intersectionid,
    to_char(reportdate, 'YYYYMMDD HH24:MI:SS') report_date_string
from analysis_data.dc_crashes_w_details
where reportdate between date '2015-01-01 00:00:01' and current_date
order by reportdate desc
```

Thanks!

banjtheman commented 3 years ago

Updated.

[screenshot]

banjtheman commented 3 years ago

Ok @eichler4dc, I think I may have fixed the issue.

I found all the supported Tableau date formats here: https://tableau.github.io/webdataconnector/docs/wdc_ref_date_formats.html. The +00 was invalid and was causing everything to be treated as a string. I updated the script to use the correct format, and it looks like the connector now accepts the datetime format correctly.
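
A minimal sketch of the kind of timestamp reformatting that fix implies, assuming the script post-processes the strings coming out of Postgres; the function name and sample value are illustrative only:

```python
from datetime import datetime

def to_tableau_datetime(pg_timestamp: str) -> str:
    """Turn a Postgres string like '2017-04-18 22:20:52+00' into
    'yyyy-MM-dd HH:mm:ss', one of the formats the WDC accepts."""
    # Pad the short '+00' offset to '+0000' so %z can parse it, then drop the tz.
    dt = datetime.strptime(pg_timestamp + "00", "%Y-%m-%d %H:%M:%S%z")
    return dt.strftime("%Y-%m-%d %H:%M:%S")

print(to_tableau_datetime("2017-04-18 22:20:52+00"))  # -> 2017-04-18 22:20:52
```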

eichler4dc commented 3 years ago

This is great news. I was driving myself crazy trying to figure this out.

Before seeing this update, I pulled the data from the WDC into a new workbook and found it's only partial data. See below. I just queried the underlying data via pgAdmin and it matches, so the underlying data itself is missing! @CharlotteJackson was there a change to the data loader? Or is it missing from the DC Open Data portal as well?

Perhaps I should show folks how easy it is to validate whether the WDC is working as expected, so you can debug it yourselves instead of waiting for me to try it out in my viz?

[screenshot]

eichler4dc commented 3 years ago

[screenshot]

eichler4dc commented 3 years ago

ZOMG it works. @banjtheman , thank you!!! I have to figure out one more thing before I publish it in place of my old dashboard, but here's the sneak peek: https://public.tableau.com/profile/michael.eichler#!/vizhome/DCCrashesbyBlock2010to2021-live/CrashesbyANCBlock

CharlotteJackson commented 3 years ago

Awesome! Are we good to close this issue?