CharlotteJackson / DC_Crash_Bot

10 stars 7 forks source link

Ingest a Google sheet with traffic fatalities information #65

Closed CharlotteJackson closed 3 years ago

CharlotteJackson commented 3 years ago

What is the Task

Set up a pipeline to a Google sheet with up-to-date fatalities information

Why do we want to do this

The official crashes dataset does not accurately reflect fatality counts

How can I get started?

Identify the google sheet to use, or create one ourselves

Definition of Done

When there's an automated pipeline into the crashes database

banjtheman commented 3 years ago

This google sheet has information that can attempt to scrape https://docs.google.com/spreadsheets/d/1uCDD5ox8t4WDNB377PomsS3PS_QInaoE4_lGmqtxb-M/edit#gid=0

Once data is collected need to figure out where to ingest into db

banjtheman commented 3 years ago

Script can be run, data uploaded to s3 as well

eichler4dc commented 3 years ago

Unfortunately I can't open this google sheet in tableau. Error message: The Google Sheets service reported an unrecognized error when processing this request.
This file is too large to be exported.

Here's the URL to the sheet that I can successfully open. Is there any way to do a "diff" of the two to see what the issue is? Perhaps the one you're exporting programmatically has extra rows or columns in it that mine doesn't have when exported from Postgres via pgAdmin csv export.

eichler4dc commented 3 years ago

sorry, i'm an idiot. submitted feedback on the wrong issue.

eichler4dc commented 3 years ago

hi. i'm noticing this data in the Source Data schema hasn't been updated since 4/27/2021. Is this data load not automated yet? Or did something change to disrupt the data flow?

banjtheman commented 3 years ago

hi. i'm noticing this data in the Source Data schema hasn't been updated since 4/27/2021. Is this data load not automated yet? Or did something change to disrupt the data flow?

what data source are you referring to? Can you provide a URL?

eichler4dc commented 3 years ago

sorry, this is for the Families for Safe Streets data that comes from a google sheet and gets loaded into source_data.dc_fss table

source of data is here: https://docs.google.com/spreadsheets/d/1uCDD5ox8t4WDNB377PomsS3PS_QInaoE4_lGmqtxb-M/edit#gid=0

banjtheman commented 3 years ago

The data was being uploaded to a private bucket, just added a script to upload the data to a public bucket as well

https://dc-crash-bot-public.s3.amazonaws.com/source-data/dc-fss/dc_fss.csv

CharlotteJackson commented 3 years ago

i need to throw a bash script/cron job on the EC2 instance that pulls the csv into the database every night! Doing that now. Also Banjo about to shoot you the certificate and the URL to it.

On Mon, Jul 26, 2021 at 9:41 PM Banjo Obayomi @.***> wrote:

The data was being uploaded to a private bucket, just added a script to upload the data to a public bucket as well

https://dc-crash-bot-public.s3.amazonaws.com/source-data/dc-fss/dc_fss.csv

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/CharlotteJackson/DC_Crash_Bot/issues/65#issuecomment-887143122, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKSPWEVYOGHDTKBBVFHCYBTTZYFD3ANCNFSM427XNEWA .