chihacknight / chn-ghost-buses

"Ghost buses" analysis project through Chi Hack Night
https://github.com/chihacknight/breakout-groups/issues/217
MIT License

[Data] Automate schedule downloads #18

Open lauriemerrell opened 1 year ago

lauriemerrell commented 1 year ago

In addition to scraping realtime data every 5 minutes, we should scrape the GTFS schedule (static) data on a daily basis so we don't have to get historical versions after the fact.

We should write a Lambda function that will scrape the CTA schedule GTFS data from https://www.transitchicago.com/downloads/sch_data/google_transit.zip every day.

Acceptance criteria for this should just be a Python script that will scrape the zipfile as bytes and write it to S3.
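For illustration, a minimal sketch of what such a script might look like. It is not the project's actual implementation: the key prefix `cta_schedule`, the dated file name, and the helper names are hypothetical, and the S3 client is passed in (for example one created with `boto3.client("s3")`) so the logic can be tried without AWS credentials.

```python
import datetime
import urllib.request

SCHEDULE_URL = "https://www.transitchicago.com/downloads/sch_data/google_transit.zip"


def build_key(day: datetime.date, prefix: str = "cta_schedule") -> str:
    """Return a dated object key so each day's download is kept separately.

    The prefix and naming scheme are assumptions, not the project's layout.
    """
    return f"{prefix}/google_transit_{day.isoformat()}.zip"


def fetch_schedule(url: str = SCHEDULE_URL, timeout: float = 60.0) -> bytes:
    """Download the schedule zipfile and return it as raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def save_schedule(s3_client, bucket: str, day: datetime.date, fetch=fetch_schedule) -> str:
    """Fetch the zipfile bytes and write them to S3 under a dated key.

    `s3_client` is expected to behave like a boto3 S3 client, i.e. to
    provide put_object(Bucket=..., Key=..., Body=...).
    """
    body = fetch()
    key = build_key(day)
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return key


# Possible usage (bucket name is a placeholder):
#   import boto3
#   save_schedule(boto3.client("s3"), "my-bucket", datetime.date.today())
```

Passing the client and the fetch function as arguments keeps the script testable locally; the same `save_schedule` call could later be wrapped in a Lambda handler.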

Once that's ready we should make a follow-up ticket to deploy to AWS (has to be done by me, @lauriemerrell) and another follow-up ticket to describe the desired follow-up processing.

mrscraps13 commented 1 year ago

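Wanted feedback on this, please let me know! @lauriemerrell @KyleDolezal

"""
chunk_size = 4096  # bytes per read; value assumed, not given in the original snippet

with open("infile", "rb") as in_file, open("out-file", "wb") as out_file:
    # Copy the file chunk by chunk until read() returns empty bytes (end of file)
    while True:
        chunk = in_file.read(chunk_size)
        if chunk == b"":
            break
        out_file.write(chunk)
"""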

KyleDolezal commented 1 year ago

@mrscraps13 It looks good to me. I can see similar working examples, such as here. Is this code part of a branch? I'm wondering if I could see it in context.

lauriemerrell commented 1 year ago

Agree with @KyleDolezal, looks good but wondering about context. I think that in my day job, where we download feeds, we just use requests: basically requests.get(<SCHEDULE_URL>), and then just save the response content. Here's an example: https://github.com/cal-itp/data-infra/blob/main/airflow/dags/gtfs_downloader/download_data.py#L35-L78. It's a bit hard to follow because there's some other config stuff going on, but maybe helpful?
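A rough sketch of that "get the URL, save the response content" pattern, not taken from the linked file. The function name `download_feed` and the `http` parameter are hypothetical; `http` defaults to the `requests` module and is only there so the function can be exercised without network access.

```python
from pathlib import Path


def download_feed(url: str, dest: Path, http=None, timeout: float = 60.0) -> int:
    """Download `url`, write the raw response bytes to `dest`, return the byte count.

    `http` is anything with a requests-style get(url, timeout=...) method;
    when omitted, the third-party `requests` library is used.
    """
    if http is None:
        import requests  # third-party: pip install requests
        http = requests
    response = http.get(url, timeout=timeout)
    # Fail on 4xx/5xx responses instead of saving an error page as a zipfile
    response.raise_for_status()
    dest.write_bytes(response.content)
    return len(response.content)
```

Here the whole response body is held in memory and written in one go, with no manual chunking. A call might look like `download_feed(SCHEDULE_URL, Path("google_transit.zip"))`, where `SCHEDULE_URL` is the CTA link from the issue description.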

mrscraps13 commented 1 year ago

I'm a bit lost about the 'context': which other pieces? The way I thought about this was reading the file by chunks. Could someone provide a bit more guidance? :)