cal-itp / mobility-marketplace

A one-stop resource for modern, equitable and accessible public transit products and services.
https://www.camobilitymarketplace.org/
GNU Affero General Public License v3.0
5 stars 1 forks source link

Automatically pull the Transit Agency dataset from the Cal-ITP warehouse #518

Closed allejo closed 6 months ago

allejo commented 7 months ago

This PR focuses almost exclusively on automating the fetching of Transit Agency (TA) provider data from the Cal-ITP data warehouse. It also updates the Provider Map page to fit the new data structure from the warehouse and some content updates.

Closes https://github.com/cal-itp/data-infra/issues/28

Workflow Overview

How the Worfklow works

  1. The first thing this workflow takes care of is authenticating with Google with a dedicated service account, whose access key is stored in this repo's GCP_SA_KEY environment variable.
  2. After we've authenticated, I setup and install Google Cloud SDK because we need the bq CLI tool to be able to download data from the warehouse
  3. I use the bq tool to download the mart_transit_database.dim_mobility_mart_providers table as a CSV file.
  4. Small caveat, due to issue #315160970, some warning messages end up polluting the CSV file, making it unparseable as CSV.
  5. There is a workaround step with a sed command that deletes everything from the start of the file up until we find the name of the first column of the dataset (i.e. agency_name).
  6. I then set up Python and Pandas, to do some quick data transformation and calculations in an embedded Python script.
  7. I parse the providers.csv file that was downloaded from the warehouse and lookup to see what TAs have an empty counties_served column. I then use the cities_to_county.csv to have a lookup table of what counties a given city belongs to. This is necessary since the warehouse does not have counties_served and hq_county for every TA.

    I am working off the assumption that if a TA has its HQ in a specific city, that city can be assumed to be one the TA services.

  8. After the data has been readied, I count every unique county across the counties_served column and update the counties.geojson file with that information.
  9. Then I open a PR automatically with the TWO files I know should be touched in this workflow. See #513 for an example of this workflow succeeding

    Word of caution: do NOT blindly do a git add . (or its equivalent) in this PR. The google-github-actions/auth action will write your credentials to a file on the workflow's disk. If you accidentally add that, it'll cause you to expose the access key; which I did in #512 by accident.

What's left to do

What's next?

Not sure! /cc @thekaveman @machikoyasuda @angela-tran

netlify[bot] commented 7 months ago

Deploy Preview for cal-itp-mobility-marketplace ready!

Name Link
Latest commit 6c33ce11eefd1ab60f7176dde5a5a73a2257d01c
Latest deploy log https://app.netlify.com/sites/cal-itp-mobility-marketplace/deploys/65dec59ae38b7a000743f524
Deploy Preview https://deploy-preview-518--cal-itp-mobility-marketplace.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

mrose914 commented 7 months ago

@allejo thank you for this! One thing I notice is that the column names aren't showing up in the table:

Screenshot 2024-01-25 at 3 59 54 PM

Beyond that, I'd like to review with @thekaveman and determine what tweaks can be made within the existing design to support the new backend. I will also write copy for the blurb on the bottom of the page about how the data is produced/shown.

allejo commented 7 months ago

Since my last update to this PR, I've changed the layers.json for the state map in ddf54d4 (#518).

image

A few notes on my changes here. It appears the 5 represents now many different steps or shades there will be between the starting and ending colors. I changed the 0 from its original value of 1; I'm not sure what this does but I know it allows for better shading between colors.

"fill-color": [
  "interpolate",
  ["linear"],
  ["get", "num_providers"],
  0,
  "hsl(208, 36%, 78%)",
  5,
  "hsl(260, 53%, 36%)"
]

I also updated the column names so they appear on the Transit Agency table now.


The data dictionary is working, but it looks ugly. Not sure if you want to address that in this PR or in a future PR with a new design.

image

mrose914 commented 7 months ago

@allejo thank you! This looks better. Let me take a pass at cleaning up the glossary and the Source/About Data sections today. Once we adjust the content/copy, we should be good to push this out. I don't see a need for any design tweaks for this version.

One follow up question, what mechanisms/controls do we have in place in the event the connection between the source data and the website goes down or is otherwise compromised? Want to make sure we have a way to detect issues with the view since we are now using it on the live site. And apologies if you already answered this - just need a little refresher :)

allejo commented 7 months ago

One follow up question, what mechanisms/controls do we have in place in the event the connection between the source data and the website goes down or is otherwise compromised? Want to make sure we have a way to detect issues with the view since we are now using it on the live site. And apologies if you already answered this - just need a little refresher :)

No apologies necessary! This is a good question that I haven't answered yet. This PR introduces a GitHub Action that runs on a schedule, every Sunday at midnight UTC (which translates to Saturdays 4pm). Should this process fail, we can trigger a Slack notification or something, depending on what is currently set up. @thekaveman would have a better idea on how we can set up notifications

mrose914 commented 7 months ago

Updates:

For intro blurb: From buses, shuttles, and on-demand paratransit to commuter and intercity rail, passengers want transportation that’s easy to find and easy to pay for.

To better understand the size and scale of opportunities for integration in transportation, the Cal-ITP team compiled a list of California’s transit providers.

Explore the data below, which links to each operator’s website and offers just a taste of our full dataset—a CSV download that includes funding information, fleet size, GTFS information, and more.

Download the data package The data package includes the full dataset, data dictionary, and glossary as CSV files.

For Data Table

For About the Data

This dataset includes [a regularly updated view of public......@evansiroky I'd love your input here. I'm looking for a blurb that describes, in laymans terms, that we maintain a particular definition of agencies that can be seen here. Something that communicates that our goal is to move to common way to view/describe "services" or "agencies" in the state to facilitate ease of analysis and other projects..]. Each provider entry includes its headquarter city, counties where active, the associated National Transit Database (NTD) ID (if available), the website URL, and key attributes from the NTD. It is currently limited to providers who operate their own services.

This list is a living document that will be updated regularly with new service provider information. If you have comments, additions, or corrections to this dataset, we’re working on ways for you to share that feedback. Please check back soon.

For Data Dictionary

For Source

This view is pulling from Cal-ITP's data warehouse. We aim to maintain data on all providers within the state in order too... [@evansiroky could use some help here on how we describe the warehouse in laymans terms.]

evansiroky commented 7 months ago

I'm a little concerned that this uses a scheduled GitHub action instead of it being a new DAG task within Airflow. Is there a reason we aren't using Airflow in this case? Update: I can now see how this is a relatively contained workflow that is very specific to this website.

evansiroky commented 7 months ago

I'm concerned about some transit agencies not appearing in the preview site at https://deploy-preview-518--cal-itp-mobility-marketplace.netlify.app/ For example: Burbank Bus, City of Fresno, SFMTA.

mrose914 commented 7 months ago

@evansiroky @allejo let me know if you'd like to find time to discuss the query - I thought we tied all our numbers out to match the query to our spreadsheet some months back.

allejo commented 6 months ago

@mrose914 I've made the content updates you requested! and I changed how the Data Dictionary section displays, it's no longer a table @evansiroky great catch! Turns out the way I was fetching the providers were limited to just 100 entries. I've updated it so that it returns everything and now Fresno is listed in there


Technical change, I've modified the CSV to JSON plugin so that it reads the CSV files into a site.data.csv variable. I am now generating the data dictionary statically at build time, which gives me better control of how to render it.

evansiroky commented 6 months ago

Should I be seeing all of this data on the preview website? I'm only seeing one entry for Fresno County when there should be more.

image

allejo commented 6 months ago

Oh man, sorry for all the back and forth. Okay, I think this are displaying as they should be 🤞

image

evansiroky commented 6 months ago

I think this looks good enough now. It's a little weird that City of Camarillo is showing up in Santa Clara County, but I think it's good enough for now.

mrose914 commented 6 months ago

great! @evansiroky two questions:

evansiroky commented 6 months ago

@mrose914 , I'm kind of just eyeballing and clicking around to check things, I'm not doing any kind of data comparisons.

What do you mean by copy? The changes I see here seem to be related to the list of providers that show up on the map.

mrose914 commented 6 months ago

@evansiroky got it, thanks.

Re: copy, see the headings in this comment: https://github.com/cal-itp/mobility-marketplace/pull/518#issuecomment-1922482207

There is some content describing the data on the page that needs to be updated to describe the latest data source(s). I could use some help in how we want to describe our data warehouse and this view to those who visit the page.

evansiroky commented 6 months ago

@evansiroky got it, thanks.

Re: copy, see the headings in this comment: #518 (comment)

There is some content describing the data on the page that needs to be updated to describe the latest data source(s). I could use some help in how we want to describe our data warehouse and this view to those who visit the page.

@mrose914 I see it now. Here is some proposed wording for this list of transit agencies:

This dataset includes all publicly-funded transit agencies in California that provide fixed-route service that is available for the general public to ride without advance reservations.

The rest of the copy looks great!

allejo commented 6 months ago

Alright @mrose914, that should do it for the last copy edit! 🤞

mrose914 commented 6 months ago

Thanks @allejo - copy is looking good. I'm inclined to remove the "Source" section at the bottom of the page given the content above and CTA to contact us for more info. @evansiroky Are you good with that or would you like to add something about how the list was developed?

evansiroky commented 6 months ago

Thanks @allejo - copy is looking good. I'm inclined to remove the "Source" section at the bottom of the page given the content above and CTA to contact us for more info. @evansiroky Are you good with that or would you like to add something about how the list was developed?

Yep, fine with that. We can save the detailed dataset documentation for stuff we put on the CA Open Data Portal.

allejo commented 6 months ago

@mrose914 I've updated the page to remove the source section ✅