GTFS Schedule table with historical shapes data

edasmalchi commented 2 years ago

@tiffanychu90 and I have been doing a lot of spatial analysis using GTFS data. To do this effectively we need shapes, which currently only exists in GTFS Schedule Latest. This means we can only currently do spatial analyses on the most recent feeds.

This means we can't do certain kinds of analyses (for example, an analysis of how High Quality Transit Areas have changed in recent years), and also makes it challenging/impossible to validate or reproduce analyses we've done as recently as a few weeks ago.

Describe the solution you'd like Table(s) with GTFS shapes data alongside existing dim/fact tables with historical GTFS schedule data currently in Warehouse Views

Describe alternatives you've considered Continuing to have only the latest shapes data and being unable to do certain spatial analyses.

Additional context Tangentially related to https://github.com/cal-itp/data-infra/issues/1139?

lauriemerrell commented 2 years ago

More than tangentially related to #1139! This can be addressed pretty easily as part of #1137. Hope to land it maybe next week! 🤞

lauriemerrell commented 2 years ago

(Just to confirm as context -- this is data that we already fully have in gtfs_schedule_type2.shapes, we just weren't producing cleaned versions of that or exposing them as views.)

edasmalchi commented 2 years ago

Oh awesome, I'm not so familiar with the type2 tables so wasn't sure how much effort this would be

tiffanychu90 commented 2 years ago

Oooh, also wasn't aware of these tables. I think we can spin up some notebooks to see if it's good as-is, or some work is needed to store a more processed version of this based on this shared utils function.

@atvaccaro mentioned that possibly BigQuery can't support something complex like a LineString or MultiLineString in a way that we have with geopandas. It's more the fact that assembling those points into lines for all agencies takes ~40 min, and it'd be nice to be able to query a given date and grab the associated calitp_itp_id, calitp_url_number, shape_id, associated line geometry.

lauriemerrell commented 2 years ago

I have written this query and I think it does what we want; I need to double check and actually test it but it at least superficially works and creates things of the desired data types AFAICT. I'll be working on this ~next week to actually incorporate this into like a dim_shapes type thing.

@tiffanychu90 or @edasmalchi if you want to experiment with this in BigQuery in the meantime, I think that's fine too -- note I have the LIMIT 100 in the top query to keep the size small, but if you wanted to filter to a specific itp_id + url_number + calitp_extracted_at that would work to get a unique agency (and add shape_id for specific shape). Note that you shouldn't just filter on shape_id without calitp_extracted_at because this table has all versions of each shape_id over time so that could get funky.

with lat_long as (
    SELECT 
        calitp_itp_id,
        calitp_url_number,
        calitp_extracted_at,
        shape_id,
        shape_pt_sequence,
    ST_GEOGPOINT(CAST(shape_pt_lon AS FLOAT64), CAST(shape_pt_lat AS FLOAT64)) as pt_geom
    FROM `cal-itp-data-infra.gtfs_schedule_type2.shapes`
    LIMIT 100
)

SELECT 
    calitp_itp_id,
    calitp_url_number,
    shape_id,
    ST_MAKELINE(ARRAY_AGG(pt_geom ORDER BY shape_pt_sequence)) as line_geom
FROM lat_long
GROUP BY 
    calitp_itp_id,
    calitp_url_number,
    calitp_extracted_at,
    shape_id

tiffanychu90 commented 2 years ago

@lauriemerrell Would this query be the right way to make sure that the date we want grabs the associated shape_id/line geometry that day? In views.gtfs_schedule_fact_daily_trips, we usually select service_date. Just want to double check that using >=calitp_extracted_at and <=calitp_deleted_at gets at the same thing.

from calitp.tables import tbl
from calitp import query_sql
from siuba import *

import datetime as dt
SELECTED_DATE = dt.date(2021, 10, 22)

table = query_sql(
    '''
    with lat_long as (
        SELECT 
            calitp_itp_id,
            calitp_url_number,
            calitp_extracted_at,
            shape_id,
            shape_pt_sequence,
            calitp_deleted_at,

        ST_GEOGPOINT(CAST(shape_pt_lon AS FLOAT64), CAST(shape_pt_lat AS FLOAT64)) as pt_geom
        FROM `cal-itp-data-infra.gtfs_schedule_type2.shapes`
        WHERE
            calitp_itp_id = 182 AND calitp_extracted_at <= '2021-10-22'
            AND calitp_deleted_at >= '2021-10-22'
        LIMIT 100
    )

    SELECT 
        calitp_itp_id,
        calitp_url_number,
        shape_id,
        calitp_extracted_at,
        ST_MAKELINE(ARRAY_AGG(pt_geom ORDER BY shape_pt_sequence)) as line_geom
    FROM lat_long
    GROUP BY 
        calitp_itp_id,
        calitp_url_number,
        calitp_extracted_at,
        shape_id
    '''
)

lauriemerrell commented 2 years ago

@tiffanychu90 -- Yes that looks right except I think you want a strict inequality for calitp_deleted_at > '2021-10-22' (there can be two copies of a shape that meet the criteria if both inequalities are inclusive.)

Please let me know if the LINESTRING geometry that this makes is actually useful or if we need to investigate further transformations in BQ to get output that's useful for the other GIS packages you all have been using on the notebook side (geopandas?). And I'll be curious how this performs in notebooks... I definitely think we'd move it into BQ shortly but I am wondering if this already gives a performance boost?

tiffanychu90 commented 2 years ago

@lauriemerrell The linestring is what is preferred, over points. I'll do some checking to see if this can match up with a previous date we'd used with shapes already constructed.

Do you think that it's better for us to query it this way for all the operators, whenever we need shapes for all the operators? It's been more common that we want all the operators, rather than just a single operator. The shared_utils version does allow for a list of operators. When it's run through all the operators, that's when it gets time-consuming.

If we were to do it in a notebook, I still think we would query this for a past date, then stash it as a parquet in GCS. Would it be better to have these queries done and stashed somewhere already as linestrings (BQ or GCS) with DAG or this one-offs for given dates and stashed in GCS related to a project?

lauriemerrell commented 2 years ago

I guess to clarify, I am envisioning that very soon we will have a table called like gtfs_schedule_dim_shapes that has this table with the linestring geometry already created. So, in that case I think the query would become:

SELECT * 
FROM gtfs_schedule_dim_shapes
WHERE calitp_extracted_at <= [date of interest]
            AND calitp_deleted_at > [date of interest]

^ and that would grab for all operators etc.

I am not sure about the stashing in parquet.... Do you have one parquet per feed? Or one parquet per shape? Or one parquet per analysis date, with all shapes + feeds combined?

I think it would probably make sense for you all to run the above query (once the new table exists) and then stash those results at whatever level in an analysis-specific folder, assuming the goal is just replicability. If the goal is execution time, hopefully with the new table, there will be essentially zero processing time for you all, you can just grab the pre-computed linestrings with no extra work.

Longer term, I think that there is a thought that that stashing would happen within the warehouse by just writing the results to a new BQ table via dbt, but that is a ways off.

cc @atvaccaro in case he wants to disagree with me

tiffanychu90 commented 2 years ago

@lauriemerrell Ahh, if there is the gtfs_schedule_dim_shapes, then that's ideal. That would make it reproducible and replicable.

This is no longer relevant: I guess I didn't realize there would be dim_shapes, and was thinking a DAG would be called to write a new analysis date + shapes as parquets to be stored in GCS, as a holdover, until a BQ table was ready.

The parquets were per analysis date + all the shapes + feeds stored as a geoparquet. We might still sometimes store it, just to avoid hitting the warehouse so much while we're exploring in a notebook.

lauriemerrell commented 2 years ago

Yeah, sorry for not clarifying -- I make a shapes_clean in this PR: https://github.com/cal-itp/data-infra/pull/1162, which might not merge until later this week or early next because it's part of a bunch of larger things, but as soon as that's in creating dim_shapes is essentially trivial.

So... Coming soon! I just wanted to share the above queries in case they're helpful as a holdover until that table is available.

lauriemerrell commented 2 years ago

Notes from @edasmalchi and @mjumbewu :

In the queries above, we need to CAST shape_pt_sequence as an integer; at this point it's a string and that causes those queries to sort alphabetically rather than numerically.
It still seems(?) like BQ's implementation makes self-intersecting shapes into MULTILINESTRINGs; @edasmalchi to report back on whether that can be dealt with easily on the shapely side or whether that's a hard blocker.

mjumbewu commented 2 years ago

Consider the following LineString that proceeds from the southwest to the northeast of the following image, and then loops back on itself before continuing further north (the red arrows were added by me manually):

When we attempt to represent this route in BigQuery using a few different methods, the result is always the same:

when constructing the points individually:

select st_makeline([
    st_geogpoint(0, 0),
    st_geogpoint(0, 1),
    st_geogpoint(1, 1),
    st_geogpoint(2, 2),
    st_geogpoint(1, 1),
    st_geogpoint(0, 1),
    st_geogpoint(0, 2)
])

when constructing the segments individually:

select st_makeline([
    st_makeline(st_geogpoint(0, 0), st_geogpoint(0, 1)),
    st_makeline(st_geogpoint(0, 1), st_geogpoint(1, 1)),
    st_makeline(st_geogpoint(1, 1), st_geogpoint(2, 2)),
    st_makeline(st_geogpoint(2, 2), st_geogpoint(1, 1)),
    st_makeline(st_geogpoint(1, 1), st_geogpoint(0, 1)),
    st_makeline(st_geogpoint(0, 1), st_geogpoint(0, 2))
])

_even when decoding a WKT string :openmouth::

select st_geogfromtext('linestring(0 0,0 1,1 1,2 2,1 1,0 1,0 2)')

Each time, the result is the same MultLineString:

MULTILINESTRING((0 0, 0 1, 1 1, 2 2), (0 1, 0 2))

Which is the following route:

So BigQuery is not inserting the doubled-over portion into separate line strings; it's just entirely skipping segments that it feels it has already covered. Fine for representing a shape, but not a path.

Posted a Stack Exchange question, in case anyone has a solution.

edasmalchi commented 2 years ago

Thanks @mjumbewu for the concise example! I'm observing the same behavior with that SamTrans shape I was looking at earlier, but it's not as easy to document/reproduce here.

@lauriemerrell this does seem like a hard blocker for my rt analysis, which depends on using a LineString as a path, though not necessarily for all kinds of analysis (some analyses only need the shape of the trip).

Not sure what the best path forward is, but at least for now it seems like I could just use gtfs_schedule_type2.shapes and continue to construct the LineStrings using Shapely when doing historical RT analyses?

lauriemerrell commented 2 years ago

@edasmalchi yeah, that's what I figured (Mjumbe and I walked through some of this live yesterday... The BQ implementation blows my mind in a bad way, especially the "when constructing the segments individually" one.)

A few thoughts:

We can at least turn the lat/long pairs into GEOGRAPHY types to make points, and then you all could concatenate them on the shapely side.
We could also make the pairwise LINESTRINGs, which could be concatenated on the shapely side.
Alternatively, we can try to construct a string that can be exported that has all the points concatenated that you all could read as a linestring on the shapely side. Like, we can not ask BQ to try to treat them as geos and instead try to just construct a JSON-type thing that you can import as a geo outside but not tell BQ that it's a geo.
- This is obviously not actually good practice. But doesn't seem like we totally have a choice :P
- Also this is kind of implied by how the data is presented in GTFS so maybe it's not THAT bad an abuse of the data types etc.....?
- I just tried to play with this a bit and I think it would be a bit tricky, because BigQuery doesn't let you construct an ARRAY of ARRAYs, which is essentially what a LINESTRING is. But it's possible we could do something.

So once again I think part of the question is what is needed to save time for you all and what is easiest. We can maybe discuss in more detail once the existing work merges and we can experiment more directly in prod.

lauriemerrell commented 2 years ago

Some thoughts via @atvaccaro:

@mjumbewu -- looks like someone did ask about this on SO a few years ago and got confirmation of the behavior then 😕 https://stackoverflow.com/questions/58234223/st-makeline-discarding-duplicate-points-even-if-not-consecutive
Andrew also found: https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions#st_geogfromgeojson, which I can experiment with too... This one at least gives options for whether we're trying to "make valid" and it also has an "oriented" option. So maybe worth playing with it to see if there's a way we can make the inputs into a GeoJSON which this can then parse (but that is fully within the zone of my third option above where we're going to have to do some iffy string stuff IMO.)

lauriemerrell commented 2 years ago

@tiffanychu90 and @edasmalchi -- I am marking this as fixed by #1242 because that at least creates a dim_shapes table. I don't want to foreclose future iteration on making the geographic aspect better, but I think it would be fair to create a subsequent ticket as needed after we meet tomorrow.

cal-itp / data-infra

GTFS Schedule table with historical shapes data #1210