cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
47 stars 12 forks source link

Bug: Some Invalid Shapes Data in Warehouse for LBT (170), too few points #1266

Closed edasmalchi closed 2 years ago

edasmalchi commented 2 years ago

Describe the bug Certain shapes in cal-itp-data-infra.views.gtfs_schedule_dim_shapes_geo for Long Beach Transit (170) seem to have some incorrect records (too few points).

To Reproduce Query I use to get shapes for a single feed x analysis date, filtered to two shapes showing issue:

SELECT *,
FROM `cal-itp-data-infra.views.gtfs_schedule_dim_shapes_geo`
WHERE calitp_itp_id = 170 AND calitp_extracted_at <= '2022-03-23' AND calitp_deleted_at > '2022-03-23'
AND shape_id IN ('1720090', '1030063') # two shapes referenced in notebook
LIMIT 1000
Screen Shot 2022-03-28 at 12 36 00 PM

This query returns three records per shape_id, two seem correct (many points in pt_array), but one each only has 4 points in pt_array and do not correctly describe the shape. The other two look OK based on some simple mapping and comparison to the transit agency's maps.

Additionally, I'm unsure why filtering on calitp_extracted_at and calitp_deleted_at here returns multiple records per shape_id?

Expected behavior All records describing valid shapes (generally, with many points)

Additional context Notebook with additional examples: https://github.com/cal-itp/data-analyses/blob/rt-selfserve/rt_delay/16_test_new_shapes.ipynb

lauriemerrell commented 2 years ago

Ahhhhhh I bet I know what's going on here, this is an example of one of the things I've been talking about where our type 2 logic is a problem. Sorry I didn't anticipate this.

I am 90% sure this is because the type 2 (slowly changing dimension) logic is all applied at the row level, so this is versioning at the shape point level rather than the shape ID level.

I will need to think about this a little bit..... I think I'll basically need to implement shape ID level versioning separately within this table.

I'll get to this ASAP. In the meantime heads up that GTFS Schedule Feeds Latest should not be subject to this problem, because it uses different date logic.

lauriemerrell commented 2 years ago

Maybe that wasn't clear -- basically, shape_id is not actually really being used here at all for the versioning under the hood, that's why you're seeing multiples.

And this is making me wonder if it's possible that we have any other versions of this problem for example in anything derived from stop_times where you have a similar process of constructing components into a whole. I will need to look into that.

edasmalchi commented 2 years ago

Totally clear, I think we talked about it a bit last week but yeah I didn't make the connection that it could cause issues here either. Thanks Laurie!