Closed edasmalchi closed 2 years ago
Ahhhhhh I bet I know what's going on here, this is an example of one of the things I've been talking about where our type 2 logic is a problem. Sorry I didn't anticipate this.
I am 90% sure this is because the type 2 (slowly changing dimension) logic is all applied at the row level, so this is versioning at the shape point level rather than the shape ID level.
I will need to think about this a little bit..... I think I'll basically need to implement shape ID level versioning separately within this table.
I'll get to this ASAP. In the meantime heads up that GTFS Schedule Feeds Latest should not be subject to this problem, because it uses different date logic.
Maybe that wasn't clear -- basically, shape_id
is not actually really being used here at all for the versioning under the hood, that's why you're seeing multiples.
And this is making me wonder if it's possible that we have any other versions of this problem for example in anything derived from stop_times
where you have a similar process of constructing components into a whole. I will need to look into that.
Totally clear, I think we talked about it a bit last week but yeah I didn't make the connection that it could cause issues here either. Thanks Laurie!
Describe the bug Certain shapes in
cal-itp-data-infra.views.gtfs_schedule_dim_shapes_geo
for Long Beach Transit (170) seem to have some incorrect records (too few points).To Reproduce Query I use to get shapes for a single feed x analysis date, filtered to two shapes showing issue:
This query returns three records per shape_id, two seem correct (many points in
pt_array
), but one each only has 4 points inpt_array
and do not correctly describe the shape. The other two look OK based on some simple mapping and comparison to the transit agency's maps.Additionally, I'm unsure why filtering on
calitp_extracted_at
andcalitp_deleted_at
here returns multiple records per shape_id?Expected behavior All records describing valid shapes (generally, with many points)
Additional context Notebook with additional examples: https://github.com/cal-itp/data-analyses/blob/rt-selfserve/rt_delay/16_test_new_shapes.ipynb