CityofToronto / bdit_flashcrow

Working repository for MOVE, a project to modernize transportation data systems at the City of Toronto.
MIT License
9 stars 2 forks source link

Study inconsistencies between Flow and MOVE #529

Closed mkewins closed 3 years ago

mkewins commented 4 years ago

What Happened A couple folks at Traffic Signal Ops reported that certain traffic count reports are available in MOVE, but not Flow, and vice versa.

What Should Happen Reports should be consistently available between MOVE and Flow.

To Reproduce Steps to reproduce the bug:

  1. Go to Flow
  2. Search for TCS 1346
  3. Click on the intersection
  4. See that no studies are available ("No data for the selected node(s)")
  5. Go to MOVE
  6. Look for the same location (Yorkdale Rd / Allen N Yorkdale Rd Ramp / Yorkdale 401 Allen N Ramp)
  7. See that there are 7 TMCs available

Screenshots TCS 1346

Flow vs  MOVE TCS 1346

TCS 1476

Flow vs  MOVE TCS 1476

Additional Notes The TSO folks found two cases:

Unsure if these two cases are caused by the same underlying problem or something different.

candu commented 4 years ago

Haven't looked into it yet, but some quick thoughts.

One very likely culprit is our conflate_artery_centreline process in Airflow. Each FLOW study is attached to an "arterycode", which describes the location / direction of the study. To power the centreline-focused interactions in MOVE, we attempt to tie each such arterycode to an intersection or midblock in the centreline. This process isn't 100% accurate, however; an error there would mean that the arterycode is tied to the wrong centreline location, which would result in the behaviour described here.

We do have a set of manual overrides borrowed from previous work by @aharpalaniTO and others, but that's not likely to be exhaustive either.

Other than that, I'm not sure what else would cause this; when I get back to this (probably after more of the multi-location stuff), I'll check to see if it's a conflate_artery_centreline issue first. If it is, I'll probably add it to our manual overrides set for now. If not, I'll have to dig deeper.

I also don't know how we would assess consistency across FLOW and MOVE more systemically: other than one-off bug tickets, I don't have a good way right now to determine the whole set of reports available in FLOW, let alone compare that (or port it over) to MOVE. Definitely welcome any thoughts from @aharpalaniTO @radumas @jwrcoleman and others who have worked with these datasets / applications longer than I have!

mkewins commented 4 years ago

Adrian Hetherington reported another intersection where the data is in Flow, but is missing from MOVE. Intersection is Wilmington Ave and Finch Ave West.

I can't reproduce it though -- I'm seeing studies from the same date in both Flow and MOVE. What I am seeing is different count totals.

candu commented 3 years ago

@mkewins - the Wilmington and Finch count total mismatch might have been solved by #666 , which fixed a general bug with TMC reports at T-intersections.

The other parts of this look like a potential conflation issue with intersections. There's quite a large number of "psuedo-intersections" around Yorkdale Rd, mostly where the road meets various expressway ramps:

select count(*), elevatio9, elevatio10 from gis.centreline_intersection where intersec5 like '%Yorkdale Rd%' group by elevatio9, elevatio10;
 count | elevatio9 | elevatio10 
-------+-----------+------------
     9 |    501100 | Expressway
     7 |    501300 | Minor
     1 |    504000 | Pedatraian
     1 |    509100 | Cul de sac
    36 |    509200 | Pseudo

We exclude all intersections where elevatio9 > 501700 from both centreline vector tiles and location search, mostly to limit visual / search clutter from trailheads and railway crossings and such.

However, we do include all non-trail intersections when conflating counts against the centreline. Even with that, though, we still fail to conflate ~300 intersections:

select count(*) from "TRAFFIC"."ARTERYDATA" ad join counts.arteries_intersection ai on ad."ARTERYCODE" = ai.arterycode wh
ere ai.int_id is null;
 count
-------
   287
(1 row)

This unfortunately includes PX 1476:

select * from "TRAFFIC"."ARTERYDATA" ad join counts.arteries_intersection ai on ad."ARTERYCODE" = ai.arterycode where ai.
int_id is null and ad."LOCATION" like '%YORKDALE RD%';
 ARTERYCODE | GEOMCODE | STREET1  | STREET1TYPE | STREET1DIR |        STREET2         | STREET2TYPE | STREET2DIR |            STREET3             | STREET3TYPE | STREET3DIR | STAT_CODE  | COUNT_TYPE |                             LOCATION                              | APPRDIR | SIDEOFINT |  LINKID  | SEQ_ORDER | GEO_ID | arterycode | match_on_case | int_id | geom
------------+----------+----------+-------------+------------+------------------------+-------------+------------+--------------------------------+-------------+------------+------------+------------+-------------------------------------------------------------------+---------+-----------+----------+-----------+--------+------------+---------------+--------+------
       4960 |        0 | YORKDALE | RD          |            | YORKDALE N PARKING LOT |             |            | PX 1476  335M E OF DUFFERIN ST |             |            | 0015000025 | R          | YORKDALE RD AT YORKDALE N PRKG LOT & 335M E OF DUFFERIN (PX 1476) |         |           | 15000025 |           |        |       4960 |             3 |        |
(1 row)

This is a complicated one: from the point of view of the centreline, it's on a midblock, because there's no City road to intersect with. From the counts / studies point of view, it's at an intersection where one of the roads happens to be a private parking lot entrance.

Part of the complexity here lies in how location is represented in FLOW data:

This isn't great, and we should very much try to move away from it when we tackle data pipelines. (For instance: if these counts were tagged directly with centreline IDs plus lat-long location, that would be great! Even reliable lat-long location alone would help a great deal.)

candu commented 3 years ago

Going to suggest the following course of action:

Not sure what the priority on any of this should be, though - might make more sense to dive in during data pipeline phases of the project?

jwrcoleman commented 3 years ago

One thought would just be to make sure that all signalized intersections are included in the conflation, otherwise defer to data pipeline phase (with suitable warnings applied etc.). I think you've hit on what will be a common problem, where not all traffic signals are at centreline_intersection locations, specifically signals at private driveways or parking lots. See below (white circles are intersections, red circles are traffic signals): image

jwrcoleman commented 3 years ago

Looking at the traffic_signal layer, there are 206 records where midblock_route is not Null (our traffic_signal layer has not been updated since October 2019 so this might miss a couple new ones) image

candu commented 3 years ago

Another case reported from longitudinal study feedback:

Map image showing conflation mistake at Sumach and Shuter

From the bug report and a bit of digging, these studies should have centreline_id = 30026352, but are instead conflated to centreline_id = 13465353.

Looking deeper, this is the result of conflation for arterycode 11309:

flashcrow=> select * from counts.arteries_intersection where arterycode=11309;
 arterycode | match_on_case |  int_id  |                        geom
------------+---------------+----------+----------------------------------------------------
      11309 |             2 | 13465353 | 0101000020E610000057E12CAB04D753C0E8D43A1640D44540

However, further exploration reveals a conflicting picture: centreline_id = 13465353 is a better spatial match for this arterycode, but centreline_id = 30026352 matches up better with the FILE_NAME that describes this as being at Sumach and Shuter! (Welcome to the fun, fun world of legacy data.)

Because of this, it's unclear how to proceed. Do we do nothing? Do we enter a manual override, which also risks mislabelling this arterycode? How do we make decisions like this in general, and how do we explain those decisions to our users?

Further Exploration

We first query to get at the arterycode:

flashcrow=> select * from "TRAFFIC"."COUNTINFOMICS" where "COUNT_INFO_ID" in (select count_group_id from counts.studies where centreline_id = 13465353);

 COUNT_INFO_ID | ARTERYCODE | COUNT_TYPE |     COUNT_DATE      | DAY_NO | COMMENT_ |               FILE_NAME                | SOURCE1 | SOURCE2 |      LOAD_DATE      | TRANSFER_REC | CATEGORY_ID                                                                          
---------------+------------+------------+---------------------+--------+----------+----------------------------------------+---------+---------+---------------------+--------------+-------------                                                                                  
         32659 |      11309 | R          | 2015-06-22 00:00:00 |      2 |          | Crunch..EF06.23 & EF06.23 & EF07.07.15 |         |         |                     |              |           5                                                                                   
         36777 |      11309 | R          | 2018-02-27 00:00:00 |      3 |          | 88186_SUMACH_ST_and_SHUTER_ST_BIKE.hex |         |         |                     |              |           5                                                                                    
          8463 |      11309 | R          | 2000-05-30 00:00:00 |      2 |          | YY05.06                                |         |         | 2000-07-12 08:40:53 |           -1 |           5                                                                                    
          6496 |      11309 | R          | 1997-12-01 07:45:00 |      1 |          |                                        |         |         |                     |            0 |           5

match_on_case = 2 here means this is matching using counts.arteries_intersection_int_id, which in turn uses "TRAFFIC"."NODE":

flashcrow=> select * from counts.nodes_intersection where link_id = (select "LINKID"::bigint from "TRAFFIC"."ARTERYDATA" where "ARTER
YCODE" = 11309);
 link_id  | match_on_case |  int_id  |                        geom
----------+---------------+----------+----------------------------------------------------
 13465353 |             1 | 13465353 | 0101000020E610000057E12CAB04D753C0E8D43A1640D44540

match_on_case = 1 here means it's matched the link_id directly to an intersection ID (which makes sense, they're the same value.)

However, even if we did perform a spatial match instead of matching link_id against the centreline, results here suggest that int_id = 13465353 is in fact a much better match:

SELECT nc.link_id, 2 AS match_on_case, ci.int_id, ci.geom, ci.geom_dist
  FROM counts.nodes_corrected nc
  JOIN LATERAL (
    SELECT
      int_id,
      geom,
      ST_Distance(ST_Transform(nc.geom, 2952), ST_Transform(geom, 2952)) AS geom_dist
    FROM counts.centreline_intersection
    WHERE ST_DWithin(ST_Transform(nc.geom, 2952), ST_Transform(geom, 2952), 30)
    ORDER BY geom_dist ASC
  ) ci ON true
  WHERE nc.link_id = 13465353;

 link_id  | match_on_case |  int_id  |                        geom                        |     geom_dist
----------+---------------+----------+----------------------------------------------------+-------------------
 13465353 |             2 | 13465353 | 0101000020E610000057E12CAB04D753C0E8D43A1640D44540 | 0.120069462801905
 13465353 |             2 | 30026352 | 0101000020E61000006F07342202D753C0B1AD253843D44540 |  16.5067114852413
candu commented 3 years ago

Another "we need midblock TMCs" report, near Morningside and Beath:

Could not find TMC at Morningside and West Hill Collegiate Access, but available on Flow

Centreline-wise this is a midblock, but in reality there's an intersection there with a private driveway to West Hill Collegiate Institute.

candu commented 3 years ago

Another "we need midblock TMCs" report, near Bayview and Blythwood:

I just noticed that the attached tmc at the driveway to 1929 Bayview is in Flow but not in Move.

candu commented 3 years ago

A report of a TMC that was conflated to the wrong intersection:

A turning movement count (TMC) for ""VAUGHAN RD AT MOIR AVE AND WINONA DR (PX 802)"" shows up in FLOW, the date is 2018-02-15.

The TMC report doesn't show up for that location in MOVE. In fact, by looking at the studies for the surrounding midblocks / intersections, it appears that the TMC has been assigned to the intersection directly to the Northwest: https://move.intra.qa-toronto.ca/view/location/s1:AMQsmB/POINTS/reports/TMC

Searching for px:802 jumps to the expected location, where (as mentioned) there are no TMCs. This looks like a mis-conflation of the underlying arterycode:

flashcrow=> select * from "TRAFFIC"."COUNTINFOMICS" where "COUNT_INFO_ID" = 36723;
 COUNT_INFO_ID | ARTERYCODE | COUNT_TYPE |     COUNT_DATE      | DAY_NO |                                                                                            COMMENT_                                                                                             |                FILE_NAME                | SOURCE1 | SOURCE2 | LOAD_DATE | TRANSFER_REC | CATEGORY_ID
---------------+------------+------------+---------------------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------+---------+---------+-----------+--------------+-------------
         36723 |       5865 | R          | 2018-02-15 00:00:00 |      5 | NBL: Moir Ave to WB Vaughan Rd, NBT: Moir Ave to EB Vaughan Rd, NBR: Moir Ave to SB Winina Dr, SOTH: Moir Ave to NB Winona Dr, EOTH: WB Vaughan Rd to Moir Ave, WOTH: EB Vaughan Rd to Moir Ave | 88205_VAUGHAN_RD_and_WINONA_DR_BIKE.hex |         |         |           |              |           5
(1 row)

flashcrow=> select * from "TRAFFIC"."ARTERYDATA" where "ARTERYCODE" = 5865;
 ARTERYCODE | GEOMCODE | STREET1 | STREET1TYPE | STREET1DIR | STREET2 | STREET2TYPE | STREET2DIR | STREET3 | STREET3TYPE | STREET3DIR | STAT_CODE  | COUNT_TYPE |                   LOCATION                    | APPRDIR | SIDEOFINT |  LINKID  | SEQ_ORDER | GEO_ID
------------+----------+---------+-------------+------------+---------+-------------+------------+---------+-------------+------------+------------+------------+-----------------------------------------------+---------+-----------+----------+-----------+--------
       5865 |        0 | VAUGHAN | RD          |            | WINONA  | DR          |            | PX 802  |             |            | 0013460056 | R          | VAUGHAN RD AT MOIR AVE AND WINONA DR (PX 802) |         |           | 13460056 |           |
(1 row)

flashcrow=> select * from counts.arteries_centreline where arterycode = 5865;
 arterycode | centreline_type | centreline_id | direction |                        geom
------------+-----------------+---------------+-----------+----------------------------------------------------
       5865 |               2 |      13459974 |           | 0101000020E6100000B34523B6F3DB53C04C9994B55FD84540
(1 row)

flashcrow=> select * from gis.centreline_intersection where int_id = 13459974;
   gid   |  int_id  | elev_id |        intersec5         | classifi6 |     classifi7      | num_elev | elevatio9 | elevatio10 | elev_level | elevation | elevatio13 | height_r14 | height_r15 |     x     |      y      |   longitude   |   latitude   | objectid |                        geom
---------+----------+---------+--------------------------+-----------+--------------------+----------+-----------+------------+------------+-----------+------------+------------+------------+-----------+-------------+---------------+--------------+----------+----------------------------------------------------
 2380937 | 13459974 |    2400 | Vaughan Rd / Jesmond Ave | MNRSL     | Minor-Single Level |        1 |    501300 | Minor      |          0 |         0 |            |          0 |            | 309883.05 | 4838721.236 | -79.436749968 | 43.690420816 |    17452 | 0101000020E6100000B34523B6F3DB53C04C9994B55FD84540
(1 row)

This conflation has match_on_case = 1, corresponding to the counts.arteries_intersection_manual_corr view. Looking deeper into the view query results, the correct conflation is in fact a possible value:

 arterycode |  int_id  | ranking
------------+----------+----------
       5865 | 13459974 | 13459974
       5865 | 13460056 | 13460056

However, ranking is just using int_id, which results in a preference for the incorrect conflation here. We could use the LOCATION field of "TRAFFIC"."ARTERYDATA", as this contains PX 802 (and could be used to match).

candu commented 3 years ago

The above push seems to address the cases mentioned here - closing for now, and we can re-open one-off tasks as needed.