CityofToronto / bdit_data-sources

Data sources used by the Big Data Innovation Team
https://github.com/orgs/CityofToronto/teams/bigdatainnovationteam
GNU General Public License v3.0
38 stars 8 forks source link

Debugging why CIS-TTC comparison results in ~2x processed data #134

Closed radumas closed 3 years ago

radumas commented 6 years ago

While working on #130, comparing the data from trip segments on the 514 & 504 between our CIS processing and the processing the TTC have done, we have run into the issue of having approximately 2x the number of trip-segments from our process CIS data vs. the TTC's.

To debug this we should first identify whether running our process on CIS for a week of data has produced the expected amount of output.

e.g.: image

radumas commented 6 years ago

Per above :point_up: issue is in this query, possibly because of using different number of shapes than in original query

chelsrosic commented 6 years ago

What I've analyzed At the bottom of the debugging_cis_ttc_comparison under the Results from first fix heading, I have analyzed the overall counts for the whole week between the segments provided from the TTC. I found that the cases are more similar to the TTC counts, however they are not exactly the same and I think the difference is still significant. For example, instead of having counts that are a few thousand off like they were before, the counts are about a few hundred off. Our counts are always lower than theirs now, before they were larger. Additionally, the segments with subways as one of the endpoints have especially large differences in counts (for example 586 versus 8).

What I tried/changed I knew that the subways had more than one stop_id associated with them but I did not take this into consideration when originally calculating counts in the analysis in the comparing_ttc_data notebook. I edited the section_runs table (the table with the TTC data) by adding an extra column so I could include the extra stop_id(s) for the segments that have an endpoint(s) at a subway station. The counts at the bottom of the debugging_cis_ttc_comparison under the Results from first fix heading take into account the the fact that subways have more than one stop_id. The edits that I made did not have an impact on the counts.

There was also a change that I made in the query from the match stop issue #88, I changed my query to match the one in the cis_processing.md file. I made a change in the right side of the inner join in the cis_gtfs with-clause. Before the query looked like:

WITH line_data AS(
SELECT geom AS line, direction_id FROM shapes_geom_20171119_20171125
    INNER JOIN trips_20171119_20171125 USING (shape_id)
    WHERE shape_id IN (695251, 695253)
    GROUP BY line, shape_id, direction_id
    ORDER BY shape_id
),

cis_gtfs AS(
SELECT date_time, id AS cis_id, stop_id, vehicle, a.direction_id,
ST_LineLocatePoint(line, position) AS cis_to_line,
ST_LineLocatePoint(line, geom) AS stop_to_line,
(CASE WHEN ST_LineLocatePoint(line, position) > ST_LineLocatePoint(line, geom)
      THEN 'after'
      WHEN ST_LineLocatePoint(line, position) < ST_LineLocatePoint(line, geom)
      THEN 'before'
      WHEN ST_LineLocatePoint(line, position) = ST_LineLocatePoint(line, geom)
      THEN 'same'
      END) AS line_position,
ST_Distance(position::geography, geom::geography) AS distance
FROM line_data a, cis_nov_514_angle b
INNER JOIN stops_20171119_20171125 USING (stop_id)
WHERE a.direction_id = b.direction_id
ORDER BY vehicle, a.direction_id, date_time
),

stop_orders AS (
SELECT *,
(CASE WHEN lag(stop_id, 1) OVER (PARTITION BY vehicle ORDER BY date_time) IS NULL
      THEN nextval('stops')
      WHEN stop_id <> lag(stop_id, 1) OVER (PARTITION BY vehicle ORDER BY date_time)
      THEN nextval('stops')
      WHEN stop_id = lag(stop_id, 1) OVER (PARTITION BY vehicle ORDER BY date_time)
      THEN currval('stops')
END) AS stop_order
FROM cis_gtfs
WHERE (line_position = 'before' AND distance <= 200) OR (line_position = 'after' AND distance <= 10) OR (line_position = 'same' AND distance <= 100)
ORDER BY vehicle, direction_id, date_time
)

SELECT MIN(date_time) AS arrival_time, MAX(date_time) AS departure_time, vehicle, stop_id, direction_id, array_agg(DISTINCT cis_id) AS cis_group
INTO match_stop_514_nov
FROM stop_orders
GROUP BY stop_order, vehicle, stop_id, direction_id; 

Now it looks like:

WITH line_data AS(
SELECT geom AS line, direction_id FROM shapes_geom_20171119_20171125
    INNER JOIN trips_20171119_20171125 USING (shape_id)
    WHERE shape_id IN (695251, 695253)
    GROUP BY line, shape_id, direction_id
    ORDER BY shape_id
),

cis_gtfs AS(
SELECT date_time, id AS cis_id, stop_id, vehicle, a.direction_id,
ST_LineLocatePoint(line, position) AS cis_to_line,
ST_LineLocatePoint(line, geom) AS stop_to_line,
(CASE WHEN ST_LineLocatePoint(line, position) > ST_LineLocatePoint(line, geom)
      THEN 'after'
      WHEN ST_LineLocatePoint(line, position) < ST_LineLocatePoint(line, geom)
      THEN 'before'
      WHEN ST_LineLocatePoint(line, position) = ST_LineLocatePoint(line, geom)
      THEN 'same'
      END) AS line_position,
ST_Distance(position::geography, geom::geography) AS distance
FROM line_data a, cis_nov_514_angle b
INNER JOIN (SELECT * FROM stops_20171119_20171125 WHERE stop_id IN (SELECT DISTINCT stop_id FROM crosic.nov_514_stop_pattern)) r
USING (stop_id)
WHERE a.direction_id = b.direction_id
ORDER BY vehicle, a.direction_id, date_time
),

stop_orders AS (
SELECT *,
(CASE WHEN lag(stop_id, 1) OVER (PARTITION BY vehicle ORDER BY date_time) IS NULL
      THEN nextval('stops')
      WHEN stop_id <> lag(stop_id, 1) OVER (PARTITION BY vehicle ORDER BY date_time)
      THEN nextval('stops')
      WHEN stop_id = lag(stop_id, 1) OVER (PARTITION BY vehicle ORDER BY date_time)
      THEN currval('stops')
END) AS stop_order
FROM cis_gtfs
WHERE (line_position = 'before' AND distance <= 200) OR (line_position = 'after' AND distance <= 10) OR (line_position = 'same' AND distance <= 100)
ORDER BY vehicle, direction_id, date_time
)

SELECT MIN(date_time) AS arrival_time, MAX(date_time) AS departure_time, vehicle, stop_id, direction_id, array_agg(DISTINCT cis_id) AS cis_group
INTO match_stop_514_nov
FROM stop_orders
GROUP BY stop_order, vehicle, stop_id, direction_id; 

This change in the processing did not impact any of the count numbers.