Closed PPaulsonOregonDOT closed 6 years ago
@srinivas13794 sounds like this was related to agencies with route miles contained outside of Oregon. Am I thinking of the correct issue here?
@ed-g There are two cases in which you can see a difference between route miles and the sum of urban,rural route miles:
@ed-g: Is it possible that there is a datum difference with the two datasets? I know the route shapes are going to come in as WGS 1984 (EPSG 4326). From what I can tell, the urban/rural boundary file provided by the census is provided in North American Datum 83 (EPSG 4269). If they aren't projected to the same projection (ideally Oregom Statewide Lamber (EPSG 2992) , it's possible that it could introduce some error in the intersection calculations. I'm not convinced that this is the issue, but we need to know for future reference.
@PPaulsonOregonDOT this is a little over my head with regard to GIS. We can research and find out.
We still need to at least get to the point where we are confident we understand the reason for the differences we are seeing. If practical we should also fix this.
Comments from Chris at Trillium:
Ah I see. That does make it more difficult to deduce. If we assume that Phillip is correct and route shapes is in one datum (WGS 1984 (EPSG 4326)) while the boundary file is in another (North American Datum 83 (EPSG 4269)) then there is a very high likelihood that this is the reason for the discrepancy in total miles. If you have two different datasets in two different projections there will always be some degree of error.
One way to tell for sure would be to have the same datasets projected in the same coordinate system and then run the miles calculation again. I imagine this would be pretty easy for them to test at ODOT.
This is a research task from my point of view. 4-16 hours guesstimate.
I was able to get access to the OSU version of the database, which I assume is the same, in terms of spatial boundaries, as the Trillium version that is currently "top spec". From looking at the May 2017 database, it looks like there's two sets of files for each relation that would be involved in the urban/rural route miles/service miles calculations. One of these tables has a coordinate reference system defined, one doesn't. Those that have CRS's defined appear all to be EPSG 4326, or WGS 1984, which is the native format for GTFS data. As a result, I think we can conclude that the projection is not the issue, or that if it is, the issue is in the pre-processing of the data.
One observation I had while investigating this issue was that the level of error varied depending on which agency you were looking at, which makes me think that the error might actually rest in the accuracy of the shapes being used, which would suggest that @srinivas13794 was correct in that the SQL command being used is the source of the error. Essentially, if you don't have a vertex for the route shape exactly where it overlaps with the urban/rural boundary, the segment that overlaps with that boundary is going to be left out, because PostGIS uses bounding boxes of the features to determine whether something intersects with the urban boundary or not. There may be a way to pre-process the shapes with the urban boundaries to add vertexes at those intersection points, but I would have to look into the PostGIS library to figure out which command would give us the expected results. UrbanRuralMilageComparison.xlsx
I have additional data to add to this question. I ran an analysis of the Amtrak Cascades, using the May 2017 database and the shapefiles from that i have from that same database, and came up with the following results: Tool: Rural Route Miles: ArcGIS 66.92 TNA Tool: 66.76 Urban Route Miles ArcGIS: 56.61 TNA Tool: 56.77
I was going to chock in up to a tiny error, but then I checked the Albany Transit system: Route Miles Total: 55.69 Urban 50.42 Rural .71. That represent a 4.5 mile difference, which is going to be significant across 65 agencies.
I pulled a subset of results for agencies out, and have included them in a sheet below. Outside of the errors caused by out of state travel (see C-TRAN and Coast Starlight), if we can figure out what makes the agencies which lose alot of miles different from the that don't, maybe we can figure out the cause of the issue and start to fix it. I'm pretty convinced at this point that the problem is related to the data, because if the tool was causing a consistent error, I think we would expect the errors between the different agencies to be more uniform.
This demonstrates the general problem that I've encountered, from an analysis in ArcMap. The pink area is the Roseburg Urban Boundary area, the green lines are the transit lines that have been classified as rural, the red are the portions of the the system that have been classified as urban, by either clipping or erasing the urban boundary from the transit routes, and the black lines on the image are the census block boundaries from the Census Bureau. I've circled the little pieces that don't appear to be categorized correctly, but I'm still not sure why they're being mislabeled, or how to fix it. I'm not sure that fragments this small would result in the differences in numbers that we're seeing, but it's potentially a symptom of this problem.
I think I may have a fix. It appears to be an issue with the gtfs trip shapes having self-intersections. When a line is ST_Intersection'd with another geography such as a census block or state, the intersection is dissolved, and a lower total distance is reported.
Here is a query that demonstrates this (for 3 routes in the Albany bus system used in the issue above):
SELECT DISTINCT ON (gtfs_trips.route_id)
gtfs_trips.route_id,
gtfs_trips.id,
census.stateid,
(ST_Length(ST_Transform(gtfs_trips.shape, 2993)) / 1609.34) AS length_self,
(ST_Length(ST_Transform(ST_Intersection(gtfs_trips.shape, gtfs_trips.shape), 2993)) / 1609.34) AS length_self_intersection,
(ST_Length(ST_Transform(ST_UnaryUnion(gtfs_trips.shape), 2993)) / 1609.3) as length_self_union,
(ST_Length(ST_Transform(ST_Intersection(census.shape, gtfs_trips.shape), 2993)) / 1609.34) AS length_census_intersection
FROM
census_states AS census,
gtfs_trips
WHERE ST_Intersects(census.shape, gtfs_trips.shape) AND gtfs_trips.agencyid = '456'
ORDER BY gtfs_trips.route_id, length DESC;
Output
route_id | id | stateid | length_self | length_self_intersection | length_self_union | length_census_intersection
----------+------------------+---------+------------------+--------------------------+-------------------+----------------------------
3099 | 128099A2591B3851 | 41 | 21.6528281794289 | 19.4581107991802 | 19.4585944407833 | 19.4581107991802
3100 | 128102A2591B3851 | 41 | 16.8541847365424 | 16.7115511618067 | 16.7119665362219 | 16.7115511618067
3101 | 128110A2591B3851 | 41 | 17.174500072783 | 14.9656590265065 | 14.9660310058522 | 14.9656590265065
The gtfs_trips.length
value is the same as length_self
here - a simple ST_Length
of the shape from shapes.txt
. This is the value that is used in the "Total Route Miles" reported in the reporting tool.
However, if you intersect the trip shape with any other geometry (including itself, which should be a perfect match), you get a slightly shorter distance. This is because the intersection dissolves the line's self-intersections. The reporting tool intersects with census blocks and reports the sum of these values for rural and urban miles. So, essentially that is why they don't sum up to the same value. In fact the numbers work out perfectly:
sum_length_self | sum_length_intersection | sum_length_self_union | sum_length_census_intersection
------------------+-------------------------+-----------------------+--------------------------------
55.6815129887543 | 51.1353209874933 | 51.1353209874933 | 51.1353209874933
Where 55.68151 - 51.13532 = 4.54, exactly the amount of route miles that are missing.
PR coming.
Attached is a query showing each route_id and the different measures of length, sorted by the difference between the plain ST_Length of the trip shape vs. ST_Length(ST_Union(trip.shape)). About 1/2 of all routes have no difference; another 1/3 have a small difference, and a small number of routes have a large difference. However, all differences are positive.
-- see https://gis.stackexchange.com/questions/218154/postgis-different-result-using-st-length-and-st-intersection
COPY(
SELECT * FROM (
SELECT DISTINCT ON (gtfs_trips.route_id)
gtfs_trips.agencyid,
gtfs_trips.route_id,
gtfs_trips.id,
census.stateid,
(ST_Length(ST_Transform(gtfs_trips.shape, 2993)) / 1609.34) AS length_self,
(ST_Length(ST_Transform(ST_UnaryUnion(gtfs_trips.shape), 2993)) / 1609.34) as length_self_union,
(ST_Length(ST_Transform(ST_Intersection(census.shape, gtfs_trips.shape), 2993)) / 1609.34) AS length_census_intersection,
Round( ( (ST_Length(ST_Transform(gtfs_trips.shape, 2993)) / 1609.34) - (ST_Length(ST_Transform(ST_UnaryUnion(gtfs_trips.shape), 2993)) / 1609.34) )::NUMERIC , 5) AS length_difference_self_union
FROM
census_states AS census,
gtfs_trips
WHERE
ST_Intersects(census.shape, gtfs_trips.shape)
ORDER BY gtfs_trips.route_id, gtfs_trips.length DESC
) AS a
ORDER BY a.length_difference_self_union DESC
)
TO STDOUT WITH CSV HEADER DELIMITER ',';
This mostly shows up when a single trip travels the same stretch of road twice. That segment gets dissolved in ST_Intersection and only counted towards the length once.
Looking for a way to perform the intersection while preserving the overlapping segments of the trip...
@irees random thought. Dissolve the trips into individual line segments using ST_DUMP, followed by ST_Instersect with the County polygon, followed by ST_Length, and then SUM the total lengths of all intersecting line segments.
@ed-g Yup - I ended up with something similar. I broke the routes into individual line segments, then use that for the ST_Intersect with the census blocks. This way, each individual segment is correctly intersected and counted towards the total length without dissolving duplicates.
route_segments AS (
SELECT
routes.id as id,
n - 1 as seg_id,
ST_MakeLine(ST_PointN(routes.tripshape, n - 1), ST_PointN(routes.tripshape, n)) as shape
FROM routes -- already filtered in a previous select
CROSS JOIN generate_series(2, ST_NPoints(routes.tripshape)) as n
),
rrtmiles AS (
SELECT
ST_Length(ST_Collect(ST_Intersection(route_segments.shape, census_blocks.shape))::geography) / 1609.34 AS rrtmiles
FROM route_segments
JOIN census_blocks ON ST_Intersects(route_segments.shape, census_blocks.shape)
WHERE census_blocks.poptype = 'R'
),
I added a fix for this particular report. There are several other places where a similar spatial join is made; I will try to find a generalized solution.
I have a more generalized and performant fix on the way. It does require the creation of a relatively simple additional table; I will add creating this table to the updateTrips
method and create it manually in the existing databases.
@irees I just loaded up the report using the URL that I specified, and it defaulted to September 2017 database. I tried to change it to the April 2018 db using the db selector, and that functionality appears to be greyed out, which isn't an expected functionality for me. In looking at the original URL and the report from the 90 version of the tool, the reports appear to match up, but the September 17 to April 18 route numbers, specifically the urban route length increase significantly. This could be because of a service change, but I can't find evidence to suggest that, which means that it changed both calculations. From the discussion around dissolving of overlapping segments, in the urban/rural split but not in the route totals, I;m not sure that this is an expected resulted. The two reports of interest are below:
56.7+26.31 = 83.01 /= 84.64
and
26.44+64.68 = 91.12 /= 100.27
@PPaulsonOregonDOT The update is now deployed and the database updated; the sum of the values should now be generally within 1/10th of a mile of each other (unless the agency includes route miles in other states.)
When looking at the agency detailed report, the urban route miles + rural route miles don't equal the total route miles. The numbers are not widely different, and the difference varies by agency, SMART: http://tna.trilliumtransit.com:8080/TNAtoolAPI-Webapp/AgenXReport.html?&agency=101&x=0.25&l=2&n=izr&dbindex=10&popYear=2010&areaid=null&type=0&username=admin&geotype=-1&geoid=null&nav=stateS-agencyS
Urban Route Miles 56.7 Rural Route Miles 26.31 Total 83.01 Listed Routes Miles 84.64
If I had to take a guess, I would suspect that this is related to the accuracy and density of vertices provided in shapes.txt such that the line segments that cross the urban/rural boundary don't get counted because they can't be categorized strictly urban or rural.