ODOT-PTS / TNExT

a web-based software tool developed for the visualization, analysis, and reporting of regional and statewide transit networks in the state of Oregon
16 stars 4 forks source link

Inconsistency between route mile measures #67

Closed PPaulsonOregonDOT closed 6 years ago

PPaulsonOregonDOT commented 7 years ago

When looking at the agency detailed report, the urban route miles + rural route miles don't equal the total route miles. The numbers are not widely different, and the difference varies by agency, SMART: http://tna.trilliumtransit.com:8080/TNAtoolAPI-Webapp/AgenXReport.html?&agency=101&x=0.25&l=2&n=izr&dbindex=10&popYear=2010&areaid=null&type=0&username=admin&geotype=-1&geoid=null&nav=stateS-agencyS

Urban Route Miles 56.7 Rural Route Miles 26.31 Total 83.01 Listed Routes Miles 84.64

If I had to take a guess, I would suspect that this is related to the accuracy and density of vertices provided in shapes.txt such that the line segments that cross the urban/rural boundary don't get counted because they can't be categorized strictly urban or rural.

ed-g commented 7 years ago

@srinivas13794 sounds like this was related to agencies with route miles contained outside of Oregon. Am I thinking of the correct issue here?

srinivas13794 commented 7 years ago

@ed-g There are two cases in which you can see a difference between route miles and the sum of urban,rural route miles:

  1. When the agency has service outside the state of Oregon. As the database contains only blocks inside Oregon we can only get the urban and rural route miles for those blocks.
  2. In general in using Post GIS commands such as ST_Intersection,ST_Length there is an error of estimation which is the cause of minor differences between route miles and the sum of urban,rural route miles
PPaulsonOregonDOT commented 7 years ago

@ed-g: Is it possible that there is a datum difference with the two datasets? I know the route shapes are going to come in as WGS 1984 (EPSG 4326). From what I can tell, the urban/rural boundary file provided by the census is provided in North American Datum 83 (EPSG 4269). If they aren't projected to the same projection (ideally Oregom Statewide Lamber (EPSG 2992) , it's possible that it could introduce some error in the intersection calculations. I'm not convinced that this is the issue, but we need to know for future reference.

ed-g commented 7 years ago

@PPaulsonOregonDOT this is a little over my head with regard to GIS. We can research and find out.

ODOT-RPTD-mb commented 7 years ago

We still need to at least get to the point where we are confident we understand the reason for the differences we are seeing. If practical we should also fix this.

ed-g commented 7 years ago

Comments from Chris at Trillium:

Ah I see. That does make it more difficult to deduce. If we assume that Phillip is correct and route shapes is in one datum (WGS 1984 (EPSG 4326)) while the boundary file is in another (North American Datum 83 (EPSG 4269)) then there is a very high likelihood that this is the reason for the discrepancy in total miles. If you have two different datasets in two different projections there will always be some degree of error.

One way to tell for sure would be to have the same datasets projected in the same coordinate system and then run the miles calculation again. I imagine this would be pretty easy for them to test at ODOT.

ed-g commented 7 years ago

This is a research task from my point of view. 4-16 hours guesstimate.

PPaulsonOregonDOT commented 6 years ago

I was able to get access to the OSU version of the database, which I assume is the same, in terms of spatial boundaries, as the Trillium version that is currently "top spec". From looking at the May 2017 database, it looks like there's two sets of files for each relation that would be involved in the urban/rural route miles/service miles calculations. One of these tables has a coordinate reference system defined, one doesn't. Those that have CRS's defined appear all to be EPSG 4326, or WGS 1984, which is the native format for GTFS data. As a result, I think we can conclude that the projection is not the issue, or that if it is, the issue is in the pre-processing of the data.

One observation I had while investigating this issue was that the level of error varied depending on which agency you were looking at, which makes me think that the error might actually rest in the accuracy of the shapes being used, which would suggest that @srinivas13794 was correct in that the SQL command being used is the source of the error. Essentially, if you don't have a vertex for the route shape exactly where it overlaps with the urban/rural boundary, the segment that overlaps with that boundary is going to be left out, because PostGIS uses bounding boxes of the features to determine whether something intersects with the urban boundary or not. There may be a way to pre-process the shapes with the urban boundaries to add vertexes at those intersection points, but I would have to look into the PostGIS library to figure out which command would give us the expected results. UrbanRuralMilageComparison.xlsx

PPaulsonOregonDOT commented 6 years ago

I have additional data to add to this question. I ran an analysis of the Amtrak Cascades, using the May 2017 database and the shapefiles from that i have from that same database, and came up with the following results: Tool: Rural Route Miles: ArcGIS 66.92 TNA Tool: 66.76 Urban Route Miles ArcGIS: 56.61 TNA Tool: 56.77

I was going to chock in up to a tiny error, but then I checked the Albany Transit system: Route Miles Total: 55.69 Urban 50.42 Rural .71. That represent a 4.5 mile difference, which is going to be significant across 65 agencies.

I pulled a subset of results for agencies out, and have included them in a sheet below. Outside of the errors caused by out of state travel (see C-TRAN and Coast Starlight), if we can figure out what makes the agencies which lose alot of miles different from the that don't, maybe we can figure out the cause of the issue and start to fix it. I'm pretty convinced at this point that the problem is related to the data, because if the tool was causing a consistent error, I think we would expect the errors between the different agencies to be more uniform.

PPaulsonOregonDOT commented 6 years ago

This demonstrates the general problem that I've encountered, from an analysis in ArcMap. The pink area is the Roseburg Urban Boundary area, the green lines are the transit lines that have been classified as rural, the red are the portions of the the system that have been classified as urban, by either clipping or erasing the urban boundary from the transit routes, and the black lines on the image are the census block boundaries from the Census Bureau. I've circled the little pieces that don't appear to be categorized correctly, but I'm still not sure why they're being mislabeled, or how to fix it. I'm not sure that fragments this small would result in the differences in numbers that we're seeing, but it's potentially a symptom of this problem. urbanruralfragments

irees commented 6 years ago

I think I may have a fix. It appears to be an issue with the gtfs trip shapes having self-intersections. When a line is ST_Intersection'd with another geography such as a census block or state, the intersection is dissolved, and a lower total distance is reported.

Here is a query that demonstrates this (for 3 routes in the Albany bus system used in the issue above):

SELECT DISTINCT ON (gtfs_trips.route_id)
    gtfs_trips.route_id, 
    gtfs_trips.id,
    census.stateid,
    (ST_Length(ST_Transform(gtfs_trips.shape, 2993)) / 1609.34) AS length_self,
    (ST_Length(ST_Transform(ST_Intersection(gtfs_trips.shape, gtfs_trips.shape), 2993)) / 1609.34) AS length_self_intersection,
    (ST_Length(ST_Transform(ST_UnaryUnion(gtfs_trips.shape), 2993)) / 1609.3) as length_self_union,
    (ST_Length(ST_Transform(ST_Intersection(census.shape, gtfs_trips.shape), 2993)) / 1609.34) AS length_census_intersection
FROM 
    census_states AS census, 
    gtfs_trips
WHERE ST_Intersects(census.shape, gtfs_trips.shape) AND gtfs_trips.agencyid = '456'
ORDER BY gtfs_trips.route_id, length DESC;

Output

 route_id |        id        | stateid |   length_self    | length_self_intersection | length_self_union | length_census_intersection
----------+------------------+---------+------------------+--------------------------+-------------------+----------------------------
 3099     | 128099A2591B3851 | 41      | 21.6528281794289 |         19.4581107991802 |  19.4585944407833 |           19.4581107991802
 3100     | 128102A2591B3851 | 41      | 16.8541847365424 |         16.7115511618067 |  16.7119665362219 |           16.7115511618067
 3101     | 128110A2591B3851 | 41      |  17.174500072783 |         14.9656590265065 |  14.9660310058522 |           14.9656590265065

The gtfs_trips.length value is the same as length_self here - a simple ST_Length of the shape from shapes.txt. This is the value that is used in the "Total Route Miles" reported in the reporting tool.

However, if you intersect the trip shape with any other geometry (including itself, which should be a perfect match), you get a slightly shorter distance. This is because the intersection dissolves the line's self-intersections. The reporting tool intersects with census blocks and reports the sum of these values for rural and urban miles. So, essentially that is why they don't sum up to the same value. In fact the numbers work out perfectly:

 sum_length_self  | sum_length_intersection | sum_length_self_union | sum_length_census_intersection
------------------+-------------------------+-----------------------+--------------------------------
 55.6815129887543 |        51.1353209874933 |      51.1353209874933 |               51.1353209874933

Where 55.68151 - 51.13532 = 4.54, exactly the amount of route miles that are missing.

PR coming.

See https://gis.stackexchange.com/questions/218154/postgis-different-result-using-st-length-and-st-intersection

irees commented 6 years ago

Attached is a query showing each route_id and the different measures of length, sorted by the difference between the plain ST_Length of the trip shape vs. ST_Length(ST_Union(trip.shape)). About 1/2 of all routes have no difference; another 1/3 have a small difference, and a small number of routes have a large difference. However, all differences are positive.

-- see https://gis.stackexchange.com/questions/218154/postgis-different-result-using-st-length-and-st-intersection

COPY(
    SELECT * FROM (
        SELECT DISTINCT ON (gtfs_trips.route_id)
            gtfs_trips.agencyid,
            gtfs_trips.route_id, 
            gtfs_trips.id,
            census.stateid,
            (ST_Length(ST_Transform(gtfs_trips.shape, 2993)) / 1609.34) AS length_self,
            (ST_Length(ST_Transform(ST_UnaryUnion(gtfs_trips.shape), 2993)) / 1609.34) as length_self_union,
            (ST_Length(ST_Transform(ST_Intersection(census.shape, gtfs_trips.shape), 2993)) / 1609.34) AS length_census_intersection,
            Round(    (   (ST_Length(ST_Transform(gtfs_trips.shape, 2993)) / 1609.34)  -  (ST_Length(ST_Transform(ST_UnaryUnion(gtfs_trips.shape), 2993)) / 1609.34)   )::NUMERIC    , 5) AS length_difference_self_union
        FROM 
            census_states AS census, 
            gtfs_trips
        WHERE 
            ST_Intersects(census.shape, gtfs_trips.shape) 
        ORDER BY gtfs_trips.route_id, gtfs_trips.length DESC
    ) AS a
    ORDER BY a.length_difference_self_union DESC
    )
TO STDOUT WITH CSV HEADER DELIMITER ',';

length_debugging2.xlsx

irees commented 6 years ago

This mostly shows up when a single trip travels the same stretch of road twice. That segment gets dissolved in ST_Intersection and only counted towards the length once.

Looking for a way to perform the intersection while preserving the overlapping segments of the trip...

ed-g commented 6 years ago

@irees random thought. Dissolve the trips into individual line segments using ST_DUMP, followed by ST_Instersect with the County polygon, followed by ST_Length, and then SUM the total lengths of all intersecting line segments.

irees commented 6 years ago

@ed-g Yup - I ended up with something similar. I broke the routes into individual line segments, then use that for the ST_Intersect with the census blocks. This way, each individual segment is correctly intersected and counted towards the total length without dissolving duplicates.

    route_segments AS (
        SELECT 
            routes.id as id,
            n - 1 as seg_id,
            ST_MakeLine(ST_PointN(routes.tripshape, n - 1), ST_PointN(routes.tripshape, n)) as shape
        FROM routes -- already filtered in a previous select
        CROSS JOIN generate_series(2, ST_NPoints(routes.tripshape)) as n      
    ),

    rrtmiles AS (
        SELECT 
            ST_Length(ST_Collect(ST_Intersection(route_segments.shape, census_blocks.shape))::geography) / 1609.34 AS rrtmiles
        FROM route_segments
        JOIN census_blocks ON ST_Intersects(route_segments.shape, census_blocks.shape)
        WHERE census_blocks.poptype = 'R'
    ),
irees commented 6 years ago

I added a fix for this particular report. There are several other places where a similar spatial join is made; I will try to find a generalized solution.

irees commented 6 years ago

I have a more generalized and performant fix on the way. It does require the creation of a relatively simple additional table; I will add creating this table to the updateTrips method and create it manually in the existing databases.

PPaulsonOregonDOT commented 6 years ago

@irees I just loaded up the report using the URL that I specified, and it defaulted to September 2017 database. I tried to change it to the April 2018 db using the db selector, and that functionality appears to be greyed out, which isn't an expected functionality for me. In looking at the original URL and the report from the 90 version of the tool, the reports appear to match up, but the September 17 to April 18 route numbers, specifically the urban route length increase significantly. This could be because of a service change, but I can't find evidence to suggest that, which means that it changed both calculations. From the discussion around dissolving of overlapping segments, in the urban/rural split but not in the route totals, I;m not sure that this is an expected resulted. The two reports of interest are below:

http://tna.trilliumtransit.com:8090/TNAtoolAPI-Webapp/AgenXReport.html?&agency=101&x=0.25&l=2&n=cws&dbindex=10&popYear=2010&areaid=null&type=0&username=admin&geotype=-1&geoid=null&nav=stateS-agencyS

56.7+26.31 = 83.01 /= 84.64

and

http://tna.trilliumtransit.com:8090/TNAtoolAPI-Webapp/AgenXReport.html?&agency=101&x=0.25&l=2&n=cws&dbindex=11&popYear=2010&areaid=null&type=0&username=admin&geotype=-1&geoid=null&nav=stateS-agencyS

26.44+64.68 = 91.12 /= 100.27

irees commented 6 years ago

@PPaulsonOregonDOT The update is now deployed and the database updated; the sum of the values should now be generally within 1/10th of a mile of each other (unless the agency includes route miles in other states.)