Closed isabelle-dr closed 9 months ago
Thank you for your reporting a bug. The issue has been placed in triage, the MobilityData team will follow-up on it.
Posting on the behalf of Marcy Jaffe with the National RTAP.
I'd like to offer training and recommend your MD Schedule Validator tools if it might be possible to reconsider as warning (vs. error) for equal_shape_distance_diff_coordinates
When Google Validates it is a warning
For Mobility Data it is an error - which I will need to advise in my trainings that they can ignore which might mean they want to ignore other errors - which I do not want
It will be nearly impossible and offer very little benefit to go in for literally 100 rows of data and delete the nearby points. Riders will not have a much better experience and I believe some agencies will not want to manage their GTFS! Might you please consider revising this advisory to a warning?
I am tempted to downgrade this notice, and propose a modification to the spec from:
Values must increase along with shape_pt_sequence
to:
Values must not decrease along with
shape_pt_sequence
@bdferris-v2 thoughts?
Is it possible to configure the rules' severities when using it? (Without rebuilding the .jar
/Docker image.)
For my use cases, I'd like to have the option to treat equal_shape_distance_diff_coordinates
as an error, even if it is decided here that it should be a warning.
If this is possible already, why not let people who focus on small/rural GTFS providers re-configure equal_shape_distance_diff_coordinates
to warning? I'm not saying that their perspective isn't relevant, but I think there is a trade-off to be made, between small providers – who might not have the technical resources to produce high-quality GTFS feeds – and big metropolitan or even national providers – who should encouraged to follow the spec (and its implicit intentions) rather strictly. Because I see this range of sophistication as inevitable, I'd rather opt for more strict defaults.
@isabelle-dr do you have a GTFS data set that exhibits this issue?
@KClough I have requested it! I agree with @derhuerst's point of view, this should stay an error in this validator (unless we change the spec).
Is it possible to configure the rules' severities when using it?
We are working on it 🙃
I have the impression that some vendor tools create this issue in a systemic way. It might be worth digging into how shapes.txt
is created...
An immediate action item might be to update the documentation to explain to users when it's acceptable to ignore this issue.
Edit on what I've just said: I am not entirely sure error is the right severity, and I'd like to take a data-driven approach to figure this out. @KClough: can you get the list of all the datasets in the Mobility Database that trigger this notice? We can attempt to draw patterns based on what we see in production.
@KClough here are three datasets that trigger this rule and also the equal_shape_distance_same_coordinates
.
Interior Alaska https://www.dropbox.com/s/5es8jexp0qpmsxd/interiorak_google_transit.zip?dl=0
This feed also has warning https://www.dropbox.com/s/h16ny11hlln3k9h/centraltransit_google_transit.zip?dl=0
While this agency has related warning & error https://www.dropbox.com/s/8c8zjbp89fff0do/makah_google_transit.zip?dl=0
And here is Marcy's answer to my question: how to you create the shape files.
Rural agencies without a GIS staff product higher quality GTFS with shapes generated using MyMaps - guided by the stops along the way - in the direction of travel >> see this file
https://www.google.com/maps/d/edit?mid=1WuMlxgYa-NCZLZAuyQdwFhDnq6gvsGk&usp=sharing
and then export the shape as KML , name the shape_id
At times a point is near another point along the route & voila - an error I've tried to search for duplicate values and delete - while with multiple routes there are too many values & this was not a quick fix
A comment from our slack channel on this issue
Because the shape is a GPS trace, is it possible to quantize the coordinates to the 1m level, ie remove the second coordinate with the same dist
It looks like a reasonable next step is to see how this user is generating shapes and if the file can be cleaned-up. I can commit to doing this in the next few weeks. @KClough, I'd still be interested to get the list of datasets that trigger this notice to have a closer look
@isabelle-dr Currently the notice data does not include the lat, long for the affected row. Adding that would be one way of getting all the data needed to test the Mobility Database data in the compiled reports that are run as part of the github actions and we could then potentially use some JSON querying tools to check stats on that ouput. This could be done as a test PR if it's not desirable to add those fields into the official output.
Alternatively or additionally, the Cal-ITP project could potentially query this information for the feeds in their database, if we want to make a request to them, but would be limited to California data.
After a discussion with @qcdyx, the strategy to solve this issue is:
shape_dist_traveled
field is the same.actualDistanceBetweenShapePoints
field from all datasets from the Mobility Database that trigger equal_shape_distance_diff_coordinates
.frequency
on the y
and shape_dist_traveled
value on the x
. Then, assess based on what we see:
equal_shape_distance_same_coordinates
(which is a warning)Do we have agreement on downgrading equal_shape_distance_diff_coordinates from error to warning?
Based on my observation, equal_shape_distance_diff_coordinates happens when two consecutive points are very close. For example, the actualDistanceBetweenShapePoints for previous point (lat 48.36919, long -124.63073) and current point (lat 48.36919, long -124.63074) is 0, so these two consecutive points have equal shape_dist_traveled
and but different lat/lon coordinates inshapes.txt
. Based on the Haversine formula, the distance between these two points is approximately 0.5702 meters, which is very close to 0. The getDistance method that used by GTFS validator is from com.google.common.geometry.S2LatLng. It does some internal rounding or precision limitations and might not handle very close points accurately.
I prefer downgrading equal_shape_distance_diff_coordinates to a warning than searching for other substitute geometry libraries. @isabelle-dr @emmambd
@qcdyx Hey Jingsi! The goal right now is to conduct analytics on when equal_shape_distance_diff_coordinates to make a decision — it's too soon to decide about severity at the moment without doing an evaluation of the Mobility Database feeds that we use in acceptance tests, as specified in the next step heading here
The getDistance method that used by GTFS validator is from com.google.common.geometry.S2LatLng. It does some internal rounding or precision limitations and might not handle very close points accurately.
This is interesting! I believe up to this point we thought that the validator was using the actual shape_dist_traveled points defined by the feeds, not doing any additional interpretation. Let's talk about this more offline and then I'll circle back here to document next steps.
An update on our approach on this PR: https://github.com/MobilityData/gtfs-validator/pull/1675#issuecomment-1955249386
Current approach is to implement a threshold of 1.11m on distances between shape point pairs for the ERROR (to capture any "same" values that result from precision/rounding issues at 5 decimal places for lat long values) , and create a WARNING for any distances that are less than that.
We plan to include this in the upcoming release, and will only take next steps to make this threshold more permissive if we receive user feedback on it.
I'm going to close issue based on #1675 and re-open it if there is new community feedback after this release that indicates we should make the threshold more permissive.
Problem We hear from users that
equal_shape_distance_diff_coordinates
(which is currently an error) is often present in datasets that contain shapes, and the work needed to fix this issue in the datasets gives an incentive for users not to use shapes at all.This rule was initially implemented in PR #1083, alongside two others:
decreasing_shape_distance
: error, andequal_shape_distance_same_coordinates
: warning,with the intention of validating the shapes.xt Reference:
What to do Re-visit if the conditions that trigger
equal_shape_distance_diff_coordinates
should really be an error: talk to the community, and analyze production data. Consider lowering the severity to a warning and opening a discussion in the specification to make it clearer.Next Steps from Most Recent Comment
After a discussion with @qcdyx, the strategy to solve this issue is:
we are assuming that a portion of these notices come from a precision issue of the software creating shape files: there are two very close shape points that have distinct lat/lon values, but the shape_dist_traveled field is the same. pull the actualDistanceBetweenShapePoints field from all datasets from the Mobility Database that trigger equal_shape_distance_diff_coordinates. plot it on a histogram with frequency on the y and latitude and longitude diff value on the x. Then, assess based on what we see: Spreadsheet values (example from @cka-y's past analytics work in https://github.com/MobilityData/mobility-database-catalogs/pull/275/files)
Once this spreadsheet is created, we can see if there's a common threshold for actualDistanceBetweenShapePoints (how far apart are they typically for feeds generating this error?)
do we have a clear threshold that has the majority of the values below it? if so, would it be reasonable to consider values before the threshold as equal_shape_distance_same_coordinates (which is a warning) if so: does this need a spec amendment?
Additional Context These three rules were initially created to replace the
decreasing_or_equal_shape_distance
notice because this rule was triggered by two things that deserved to be treated differently:shape_dist_traveled
decreases between two consecutive shape points (which is a clear violation of the spec)shape_dist_traveled
is equal between two consecutive shape points (also a violation but is not as big of a problem)By digging deeper into number 2 above, we noticed that we were seeing two cases in production data: 2.1
shape_dist_traveled
is equal between two consecutive shape points and the lat/long coordinates are equal (which seems fine) 2.2shape_dist_traveled
is equal between two consecutive shape points and the lat/long coordinates are not equal (which seems like a problem, but it could be caused by the scheduling software that roundsshape_dist_traveled
when the two shape points are really close)We went ahead and made our own interpretation of the specification based on what we saw in the production data: condition 2.1 would be a warning, whereas conditions 1 & 2.2 would be errors, which is slightly less strict than the spec that strictly mentions "must increase".