Closed emmambd closed 2 months ago
As I started to extract the locations from the geolocation information, I focused on extracting the locations of the following points:
To extract the locations from geo-coordinates, I utilized the Nominatim reverse geocoding capabilities. More information can be found here: Nominatim API.
Here are the full results indicating whether any location was matched with the inputted location using any of the methods for each feed: matching_results.csv
The table below shows the accuracy of matches for country codes, subdivisions, and municipalities using different methods. The percentages indicate how often the extracted locations matched the original locations.
Method | Country Code Match | Subdivision Match | Municipality Match |
---|---|---|---|
Center of the Bounding Box | 97.3% | 73.7% | 28.7% |
Maximum Latitude/Longitude | 93.6% | 67.2% | 9.4% |
Minimum Latitude/Longitude | 92.4% | 66.2% | 9.7% |
Any of the methods | 98.2% | 75.6% | 31.4% |
Since the center of the bounding box alone is almost as effective as combining all three methods (97.3% vs. 98.2% for country code), I propose we proceed with this approach for the first iteration. Additionally, the Nominatim API accepts an `Accept-Language` header that we can specify as `en` for English. This will allow us to obtain location information in both the default language of the region and in English. To implement this, we will need to:
- `Location` entity and the `GTFSDataset` entity, as the location will now be extracted from the bounding box of the dataset.
- `extract_bb` cloud function, which could be renamed to `extract_location`.
- `mobility-database-catalog`.
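
As a rough illustration of the center-of-bounding-box approach with the `Accept-Language` header, here is a minimal sketch. It only builds the reverse geocoding requests and does not send them. The function names, the `User-Agent` value, and the coordinates are made up for illustration, and the public Nominatim endpoint is assumed; the actual cloud function may look quite different.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the public Nominatim endpoint; a self-hosted instance would differ.
NOMINATIM_REVERSE_URL = "https://nominatim.openstreetmap.org/reverse"


def bounding_box_center(min_lat, max_lat, min_lon, max_lon):
    """Return the (lat, lon) midpoint of a bounding box.

    Assumes the box does not cross the antimeridian (180 degrees longitude).
    """
    return (min_lat + max_lat) / 2, (min_lon + max_lon) / 2


def build_reverse_request(lat, lon, language=None):
    """Build (but do not send) a Nominatim reverse geocoding request."""
    query = urlencode(
        {"lat": lat, "lon": lon, "format": "jsonv2", "addressdetails": 1}
    )
    # Hypothetical application name; Nominatim asks clients to identify themselves.
    headers = {"User-Agent": "location-extraction-sketch"}
    if language is not None:
        headers["Accept-Language"] = language
    return Request(f"{NOMINATIM_REVERSE_URL}?{query}", headers=headers)


# Illustrative bounding box, not taken from any real feed.
lat, lon = bounding_box_center(45.0, 46.0, -74.0, -73.0)
default_request = build_reverse_request(lat, lon)        # region's default language
english_request = build_reverse_request(lat, lon, "en")  # English names
```

Issuing both requests per feed would give the two language variants mentioned above, at the cost of two API calls per dataset.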
@cka-y 👏 This is perfect! I think we had an outstanding decision about whether to use the municipality that was manually input vs. rely on Nominatim's system. I'll do a deep dive tomorrow into both the subdivision and municipality data to see the discrepancies and make a decision about a path forward.
I did a quick review of the country_code FALSE matches, and it looks like they mainly fall into 3 different use cases (no action needs to be taken on this, but it will be helpful to document for the future, when we want to do more complex location calculations!)
I think #3 might actually be a bonus of this approach (hey, you can't find this feed because the location is super wrong! maybe consider fixing it). Once we have a transit provider search option, we'll be able to highlight these easily
@cka-y Update re: subdivision (didn't have time to get to municipality today). After removing all the FALSE cases where the issue was spelling, language differences, both the original subdivision and new subdivision being blank, or the manual input being incorrect, the rate of subdivision match is actually 90%, instead of 73.7%.
Cases where it's wrong are because of:
This isn't a blocker, but I'm wondering if it's possible to check the number of subdivisions or municipalities that are included within the bounding box, and if we compute past, say, 5 subdivisions, then we don't include subdivision at all and only include country code.
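
To make the idea concrete, here is a small sketch of that cutoff. It assumes we already have the subdivision names found for a feed (how they are collected is left open), and the function name and the returned dictionary shape are hypothetical.

```python
def location_fields(country_code, subdivisions, max_subdivisions=5):
    """Keep subdivisions only when the feed spans few distinct ones.

    `subdivisions` is the list of subdivision names found for the feed;
    empty or None entries are ignored.
    """
    distinct = sorted({name for name in subdivisions if name})
    if len(distinct) > max_subdivisions:
        # Too many subdivisions: fall back to country code only.
        distinct = []
    return {"country_code": country_code, "subdivisions": distinct}
```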
@emmambd
> Feeds where it looks like Nominatim has gaps in their subdivision list (particularly for Finland and Thailand). It looks like there are several cases where municipality is right but the subdivision is empty. There are also cases where it looks like the subdivision code is set, but the subdivision full name is empty.
Would you mind providing the list of feeds where this issue occurs? I can review them individually. It’s possible that the subdivision information is located in another field, as I currently extract it from the `state` or `province` fields in the response if they exist. These places might use different terminology for those areas.
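
For illustration, the lookup could be extended with fallback keys along these lines. This is only a sketch: `state` and `province` are the two fields mentioned above, while the extra keys are assumptions about where other countries might carry the value and would need checking against real responses for the affected feeds.

```python
# Keys are checked in order. "state" and "province" come from the current
# extraction; the remaining keys are assumed fallbacks, not verified.
SUBDIVISION_KEYS = ("state", "province", "region", "state_district", "county")


def extract_subdivision(address):
    """Return the first non-empty subdivision-like field of an address dict, else None."""
    for key in SUBDIVISION_KEYS:
        value = address.get(key)
        if value:
            return value
    return None


# Hand-written address dicts, not actual Nominatim output.
print(extract_subdivision({"country_code": "ca", "state": "Quebec"}))
print(extract_subdivision({"country_code": "xx", "region": "Some Region"}))
print(extract_subdivision({"country_code": "xx"}))
```

The order of the tuple matters: a broader field listed first would win over a narrower one, so the fallbacks should only be consulted once the primary fields are missing.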
Regarding the proposed solution of utilizing a threshold of 5 for different subdivisions, it could indeed work. However, it might significantly increase processing time. For feeds where all locations are the same, this approach would require traversing all the stops, which could be overly cumbersome and inefficient.
As an alternative, we could consider using the area of the bounding box as a temporary fix or selecting a uniformly distributed subset of, say, 10 stops per feed. If at least 50% (5 points) return different subdivisions, we set the subdivision to `null`. Otherwise, we use the majority vote of all selected stops. This approach would need further investigation but could offer a more balanced solution.
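
A minimal sketch of the sampling and voting idea follows. Two assumptions are baked in: "uniformly distributed" is approximated by taking evenly spaced positions in the stop list (not a spatially uniform sample), and "return different subdivisions" is read as "disagree with the most common subdivision". Both function names are hypothetical.

```python
from collections import Counter


def sample_evenly(stops, sample_size=10):
    """Pick up to sample_size stops at evenly spaced positions in the list."""
    if len(stops) <= sample_size:
        return list(stops)
    step = len(stops) / sample_size
    return [stops[int(i * step)] for i in range(sample_size)]


def vote_subdivision(subdivisions, disagreement_threshold=0.5):
    """Return the majority subdivision, or None when too many samples disagree.

    `subdivisions` holds one reverse-geocoded subdivision per sampled stop.
    """
    if not subdivisions:
        return None
    winner, votes = Counter(subdivisions).most_common(1)[0]
    disagreeing = len(subdivisions) - votes
    if disagreeing / len(subdivisions) >= disagreement_threshold:
        return None
    return winner
```

With 10 sampled stops this costs at most 10 reverse geocoding calls per feed, regardless of how many stops the feed contains.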
To expedite future investigations, we might want to select a subset of feeds for analysis.
@cka-y
1) re: Nominatim gaps: I've added all the feeds that didn't have a subdivision in Nominatim but there was in the original data to review.
2) I think this is a common enough issue that it's worth investigation (though not a blocker for implementing this). Particularly thinking about feeds with multiple municipalities when there should only be state/province displayed. It should be timeboxed at 1-2 days max for exploration.
Decision made in #618 - @cka-y I'd like to see the resulting data in some way again when you adjust the formula/script
Describe the problem
Because of issues with normalizing our location data for search, and user feedback asking us to generate the location from the GTFS feed, we want to determine the location of a feed from the GTFS data itself.
Proposed solution
The goal of this issue is to decide on an approach for automatically generating the location from a GTFS feed.
Current plan to approach: