emmambd commented 2 months ago

Describe the problem

Because normalizing our location data for search has been problematic, and users have asked us to generate the location from the GTFS feed, we want to determine a feed's location from the GTFS data itself.

Proposed solution

The goal of this issue is to decide on an approach for automatically generating the location from a GTFS feed.

Current plan:

Try several different APIs/packages and calculate the location from a randomly distributed sample of stops in stops.txt.
The result should include Country, Region, and Municipality. If there are more than 10 of any value (e.g., 10+ municipalities), only regions should be shown.
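
A minimal sketch of the display rule above, assuming each sampled stop has already been resolved to a (country, region, municipality) tuple. The function name `summarize_location` and the strict "more than 10" reading of the threshold are assumptions made for illustration, and only the municipality case is handled here:

```python
MAX_DISTINCT = 10  # threshold taken from the plan above


def summarize_location(places, max_distinct=MAX_DISTINCT):
    """Collapse per-stop places into the values to display for a feed.

    `places` is an iterable of (country, region, municipality) tuples.
    Empty or missing values are ignored.
    """
    places = list(places)
    summary = {
        "countries": sorted({c for c, _, _ in places if c}),
        "regions": sorted({r for _, r, _ in places if r}),
        "municipalities": sorted({m for _, _, m in places if m}),
    }
    # Assumption: "more than 10" is read strictly (> 10, not >= 10).
    if len(summary["municipalities"]) > max_distinct:
        summary["municipalities"] = []  # show regions only
    return summary
```

For example, a feed whose sampled stops fall in 11 different municipalities of one region would be displayed with its country and region only.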

Alternatives you've considered

No response

Additional context

No response

cka-y commented 2 months ago

Investigation Results

As a first step in extracting locations from the geolocation information, I focused on the following points:

Center of the bounding box
Minimum latitude and longitude
Maximum latitude and longitude
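
As an illustration of these three points, here is a minimal sketch that derives them from the contents of stops.txt. `stop_lat` and `stop_lon` are the standard GTFS column names; the function name `bounding_box_points` is hypothetical, and the simple midpoint ignores feeds that cross the antimeridian:

```python
import csv
import io


def bounding_box_points(stops_csv_text):
    """Return the center, minimum, and maximum corner of the stops' bounding box."""
    lats, lons = [], []
    for row in csv.DictReader(io.StringIO(stops_csv_text)):
        try:
            lat = float(row["stop_lat"])
            lon = float(row["stop_lon"])
        except (KeyError, TypeError, ValueError):
            continue  # skip stops with missing or malformed coordinates
        lats.append(lat)
        lons.append(lon)
    if not lats:
        raise ValueError("no usable stop coordinates")
    min_pt = (min(lats), min(lons))
    max_pt = (max(lats), max(lons))
    center = ((min_pt[0] + max_pt[0]) / 2, (min_pt[1] + max_pt[1]) / 2)
    return {"center": center, "min": min_pt, "max": max_pt}
```

Each returned point is a (latitude, longitude) pair that can be passed to a reverse geocoder.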

To extract the locations from geo-coordinates, I used Nominatim's reverse geocoding. More information can be found here: Nominatim API.

Here are the full results, indicating for each feed whether the location extracted by any of the methods matched the manually input location: matching_results.csv

Statistics:

The table below shows the accuracy of matches for country codes, subdivisions, and municipalities using different methods. The percentages indicate how often the extracted locations matched the original locations.

Method	Country Code Match	Subdivision Match	Municipality Match
Center of the Bounding Box	97.3%	73.7%	28.7%
Maximum Latitude/Longitude	93.6%	67.2%	9.4%
Minimum Latitude/Longitude	92.4%	66.2%	9.7%
Any of the methods	98.2%	75.6%	31.4%

Next Steps

Taking into account that the method using the center of the bounding box is almost as effective as computing the location with all three methods, I propose we proceed with this approach for the first iteration. Additionally, the Nominatim API accepts an Accept-Language header that we can specify as en for English. This will allow us to obtain location information in both the default language of the region and in English. To implement this, we will need to:

Modify the database entity to support multiple languages.
Add a link between the Location entity and the GTFSDataset entity, as the location will now be extracted from the bounding box of the dataset.
Modify/set the location in the datasets by calling the Nominatim API in both the default and English languages. This can be done as part of the extract_bb cloud function, which could be renamed to extract_location.
Modify the API response for the feed to set the location as the latest dataset location if that dataset exists; otherwise, resort to the location information extracted from the mobility-database-catalog.
Modify the materialized view utilized by the search endpoint to fix issues related to searching for locations in English.
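
A rough sketch of the two Nominatim calls described above (default language and English), assuming the public endpoint at nominatim.openstreetmap.org. The helper names and the User-Agent string are placeholders, not the project's actual code. The requests are only built here, not sent:

```python
from urllib.parse import urlencode
from urllib.request import Request

NOMINATIM_REVERSE_URL = "https://nominatim.openstreetmap.org/reverse"


def build_reverse_request(lat, lon, language=None,
                          user_agent="example-location-extractor/0.1"):
    """Build (but do not send) a reverse geocoding request."""
    query = urlencode({"lat": lat, "lon": lon,
                       "format": "jsonv2", "addressdetails": 1})
    # Placeholder User-Agent: a real deployment should identify itself.
    headers = {"User-Agent": user_agent}
    if language:
        # With no Accept-Language header, names come back in the local language.
        headers["Accept-Language"] = language
    return Request(f"{NOMINATIM_REVERSE_URL}?{query}", headers=headers)


def build_bilingual_requests(lat, lon):
    """One request for the region's default language, one for English."""
    return {
        "default": build_reverse_request(lat, lon),
        "en": build_reverse_request(lat, lon, language="en"),
    }
```

Each request could then be sent with `urllib.request.urlopen` and the JSON body parsed for its `address` object, keeping the public instance's rate limits in mind.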

emmambd commented 2 months ago

@cka-y 👏 This is perfect! I think we had an outstanding decision about whether to use the municipality that was manually input vs. rely on Nominatim's system. I'll do a deep dive tomorrow into both the subdivision and municipality data to see the discrepancies and make a decision about a path forward.

I did a quick review of the country_code FALSE matches, and it looks like they mainly fall into 3 different use cases (no action needs to be taken on this, but documenting them will be helpful in the future when we want to do more complex location calculations!)

1) Feeds that include commuter rail lines or ferries that span multiple countries in Europe
2) Feeds that don't have a bounding box in QA, so I can't check the issue
3) Feeds where the actual bounding box looks wrong (e.g., a regional/county feed in England whose bounding box extends to the bottom of Africa)

I think #3 might actually be a bonus of this approach (hey, you can't find this feed because the location is super wrong! maybe consider fixing it). Once we have a transit provider search option, we'll be able to highlight these easily

#1 is an actual use case to resolve in the future, but fine for now.

emmambd commented 2 months ago

@cka-y Update re: subdivision (didn't have time to get to municipality today). After removing all the FALSE cases where the issue was spelling or language differences, where both the original and new subdivision were blank, or where the manual input was incorrect, the subdivision match rate is actually 90% rather than 73.7%.

The remaining mismatches are caused by:

Bounding boxes that are actually wrong, due to a point_near_origin error
Feeds for intercity or interregion rail or bus lines, where the centre of the bounding box does not indicate where the transit provider is "based"
Feeds where it looks like Nominatim has gaps in their subdivision list (particularly for Finland and Thailand). There appear to be several cases where the municipality is right but the subdivision is empty, and others where the subdivision code is set but the subdivision's full name is empty.

This isn't a blocker, but I'm wondering if it's possible to check the number of subdivisions or municipalities that fall within the bounding box, and if we count more than, say, 5 subdivisions, omit the subdivision entirely and include only the country code.

cka-y commented 2 months ago

@emmambd

Feeds where it looks like Nominatim has gaps in their subdivision list (particularly for Finland and Thailand). There appear to be several cases where the municipality is right but the subdivision is empty, and others where the subdivision code is set but the subdivision's full name is empty.

Would you mind providing the list of feeds where this issue occurs? I can review them individually. It’s possible that the subdivision information is located in another field, as I currently extract it from the state or province fields in the response if they exist. These places might use different terminology for those areas.
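
To illustrate the idea of looking in other fields, here is a minimal sketch of a fallback lookup over the `address` object of a Nominatim response. Only `state` and `province` come from the current extraction described above; the extra keys (`region`, `state_district`, `county`) and their order are assumptions about where the subdivision might live, not a confirmed fix:

```python
# "state" and "province" are what the extraction reads today; the rest are
# assumed candidates to try when those two are missing or empty.
SUBDIVISION_KEYS = ("state", "province", "region", "state_district", "county")


def extract_subdivision(address, keys=SUBDIVISION_KEYS):
    """Return (key, value) for the first non-empty subdivision field.

    `address` is the "address" dict of a reverse geocoding response.
    Returns (None, None) when no candidate field is populated.
    """
    for key in keys:
        value = address.get(key)
        if value:
            return key, value
    return None, None
```

Returning the key alongside the value makes it easier to see, feed by feed, which field actually supplied the subdivision.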

Regarding the proposed threshold of 5 distinct subdivisions, it could indeed work. However, it might significantly increase processing time. For feeds where all locations are the same, this approach would require traversing all the stops, which could be overly cumbersome and inefficient.

As an alternative, we could consider using the area of the bounding box as a temporary fix or selecting a uniformly distributed subset of, say, 10 stops per feed. If at least 50% (5 points) return different subdivisions, we set the subdivision to null. Otherwise, we use the majority vote of all selected stops. This approach would need further investigation but could offer a more balanced solution.
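
The sampling-and-vote idea could be sketched roughly as follows. This is one reading of the proposal, not a settled design: "uniformly distributed" is interpreted as evenly spaced by position in stops.txt, and the subdivision is nulled when at least half of the sampled points disagree with the most common value. Both function names are hypothetical:

```python
from collections import Counter


def evenly_spaced_sample(items, k=10):
    """Pick up to k items spread evenly across the input order."""
    items = list(items)
    if len(items) <= k:
        return items
    step = len(items) / k
    return [items[int(i * step)] for i in range(k)]


def vote_subdivision(subdivisions):
    """Majority vote over the subdivisions of the sampled stops.

    Returns None when at least 50% of the points disagree with the most
    common subdivision (empty values count as disagreement).
    """
    subdivisions = list(subdivisions)
    counts = Counter(s for s in subdivisions if s)
    if not counts:
        return None
    winner, votes = counts.most_common(1)[0]
    dissenting = len(subdivisions) - votes
    if dissenting * 2 >= len(subdivisions):
        return None
    return winner
```

With 10 sampled stops, 6 votes for one subdivision keeps it, while a 5/5 split yields no subdivision.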

To expedite future investigations, we might want to select a subset of feeds for analysis.

emmambd commented 2 months ago

@cka-y

1) re: Nominatim gaps: I've added, for review, all the feeds that had no subdivision in Nominatim but did have one in the original data.

2) I think this is a common enough issue that it's worth investigating (though not a blocker for implementing this). I'm thinking particularly of feeds that span multiple municipalities, where only the state/province should be displayed. The exploration should be timeboxed to 1-2 days at most.

emmambd commented 2 months ago

Decision made in #618 - @cka-y I'd like to see the resulting data in some way again when you adjust the formula/script

MobilityData / mobility-feed-api

Investigate approach to generating location programatically from stops.txt #564
