Support for multiple data sources

MrKrisKrisu commented 1 year ago

Is your feature request related to a problem? Please describe. Currently, Traewelling relies on DB-Rest as its only data source, which is a wrapper for the HAFAS of Deutsche Bahn. This limits our data availability to Germany and a few major connections in neighboring countries. Trams, buses, and other public transportation modes in foreign countries are often not included. To provide a more comprehensive service, we need to expand our data sources to include information from multiple providers.

Describe the solution you'd like We aim to enhance Träwelling by integrating multiple data sources to gather public transportation data. This would allow users to check in for rides across borders. We are seeking data sources that cover a wide range of locations, both within and outside of Germany, to give users more extensive coverage.

Describe alternatives you've considered /

Additional context /

Expanding our data sources would greatly improve the usability and inclusivity of our service, allowing users to benefit from a wider range of public transportation options both domestically and internationally. Any suggestions or recommendations for reliable data sources with comprehensive coverage would be greatly appreciated.

Please feel free to contribute any relevant information, ideas, or suggestions for potential data sources to this issue. Thank you for your support!

Data Source	Area	Link / Context
DB Rest / HAFAS DB	Germany + some foreign trips	https://github.com/derhuerst/db-rest
HAFAS for other areas	...	https://gist.github.com/derhuerst/2b7ed83bfa5f115125a5 (Thanks @derhuerst)
EFA for some German areas		https://www.kvv.de/fahrplan/fahrplaene/open-data.html, https://www.vbn.de/service/entwicklerinfos/opendata-und-openservice, https://www.connect-fahrplanauskunft.de/index.php?id=opendata
SBB	Switzerland	https://data.sbb.ch/explore/?sort=modified&refine.keyword=Verkehr
ÖBB	Austria	https://data.oebb.at/#default/home

derhuerst commented 1 year ago

The transport-apis project has many transit APIs listed; it intends to be the "source of truth" for basic information about these APIs (their endpoints, authentication mechanisms, licensing scheme, etc.), so that projects don't each need to keep track of these changes individually. If there is anything missing over there, please create an Issue or submit a PR!

derhuerst commented 1 year ago

Regarding the actual idea being discussed here: I think that many tricky technical and UX questions arise once one starts having >1 underlying data source:

Shall the data sources be completely separate? E.g. when I check into a train/trip as represented by the DB HAFAS, and another person checks into that (same real-world) train/trip as represented by an SNCF data source, will we see each other as being on the same train/trip?
If we have built a mechanism to identify two data items as being about the same (one real-world) train/trip, do we form a new "proprietary" ID that "masks" the underlying DB/SNCF IDs? If we do this, then we need to either a) keep a mapping between them for a long time, or b) make the new ID contain the underlying data source IDs somehow.
If we have tackled the above items, how do we make sure the UX is not confusing? Let's assume we have decided to either a) merge the properties from both data sources about one real-world item, or b) show only one set of properties. How do we make sure users can find the train/trip they're looking for if they're used to a very specific naming scheme (e.g. "RE 1" vs "RE 73793", "TGV INOUI 123" vs "TGV 123")?
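As a rough illustration of option b) from the second question (a new ID that contains the underlying data source IDs), here is a minimal sketch. The function names, the source names `db` and `sncf`, and the example trip IDs are made up for illustration and are not taken from any of the projects mentioned here.

```python
import base64
import json


def encode_combined_id(source_ids: dict[str, str]) -> str:
    """Pack {source name: source-specific trip ID} into one opaque ID."""
    # Sorting the keys makes the result independent of insertion order,
    # so the same set of underlying IDs always yields the same combined ID.
    payload = json.dumps(source_ids, sort_keys=True, separators=(",", ":"))
    return base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")


def decode_combined_id(combined_id: str) -> dict[str, str]:
    """Recover the underlying per-source IDs from a combined ID."""
    raw = base64.urlsafe_b64decode(combined_id.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


# Hypothetical IDs for the same real-world trip in two data sources.
combined = encode_combined_id({"db": "1|234|5|80|1012024", "sncf": "OCESN123"})
print(decode_combined_id(combined))
```

With this approach no long-lived mapping table is needed, at the cost of IDs that grow with the number of sources and change whenever another source is matched to the same trip.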

I have brainstormed more about some technical aspects of this topic in Why linked open transit data? and stable-public-transport-ids, and experimented with fusing >1 (HAFAS-like) data source in pan-european-public-transport.

TLDR: Adding another data source is technically feasible, but how do we create a usable UX from that?

HerrLevin commented 1 year ago

I'm currently working on a really hacky POC to inject GTFS data into the DB-Rest response so that we might be able to combine multiple data sources without having to drastically change our project's internal structure. The repo will be made publicly available around the start of the GPN next week.

Currently, it forwards the departure request directly to db-rest v5 while simultaneously searching the GTFS data for departures at that IBNR. The departures provided via GTFS are then injected into the JSON. To determine which endpoint to call when we get a journey request, I simply took inspiration from the current HAFAS trip IDs and added a "GTFS|{gtfs-id}" prefix to the trip IDs. This might be extended to combine multiple APIs from multiple (overlapping) data sources, but the first step might be to add ÖBB, SNCF, SBB, etc., and restrict them to regular public transport like buses and trams, which are not covered by DB's HAFAS system.
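A minimal sketch of how the prefix scheme and the injection step could look. This is not the code of the POC: the function names are hypothetical, and the departures are simplified dicts that assume only a `tripId` and a `when` field (ISO 8601 strings with the same UTC offset, so they sort correctly as plain strings).

```python
GTFS_PREFIX = "GTFS|"


def split_trip_id(trip_id: str) -> tuple[str, str]:
    """Return (source, original ID) so a journey request can be routed."""
    if trip_id.startswith(GTFS_PREFIX):
        # Strip only the prefix; the remainder may itself contain "|".
        return "gtfs", trip_id[len(GTFS_PREFIX):]
    # Anything without the prefix is treated as a regular HAFAS trip ID.
    return "db-rest", trip_id


def merge_departures(db_rest_departures: list[dict], gtfs_departures: list[dict]) -> list[dict]:
    """Inject GTFS departures into the db-rest departure list."""
    merged = list(db_rest_departures)
    for departure in gtfs_departures:
        # Mark the trip ID so later requests for this trip go to the GTFS backend.
        merged.append({**departure, "tripId": GTFS_PREFIX + departure["tripId"]})
    merged.sort(key=lambda d: d["when"])
    return merged


print(split_trip_id("GTFS|tram-4711"))
```

Because HAFAS trip IDs themselves contain `|` characters, the sketch checks for the exact prefix instead of splitting on every separator.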

I might have a few ideas to combat your above-mentioned problems:

In our case: (mostly) yes. We want to use the "official" data endpoint for each vehicle, e.g. Karlsruhe public transport uses its open data endpoint, ICEs use DB HAFAS, TGVs use SNCF's, and so on. (This raises a bigger question: What do we do with trains crossing borders? Is the TGV data provided by SNCF more or less accurate than DB's? Judging just by DB's polylines, everything outside of state lines is "bad data".)
This will be done w/ a proprietary combination of some proprietary prefixes and the API's original ID.
This is the biggest question in my opinion b/c it just opens even more questions. My current ideas are the following:
- We need to keep track of which APIs should be used for which station. This could be done by using a modified version of the GTFS stops table. A general primary identifier could be IFOPT as the parent station, with each API's internal station ID and a reference to the station as children. Maybe even additional information such as "only long-distance trains" could be added.
- In my opinion, the "correct way" of displaying the line name, etc. is using what the "correct" API is providing. However, this could be extended by providing additional information in some sort of translation schema since it will indeed be confusing to end users in some situations. I'm not completely happy with this approach but it's the best I came up with until now.
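The station idea from the first bullet could be modelled roughly as follows. This is only a sketch under assumptions: the class and field names, the source names, and all IDs are placeholders, and the interpretation of "only long-distance trains" as a per-source flag is one possible reading.

```python
from dataclasses import dataclass, field


@dataclass
class SourceStation:
    source: str                       # hypothetical source name, e.g. "db-hafas"
    station_id: str                   # the source's own internal station ID
    long_distance_only: bool = False  # use this source only for long-distance trains


@dataclass
class Station:
    ifopt: str                        # parent identifier shared by all sources
    name: str
    children: list[SourceStation] = field(default_factory=list)


def sources_for(station: Station, long_distance: bool) -> list[SourceStation]:
    """Select the source entries to query for the requested kind of traffic."""
    return [
        child for child in station.children
        if long_distance or not child.long_distance_only
    ]


# Placeholder data; the IFOPT and station IDs are not real.
station = Station(
    ifopt="xx:00000:1",
    name="Example Hbf",
    children=[
        SourceStation("db-hafas", "0000001", long_distance_only=True),
        SourceStation("local-gtfs", "stop-1"),
    ],
)
print([c.source for c in sources_for(station, long_distance=False)])
```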

This is all in its infancy at the moment but already describes the rough direction I'd like to go.

P.S.: speaking of GPN - will we see you there? 👀

vainamov commented 1 year ago

It's unfortunately limited to trains within Finland, but the Fintraffic API is awesome: https://www.digitraffic.fi/en/railway-traffic/

derhuerst commented 1 year ago

> I'm currently working on a really hacky POC to inject GTFS data into the DB-Rest response so that we might be able to combine multiple data sources without having to drastically change our project's internal structure. The repo will be made publicly available around the start of the GPN next week.
>
> Currently, it forwards the departure request directly to db-rest v5 while simultaneously searching the GTFS data for departures at that IBNR. The departures provided via GTFS are then injected into the JSON. To determine which endpoint to call when we get a journey request, I simply took inspiration from the current HAFAS trip IDs and added a "GTFS|{gtfs-id}" prefix to the trip IDs. This might be extended to combine multiple APIs from multiple (overlapping) data sources, but the first step might be to add ÖBB, SNCF, SBB, etc., and restrict them to regular public transport like buses and trams, which are not covered by DB's HAFAS system.

This is very similar to what I've been doing with match-gtfs-rt-to-gtfs: It tries to match data from a HAFAS API (e.g. the DB one) to a GTFS dataset by matching their stop/trip/route IDs/names/locations.

Over time, I've invested quite a lot of effort to make the matching logic fast and flexible enough. For example, it can match a HAFAS stop with a GTFS stop even when they don't share an ID (IBNR), have slightly different names, and slightly different geolocations.
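To give an idea of what such matching can involve, here is a simplified sketch. It is not the logic of match-gtfs-rt-to-gtfs: the normalisation rules, the 200 m threshold, the dict shape, and the example IDs and coordinates are assumptions chosen for illustration.

```python
import math
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Reduce a stop name to a comparable form."""
    lowered = name.lower().replace("ß", "ss")
    decomposed = unicodedata.normalize("NFKD", lowered)
    # Drop combining marks so "ü" and "u" compare as equal.
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    plain = plain.replace("hauptbahnhof", "hbf")
    return re.sub(r"[^a-z0-9]+", " ", plain).strip()


def distance_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres (haversine formula)."""
    radius = 6_371_000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def stops_match(a: dict, b: dict, max_distance_m: float = 200.0) -> bool:
    """Match on a shared ID, else on normalised name plus proximity."""
    if a.get("id") and a.get("id") == b.get("id"):
        return True
    same_name = normalize_name(a["name"]) == normalize_name(b["name"])
    close = distance_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_distance_m
    return same_name and close


# Illustrative values only.
hafas_stop = {"id": "hafas-1", "name": "Karlsruhe Hbf", "lat": 48.9935, "lon": 8.4019}
gtfs_stop = {"id": "gtfs-1", "name": "Karlsruhe Hauptbahnhof", "lat": 48.9938, "lon": 8.4022}
print(stops_match(hafas_stop, gtfs_stop))
```

A real matcher would likely need fuzzier name comparison and an index over locations to stay fast on large datasets.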

Unfortunately, the code has many indirections and isn't well-documented. Also, it's been a while since I've tested it with the DB HAFAS endpoint. But if you're interested, take a look!

> > do we form a new "proprietary" ID that "masks" the underlying DB/SNCF IDs?
>
> This will be done w/ a proprietary combination of some proprietary prefixes and the API's original ID.

You might also want to look into Multiformats as a generalized and future-proof mechanism for "combining IDs".

> > […] how do we make sure the UX is not confusing. […] How do we make sure users can find the train/trip they're looking for if they're used to a very specific naming scheme (e.g. "RE 1" vs "RE 73793", "TGV INOUI 123" vs "TGV 123")?
>
> We need to keep track of which APIs should be used for which station. […] A general primary identifier could be IFOPT as the parent station with the APIs internal station ID and a reference to the station as children. […]

The Trainline stations database might be very helpful with this.

Traewelling / traewelling

Support for multiple data sources #1635