BlinkTagInc / node-gtfs

Import GTFS transit data into SQLite and query routes, stops, times, fares and more.
MIT License
435 stars 149 forks source link

Disable clean stale GTFS-Realtime data by default #164

Closed nukeador closed 1 month ago

nukeador commented 3 months ago

I suggest that cleaning stale GTFS-Realtime data should be disabled by default and configurable via config.js

The reason is that many agencies provide very unreliable tripupdates, in some updates there is no data for some trips and then on the next one data is present.

This makes node-gtfs report no data for some trips that use to have data a few seconds ago.

Other tools like OpenTripPlanner defaults to not purging expired data.

If there are any gaps or delays in receiving realtime feed updates, not purging expired data prevents inadvertent data loss that could impact routing results. The outdated data is still better than having no data at all for those time periods

nukeador commented 3 months ago

Maybe a good approach to avoid the database to unnecessarily grow ad infinitum is to use the trip update timestamp to decide when to free space in the DB.

So you would be able to config:

As far as I see, all existing data is being marked as "stale" each time you run updateGtfsRealtime() which is also not ideal if the data has not changed and it could benefit from merging new RT data with the existing one and just overwrite if the timestamp has changed.

nukeador commented 3 months ago

Other strategies I've found to handle this issue by other projects

  • Time-based pruning: You typically only store real-time data (vehicle positions, trip updates, etc.) for a limited time window, such as the last few hours or days. Older data is periodically removed from the database to control its growth.
  • Schedule-based pruning: You can config to only store real-time data for trips that are currently in progress or upcoming, based on the scheduled service dates and times. Once a trip is completed, its real-time data is removed.
  • Differential updates: Instead of storing the full state for every update, you can store only the differences (deltas) between consecutive updates. This reduces the amount of data that needs to be persisted over time.
  • Data compaction: You can compact and consolidate real-time data records by merging multiple updates for the same vehicle or trip into a single record, further reducing storage requirements.
brendannee commented 3 months ago

Check out this branch: https://github.com/BlinkTagInc/node-gtfs/tree/feature/gtfs-realtime-expiration and specifically this commit https://github.com/BlinkTagInc/node-gtfs/compare/master...feature/gtfs-realtime-expiration

I added a new config option gtfsRealtimeExpirationSeconds which allows you to control when GTFS-Realtime data gets deleted. This can be used to prevent data from being removed from the database when running the gtfsrealtime-update command.

If omitted from config, all data is deleted every time the command is run.

Note that the update command uses "REPLACE INTO" so if there is updated data in the GTFS-Realtime the database is updated with this information to avoid duplicates. But if there is not any updated data for that item, the old value should persist in the database.

Try this out and let me know what you think.

nukeador commented 3 months ago

Thanks, will definitely test it this week!

nukeador commented 3 months ago

@brendannee when testing that branch I'm getting all stop_time_updates entries imported as null and many of these errors on the console:

Downloading GTFS-Realtime from http://212.170.201.204:50080/GTFSRTapi/api/alert
Download successful
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 1 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 3 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 5 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
brendannee commented 3 months ago

I tried the following config just now and it worked without errors - and I see imported service_alerts and service_alert_targets in the database. The message you printed above looks like it was having an issue saving service_alert_targets to that table with no alert_id. Can you try again and let me know if you see the same issue or if it went away?

Can you

{
  "agencies": [
    {
      "agency_key": "AUVASA",
      "url": "http://212.170.201.204:50080/GTFSRTapi/api/GTFSFile",
      "realtimeUrls": [
        "http://212.170.201.204:50080/GTFSRTapi/api/tripupdate",
        "http://212.170.201.204:50080/GTFSRTapi/api/vehicleposition",
        "http://212.170.201.204:50080/GTFSRTapi/api/alert"
      ]
    }
  ],
  "sqlitePath": "/tmp/gtfs"
}
nukeador commented 3 months ago

The issue remains the same, I suspect there is a conflict if you already had a sqlite file created by a previous node-gtfs version? I deleted the sqlite and let the new version create one and I got the same issue:

Starting GTFS-Realtime refresh for 4 agencies using SQLite database at lib/gtfs/database.sqlite
Removed expired GTFS-Realtime data
Starting GTFS-Realtime import from 3 urls
Downloading GTFS-Realtime from http://212.170.201.204:50080/GTFSRTapi/api/tripupdate
Importing - 1903 entries imported
Downloading GTFS-Realtime from http://212.170.201.204:50080/GTFSRTapi/api/vehicleposition
Importing - 77 entries imported
Downloading GTFS-Realtime from http://212.170.201.204:50080/GTFSRTapi/api/alert
Download successful
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 1 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 3 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 5 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 7 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 9 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 11 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 13 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 15 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 17 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 19 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 21 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 23 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 25 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 27 entries imported
GTFS-Realtime data import complete
brendannee commented 3 months ago

Thanks for the details.

I found the issue in the commit and updated the branch https://github.com/BlinkTagInc/node-gtfs/tree/feature/gtfs-realtime-expiration

Can you try out this updated branch and let me know if there are any issues and if it works how you'd expect it to?

If there are any other configuration parameters or things you wish it did, let me know.

nukeador commented 3 months ago

Error is no longer present, testing the cache now, I'll report back if I find any issues. Thanks so much!

nukeador commented 2 months ago

@brendannee no issues so far with the cache, it seems it respects the config. I've been doing tests with 60 and 90s, which are the max recommended values for caching GTFS realtime data. I also tested 600s and found no issues.

Additionally I'll be doing a test with 24h cache, in order to be able to store a historial log of realtime arrivals from each day.

nukeador commented 1 month ago

@brendannee Hi there. Is there additional test that might help you ship this into the main branch? Just let me know, thanks!

brendannee commented 1 month ago

I actually merged this in already, so it is live on the main branch in the latest release: https://github.com/BlinkTagInc/node-gtfs/releases/tag/4.13.4 - I just forgot to tag this issue thread to let you know!

Try it out and let me know if there are any issues or ways to improve it.

nukeador commented 1 week ago

@brendannee maybe it would be worth mentioning on the readme that if new data arrives for the same key, it will overwrite it and the cache is only for not overwritten keys?

brendannee commented 1 week ago

Great point - I updated the readme https://github.com/blinktaginc/node-gtfs?tab=readme-ov-file#gtfsrealtimeexpirationseconds