Closed nukeador closed 1 month ago
Maybe a good approach to avoid the database to unnecessarily grow ad infinitum is to use the trip update timestamp to decide when to free space in the DB.
So you would be able to config:
As far as I see, all existing data is being marked as "stale" each time you run updateGtfsRealtime() which is also not ideal if the data has not changed and it could benefit from merging new RT data with the existing one and just overwrite if the timestamp has changed.
Other strategies I've found to handle this issue by other projects
- Time-based pruning: You typically only store real-time data (vehicle positions, trip updates, etc.) for a limited time window, such as the last few hours or days. Older data is periodically removed from the database to control its growth.
- Schedule-based pruning: You can config to only store real-time data for trips that are currently in progress or upcoming, based on the scheduled service dates and times. Once a trip is completed, its real-time data is removed.
- Differential updates: Instead of storing the full state for every update, you can store only the differences (deltas) between consecutive updates. This reduces the amount of data that needs to be persisted over time.
- Data compaction: You can compact and consolidate real-time data records by merging multiple updates for the same vehicle or trip into a single record, further reducing storage requirements.
Check out this branch: https://github.com/BlinkTagInc/node-gtfs/tree/feature/gtfs-realtime-expiration and specifically this commit https://github.com/BlinkTagInc/node-gtfs/compare/master...feature/gtfs-realtime-expiration
I added a new config option gtfsRealtimeExpirationSeconds
which allows you to control when GTFS-Realtime data gets deleted. This can be used to prevent data from being removed from the database when running the gtfsrealtime-update
command.
If omitted from config, all data is deleted every time the command is run.
Note that the update command uses "REPLACE INTO" so if there is updated data in the GTFS-Realtime the database is updated with this information to avoid duplicates. But if there is not any updated data for that item, the old value should persist in the database.
Try this out and let me know what you think.
Thanks, will definitely test it this week!
@brendannee when testing that branch I'm getting all stop_time_updates entries imported as null and many of these errors on the console:
Downloading GTFS-Realtime from http://212.170.201.204:50080/GTFSRTapi/api/alert
Download successful
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 1 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 3 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 5 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
I tried the following config just now and it worked without errors - and I see imported service_alerts and service_alert_targets in the database. The message you printed above looks like it was having an issue saving service_alert_targets to that table with no alert_id. Can you try again and let me know if you see the same issue or if it went away?
Can you
{
"agencies": [
{
"agency_key": "AUVASA",
"url": "http://212.170.201.204:50080/GTFSRTapi/api/GTFSFile",
"realtimeUrls": [
"http://212.170.201.204:50080/GTFSRTapi/api/tripupdate",
"http://212.170.201.204:50080/GTFSRTapi/api/vehicleposition",
"http://212.170.201.204:50080/GTFSRTapi/api/alert"
]
}
],
"sqlitePath": "/tmp/gtfs"
}
The issue remains the same, I suspect there is a conflict if you already had a sqlite file created by a previous node-gtfs version? I deleted the sqlite and let the new version create one and I got the same issue:
Starting GTFS-Realtime refresh for 4 agencies using SQLite database at lib/gtfs/database.sqlite
Removed expired GTFS-Realtime data
Starting GTFS-Realtime import from 3 urls
Downloading GTFS-Realtime from http://212.170.201.204:50080/GTFSRTapi/api/tripupdate
Importing - 1903 entries imported
Downloading GTFS-Realtime from http://212.170.201.204:50080/GTFSRTapi/api/vehicleposition
Importing - 77 entries imported
Downloading GTFS-Realtime from http://212.170.201.204:50080/GTFSRTapi/api/alert
Download successful
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 1 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 3 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 5 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 7 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 9 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 11 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 13 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 15 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 17 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 19 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 21 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 23 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 25 entries imported
Warning: Import error: NOT NULL constraint failed: service_alert_targets.alert_id
Importing - 27 entries imported
GTFS-Realtime data import complete
Thanks for the details.
I found the issue in the commit and updated the branch https://github.com/BlinkTagInc/node-gtfs/tree/feature/gtfs-realtime-expiration
Can you try out this updated branch and let me know if there are any issues and if it works how you'd expect it to?
If there are any other configuration parameters or things you wish it did, let me know.
Error is no longer present, testing the cache now, I'll report back if I find any issues. Thanks so much!
@brendannee no issues so far with the cache, it seems it respects the config. I've been doing tests with 60 and 90s, which are the max recommended values for caching GTFS realtime data. I also tested 600s and found no issues.
Additionally I'll be doing a test with 24h cache, in order to be able to store a historial log of realtime arrivals from each day.
@brendannee Hi there. Is there additional test that might help you ship this into the main branch? Just let me know, thanks!
I actually merged this in already, so it is live on the main branch in the latest release: https://github.com/BlinkTagInc/node-gtfs/releases/tag/4.13.4 - I just forgot to tag this issue thread to let you know!
Try it out and let me know if there are any issues or ways to improve it.
@brendannee maybe it would be worth mentioning on the readme that if new data arrives for the same key, it will overwrite it and the cache is only for not overwritten keys?
Great point - I updated the readme https://github.com/blinktaginc/node-gtfs?tab=readme-ov-file#gtfsrealtimeexpirationseconds
I suggest that cleaning stale GTFS-Realtime data should be disabled by default and configurable via config.js
The reason is that many agencies provide very unreliable tripupdates, in some updates there is no data for some trips and then on the next one data is present.
This makes node-gtfs report no data for some trips that use to have data a few seconds ago.
Other tools like OpenTripPlanner defaults to not purging expired data.