CUTR-at-USF / gtfs-realtime-validator

Java-based tool that validates General Transit Feed Specification (GTFS)-realtime feeds. See https://github.com/MobilityData/gtfs-realtime-validator for the latest!

Don't save duplicate PB feed files to local database #92

Closed: barbeau closed this issue 7 years ago

barbeau commented 7 years ago

Summary:

Currently, every time we poll the GTFS-rt feed we do the following:

  1. Retrieve the GTFS-rt PB file and run it through the rule validation
  2. Hash it and compare it to the hash of the most recent GTFS-rt PB file in our local database for this feed URL
  3. If the hashes are different, we set a boolean variable indicating that the new feed is unique (i.e., it's the first time we've seen that instance of the feed)
  4. Store the new feed PB file in our database

This behavior was implemented in https://github.com/CUTR-at-USF/gtfs-realtime-validator/pull/77, and is used to update the "HTTP requests" (count of all records) and "Unique responses" (count of all records that have the boolean unique field set to true) fields at the top of the web UI. This allows us to easily get an idea of how frequently the GTFS-rt feed is producing new data.
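For illustration, the duplicate check described above amounts to something like the following minimal sketch. The class, field, and method names are hypothetical (not the validator's actual code), and the choice of MD5 as the hash algorithm is just an assumption:

```java
import java.security.MessageDigest;
import java.util.Arrays;

public class FeedDeduplication {

    // Hash of the most recent PB file seen for this feed URL
    private byte[] lastHash;

    /**
     * Returns true if the newly retrieved PB bytes differ from the previous response,
     * i.e., this is the first time we've seen this instance of the feed.
     */
    public boolean isUniqueResponse(byte[] pbBytes) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] newHash = md.digest(pbBytes);
        boolean unique = lastHash == null || !Arrays.equals(lastHash, newHash);
        lastHash = newHash;
        return unique;
    }
}
```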

@mohangandhiGH Please correct me if I got any of the above wrong.

Thinking more about this, though, I don't know if there is really a need to store every single PB file that we retrieve from the GTFS-rt server in the database. Many of these PB files are exact copies of the previous PB file that's been stored in the database, and off the top of my head I can't think of a use case where we'd need the exact duplicate binary file.

An alternate approach could be to only store the PB file if it's unique, and leave that field empty if the PB file is a duplicate of the previously retrieved PB file. In this case, we could also store a hash of the PB file in the database for all records. To calculate the number of HTTP requests we would get a count of all records, and to calculate the number of unique responses we would count the number of records where the PB file field is null. We could also count the number of duplicate records for any given PB file by counting the number of records where the hash field is equal to the hash of the PB file.
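As a rough sketch of that write path, each poll would always store the hash but would only store the PB bytes when the response is unique. The record and field names below are assumptions for illustration, not the validator's actual schema:

```java
/**
 * Hypothetical record persisted once per HTTP request to the GTFS-rt feed.
 * Sketch only -- field names are assumptions, not the validator's schema.
 */
public class FeedIteration {
    long feedUrlId;
    long timestamp;
    byte[] hash;     // stored for every request
    byte[] pbFile;   // stored only when the response is unique, otherwise null

    static FeedIteration fromResponse(long feedUrlId, byte[] pbBytes, byte[] hash, boolean unique) {
        FeedIteration record = new FeedIteration();
        record.feedUrlId = feedUrlId;
        record.timestamp = System.currentTimeMillis();
        record.hash = hash;
        record.pbFile = unique ? pbBytes : null;  // don't persist duplicate binaries
        return record;
    }
}
```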

@mohangandhiGH What are your thoughts on this? Could you please take a look at how much space these PB files take in the DB and run some numbers on how much storage space we would save if we only store unique PB files?

Steps to reproduce:

Start monitoring a feed

Expected behavior:

Only persist new PB files to the database (maybe? Let's look at some numbers first...)

Observed behavior:

Every PB file that is retrieved from a server is stored to the database, even if it's a duplicate of the last PB file that was retrieved from the GTFS-rt feed.

mohangandhiGH commented 7 years ago

@barbeau Yes. We can definitely go with this idea. This would save a lot of storage space. Considering an application run on MBTA data, where each PB file is 680 KB, we could easily save about 1.6 GB of data if the application ran for about 3000 iterations containing around 2000 duplicates.

A small correction: to calculate the number of unique responses, we would count the number of records where the PB file field is not null. We could also count the number of duplicate records for any given PB file by counting the number of records where the PB file field is null (matching on the hash field).
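With that correction, the counts could be derived with queries along these lines. This is only an illustrative sketch; the table and column names (feed_iteration, pb_file, hash, feed_url_id) are assumptions, not the validator's actual schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FeedCounts {
    // Total HTTP requests = all stored records for this feed URL
    static final String TOTAL_REQUESTS =
            "SELECT COUNT(*) FROM feed_iteration WHERE feed_url_id = ?";

    // Unique responses = records where the PB file was actually stored
    static final String UNIQUE_RESPONSES =
            "SELECT COUNT(*) FROM feed_iteration WHERE feed_url_id = ? AND pb_file IS NOT NULL";

    // Duplicates of a given PB file = records with the same hash but no stored PB bytes
    static final String DUPLICATES_OF_FEED =
            "SELECT COUNT(*) FROM feed_iteration WHERE feed_url_id = ? AND hash = ? AND pb_file IS NULL";

    static long count(Connection conn, String sql, Object... params) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < params.length; i++) {
                ps.setObject(i + 1, params[i]);
            }
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}
```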

barbeau commented 7 years ago

@mohangandhiGH ok, great! Please go ahead and tackle this then! :)