CUTR-at-USF / gtfs-realtime-validator

Java-based tool that validates General Transit Feed Specification (GTFS)-realtime feeds. See https://github.com/MobilityData/gtfs-realtime-validator for the latest!
Other
92 stars 40 forks source link

Feed iteration numbers are incorrect #120

Closed barbeau closed 7 years ago

barbeau commented 7 years ago

Summary:

The iteration numbers shown in the "Log" are incorrect. They currently represent the total number of HTTP requests, not unique responses.

For example, in the below screenshot there have been 4 total HTTP requests, but only 2 unique responses. The Iteration column shows two records:

They should be:

image

Steps to reproduce:

Load any feed and start monitoring it

Expected behavior:

Iteration numbers should only increment on unique responses, not total HTTP requests. I'd expect the Iteration numbers in above screenshot to be:

Observed behavior:

Iteration numbers are being determined by total HTTP requests. Iteration numbers in above screenshot are:

Platform:

Windows 7 Enterprise with jdk1.8.0_73

mohangandhiGH commented 7 years ago

@barbeau The iteration numbers are shown

For duplicate iterations we are saving the iteration to GtfsRtFeedIteration table but not validating and logging errors to MessageLog table.

So, to see the iteration numbers in order, we need to stop saving duplicate records at all in all the tables.

barbeau commented 7 years ago

If we stop saving duplicate records, what all would we lose in terms of functionality? I think just the total HTTP request count, right? Anything else?

I definitely want to keep the number of HTTP requests and a count of duplicate vs. unique records as we'll likely need this to calculate things like frequency of feed updates, so if we stop saving duplicate records we'd need to figure out another way to store this.

If we kept duplicate records, but still wanted to change the Iteration numbers, I believe we'd need to set our own IDs for what we're referring to as Iteration numbers. However, I think this would be as simple as setting GtfsRtFeedIterationModel.IterationId manually, instead of it being a @GeneratedValue(strategy=GenerationType.IDENTITY), right?

Which solution do you think is better - stop storing duplicate records and find another way to count HTTP requests, or setting IterationID manually?

mohangandhiGH commented 7 years ago

Sean, I believe it's better to stop storing duplicate records and find another way to count HTTP requests. We can store the count of HTTP requests for each feed in table GtfsRtFeed

mohangandhiGH commented 7 years ago

Another question would be, does the iteration number for every new feed should start from 1?

Because currently, for a new feed, the iteration number starts by incrementing the iteration number of last feed ran.

barbeau commented 7 years ago

The main problem I see with not storing duplicate records, though, is that we can't determine how frequently a particular feed was being monitored over multiple sessions. So, what if we see 100 records that are 10 minutes apart in the log, and then the next 100 records that are 5 minutes apart in the log - did the user use two different intervals to monitor the first 100 vs. second 100, or did the feed start updating itself more frequently? I think these details would be harder to capture by storing aggregate values, rather than calculating them from all the records in the database (duplicates and uniques). This gets more complicated with multiple users (see below). Do you have more thoughts on this?

Another question would be, does the iteration number for every new feed should start from 1?

Yes. The more I think about this in context of multiple clients (https://github.com/CUTR-at-USF/gtfs-realtime-validator/issues/101), the more I think we should treat each session where the user visits the browser and startings monitoring a feed as unique - in other words, everything starts at 0 again, including number of HTTP requests, unique responses, iteration ID, etc. My original thought process behind sharing feeds over sessions was that it would be more efficient if multiple people were monitoring the same feed, and it gives them an easy way to view old output, but this gets complicated if people are using two different GTFS files but same GTFS-rt URL, etc. I think it would be cleaner just to start over, and maybe we can add a feature to see old session results.

barbeau commented 7 years ago

Actually - how are we defining "new feed" here?

mohangandhiGH commented 7 years ago

Based on foreign key rtFeedID in GtfsRtFeedIteration table. rtFeedID is primary key in GtfsRtFeed table.

barbeau commented 7 years ago

Does a new record get added to the GtfsRtFeed table each time you enter new feeds in the "Welcome" screen? Or does it re-use an existing record if the same GTFS-rt feed already exists?

mohangandhiGH commented 7 years ago

It will re-use an existing record if the same GTFS-rt feed already exists

barbeau commented 7 years ago

Does GtfsRtFeedModel.startTime get reset? Or is it only set the first time the feed is added?

barbeau commented 7 years ago

Looks like it gets overwritten each time the user starts a monitoring session, in GtfsRtFeed.postGtfsRtFeed(), via feedInfo.setStartTime(TimeStampHelper.getCurrentTimestamp());?

mohangandhiGH commented 7 years ago

We are not updating. If we find a GTFS-rt record that already exists, we are not updating startTime.

mohangandhiGH commented 7 years ago

Yah, that is a bug. We need to update startTime if we already have a record.

barbeau commented 7 years ago

We are not updating. If we find a GTFS-rt record that already exists, we are not updating startTime.

Where are we checking if the feed already exists for this, in terms of class and method?

barbeau commented 7 years ago

Let's wait on this issue for now - let's discuss more in Friday's meeting. I think we may need to modify our data model a bit, and it will be easier to talk through this in person.

mohangandhiGH commented 7 years ago

In GtfsRtFeed.postGtfsRtFeed() we are trying to check whether the record already exists or not by querying here

GtfsRtFeedModel storedFeedInfo = (GtfsRtFeedModel) session.createQuery(" FROM GtfsRtFeedModel WHERE "
                + "gtfsUrl= '"+feedInfo.getGtfsUrl()+"' AND gtfsFeedModel.feedId = "+feedInfo.getGtfsFeedModel().getFeedId()).uniqueResult();
        GTFSDB.commitAndCloseSession(session);
        if(storedFeedInfo == null) {    //If null, create the gtfs-rt feed in the DB and return the feed
            session = GTFSDB.InitSessionBeginTrans();
            session.save(feedInfo);
            GTFSDB.commitAndCloseSession(session);
        }

and if storedFeedInfo is not null, we are not updating startTime.

mohangandhiGH commented 7 years ago

Let's wait on this issue for now - let's discuss more in Friday's meeting. I think we may need to modify our data model a bit, and it will be easier to talk through this in person.

Sure Sean. That would be better.

barbeau commented 7 years ago

and if storedFeedInfo is not null, we are not updating startTime.

Ah, you're right. I assumed this was getting updated because of Line 68, but you're right, it only updates if it's null. Ok - yes, let's discuss more on Friday.

barbeau commented 7 years ago

After discussion, final agreement for now is to create a new sessions table with a session ID and start and end time that is unique per user session, and pulls back a set of records from GtfsRtFeedIterationModel MessageLogModel based on GtfsRtFeedIterationModel.timestamp. When we pull back that set of records, we'll have a "row ID" that will be shown in the UI as the "Iteration ID" - something like:

IterationID, row Id, message Id, error message
234, 4...
233, 3...
232, 2...
231, 1...

Iteration ID would still be the primary key for GtfsRtFeedIterationModel and would be used to pull back the plain text version of the feed when clicking on "Iteration ID" in the UI.

mohangandhiGH commented 7 years ago

@barbeau In this issue, along with having rowIds, are we just creating the sessions table so that it can be used in the future enhancements or we also need to add values to that table?

barbeau commented 7 years ago

For this issue, let's hold off on the actual sessions table for now. I'll see if I can put together a more complete proposal for users and sessions tomorrow.

For this issue, the important part is that we're only pulling back records from GtfsRtFeedIterationModel MessageLogModel where GtfsRtFeedIterationModel.timestamp is greater than the starting time of when the user starts monitoring a feed. So, the user only sees records from the session they just started, and not records from previous monitoring sessions.

Note that the "record ID" here should be transient - it shouldn't actually be a field in the database. When I say "record ID", I'm referring to pulling a set of records from GtfsRtFeedIterationModel MessageLogModel where GtfsRtFeedIterationModel.timestamp is greater than the session starting time, and the record with the lowest timestamp in that set should show up as iteration "1" in the UI.

barbeau commented 7 years ago

Actually, sorry, I think I got the above class name wrong - the "record ID" should apply to the records that populate the Log table - I think that should be MessageLogModel and not GtfsRtFeedIterationModel, right? The timestamp used to filter the records would be pulled from GtfsRtFeedIterationModel, though. Does that make sense?

barbeau commented 7 years ago

Also, here's an ERD for the data model as it currently exists (generated from IntelliJ Ultimate Edition and hibernate.cfg.xml in the "Persistence" window):

image

mohangandhiGH commented 7 years ago

Yes Sean. So, we need to pull the records starting from current time when the user started monitoring feed. And the record ID should start from 1 in the Log table without being adding an extra record ID field in the GtfsRtFeedIterationModel or MessageLogModel tables.

barbeau commented 7 years ago

Sounds good!

barbeau commented 7 years ago

Also, just to clarify, if the same error is generated by the same GTFS-realtime iteration, those two error should still have the same ID in the Log UI. So if the first GTFS-realtime iteration of a new session generates two errors, both of those would have ID "1" in the UI.

mohangandhiGH commented 7 years ago

Yes Sean. We already have that in our existing Log tables. Will do the same with record IDs too. Thank you for mentioning :)

mohangandhiGH commented 7 years ago

Oh, now we need to change everything displayed in monitoring.html based on current time stamp i.e., Total HTTP requests, Total Unique responses and requests and responses related to each feed.

barbeau commented 7 years ago

Correct!