Closed barbeau closed 7 years ago
@barbeau The iteration numbers are shown
For duplicate iterations we are saving the iteration to GtfsRtFeedIteration
table but not validating and logging errors to MessageLog
table.
So, to see the iteration numbers in order, we need to stop saving duplicate records at all in all the tables.
If we stop saving duplicate records, what all would we lose in terms of functionality? I think just the total HTTP request count, right? Anything else?
I definitely want to keep the number of HTTP requests and a count of duplicate vs. unique records as we'll likely need this to calculate things like frequency of feed updates, so if we stop saving duplicate records we'd need to figure out another way to store this.
If we kept duplicate records, but still wanted to change the Iteration numbers, I believe we'd need to set our own IDs for what we're referring to as Iteration numbers. However, I think this would be as simple as setting GtfsRtFeedIterationModel.IterationId
manually, instead of it being a @GeneratedValue(strategy=GenerationType.IDENTITY)
, right?
Which solution do you think is better - stop storing duplicate records and find another way to count HTTP requests, or setting IterationID
manually?
Sean, I believe it's better to stop storing duplicate records and find another way to count HTTP requests
. We can store the count of HTTP requests for each feed in table GtfsRtFeed
Another question would be, does the iteration number for every new feed should start from 1?
Because currently, for a new feed, the iteration number starts by incrementing the iteration number of last feed ran.
The main problem I see with not storing duplicate records, though, is that we can't determine how frequently a particular feed was being monitored over multiple sessions. So, what if we see 100 records that are 10 minutes apart in the log, and then the next 100 records that are 5 minutes apart in the log - did the user use two different intervals to monitor the first 100 vs. second 100, or did the feed start updating itself more frequently? I think these details would be harder to capture by storing aggregate values, rather than calculating them from all the records in the database (duplicates and uniques). This gets more complicated with multiple users (see below). Do you have more thoughts on this?
Another question would be, does the iteration number for every new feed should start from 1?
Yes. The more I think about this in context of multiple clients (https://github.com/CUTR-at-USF/gtfs-realtime-validator/issues/101), the more I think we should treat each session where the user visits the browser and startings monitoring a feed as unique - in other words, everything starts at 0 again, including number of HTTP requests, unique responses, iteration ID, etc. My original thought process behind sharing feeds over sessions was that it would be more efficient if multiple people were monitoring the same feed, and it gives them an easy way to view old output, but this gets complicated if people are using two different GTFS files but same GTFS-rt URL, etc. I think it would be cleaner just to start over, and maybe we can add a feature to see old session results.
Actually - how are we defining "new feed" here?
Based on foreign key rtFeedID
in GtfsRtFeedIteration
table. rtFeedID
is primary key in GtfsRtFeed
table.
Does a new record get added to the GtfsRtFeed
table each time you enter new feeds in the "Welcome" screen? Or does it re-use an existing record if the same GTFS-rt feed already exists?
It will re-use an existing record if the same GTFS-rt feed already exists
Does GtfsRtFeedModel.startTime
get reset? Or is it only set the first time the feed is added?
Looks like it gets overwritten each time the user starts a monitoring session, in GtfsRtFeed.postGtfsRtFeed()
, via feedInfo.setStartTime(TimeStampHelper.getCurrentTimestamp());
?
We are not updating. If we find a GTFS-rt record that already exists, we are not updating startTime
.
Yah, that is a bug. We need to update startTime
if we already have a record.
We are not updating. If we find a GTFS-rt record that already exists, we are not updating startTime.
Where are we checking if the feed already exists for this, in terms of class and method?
Let's wait on this issue for now - let's discuss more in Friday's meeting. I think we may need to modify our data model a bit, and it will be easier to talk through this in person.
In GtfsRtFeed.postGtfsRtFeed()
we are trying to check whether the record already exists or not by querying here
GtfsRtFeedModel storedFeedInfo = (GtfsRtFeedModel) session.createQuery(" FROM GtfsRtFeedModel WHERE "
+ "gtfsUrl= '"+feedInfo.getGtfsUrl()+"' AND gtfsFeedModel.feedId = "+feedInfo.getGtfsFeedModel().getFeedId()).uniqueResult();
GTFSDB.commitAndCloseSession(session);
if(storedFeedInfo == null) { //If null, create the gtfs-rt feed in the DB and return the feed
session = GTFSDB.InitSessionBeginTrans();
session.save(feedInfo);
GTFSDB.commitAndCloseSession(session);
}
and if storedFeedInfo
is not null, we are not updating startTime
.
Let's wait on this issue for now - let's discuss more in Friday's meeting. I think we may need to modify our data model a bit, and it will be easier to talk through this in person.
Sure Sean. That would be better.
and if storedFeedInfo is not null, we are not updating startTime.
Ah, you're right. I assumed this was getting updated because of Line 68, but you're right, it only updates if it's null. Ok - yes, let's discuss more on Friday.
After discussion, final agreement for now is to create a new sessions table with a session ID and start and end time that is unique per user session, and pulls back a set of records from GtfsRtFeedIterationModel
MessageLogModel
based on GtfsRtFeedIterationModel.timestamp
. When we pull back that set of records, we'll have a "row ID" that will be shown in the UI as the "Iteration ID" - something like:
IterationID, row Id, message Id, error message
234, 4...
233, 3...
232, 2...
231, 1...
Iteration ID would still be the primary key for GtfsRtFeedIterationModel and would be used to pull back the plain text version of the feed when clicking on "Iteration ID" in the UI.
@barbeau In this issue, along with having rowIds
, are we just creating the sessions table so that it can be used in the future enhancements or we also need to add values to that table?
For this issue, let's hold off on the actual sessions table for now. I'll see if I can put together a more complete proposal for users and sessions tomorrow.
For this issue, the important part is that we're only pulling back records from GtfsRtFeedIterationModel
MessageLogModel
where GtfsRtFeedIterationModel.timestamp
is greater than the starting time of when the user starts monitoring a feed. So, the user only sees records from the session they just started, and not records from previous monitoring sessions.
Note that the "record ID" here should be transient - it shouldn't actually be a field in the database. When I say "record ID", I'm referring to pulling a set of records from GtfsRtFeedIterationModel
MessageLogModel
where GtfsRtFeedIterationModel.timestamp
is greater than the session starting time, and the record with the lowest timestamp in that set should show up as iteration "1" in the UI.
Actually, sorry, I think I got the above class name wrong - the "record ID" should apply to the records that populate the Log table - I think that should be MessageLogModel
and not GtfsRtFeedIterationModel
, right? The timestamp used to filter the records would be pulled from GtfsRtFeedIterationModel
, though. Does that make sense?
Also, here's an ERD for the data model as it currently exists (generated from IntelliJ Ultimate Edition and hibernate.cfg.xml
in the "Persistence" window):
Yes Sean. So, we need to pull the records starting from current time when the user started monitoring feed. And the record ID should start from 1 in the Log table without being adding an extra record ID field in the GtfsRtFeedIterationModel
or MessageLogModel
tables.
Sounds good!
Also, just to clarify, if the same error is generated by the same GTFS-realtime iteration, those two error should still have the same ID in the Log UI. So if the first GTFS-realtime iteration of a new session generates two errors, both of those would have ID "1" in the UI.
Yes Sean. We already have that in our existing Log tables. Will do the same with record IDs too. Thank you for mentioning :)
Oh, now we need to change everything displayed in monitoring.html
based on current time stamp i.e., Total HTTP requests, Total Unique responses and requests and responses related to each feed.
Correct!
Summary:
The iteration numbers shown in the "Log" are incorrect. They currently represent the total number of HTTP requests, not unique responses.
For example, in the below screenshot there have been 4 total HTTP requests, but only 2 unique responses. The Iteration column shows two records:
They should be:
Steps to reproduce:
Load any feed and start monitoring it
Expected behavior:
Iteration numbers should only increment on unique responses, not total HTTP requests. I'd expect the Iteration numbers in above screenshot to be:
Observed behavior:
Iteration numbers are being determined by total HTTP requests. Iteration numbers in above screenshot are:
Platform:
Windows 7 Enterprise with jdk1.8.0_73