henare opened this issue 6 years ago
This isn't happening on recent scrapes. Looking at the scraped-at times on this data, my guess is that the scheduled job was set to run every 10 minutes on the 10-minute marks (:00, :10, :20, ...), which means it sometimes hit the moment a reading was being published, missed it, and got a duplicate of the previous reading instead. I know it's now set to every 10 minutes on the 5s (:05, :15, :25, ...), so maybe @equivalentideas can confirm whether this was changed?
I think it could have been changed around March 20 10:00 am local time.
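To make the two offsets concrete, here's a hypothetical sketch of the schedules as whenever-style cron entries. I don't know how the job is actually scheduled on the dyno, and the rake task name here is made up; this is just to illustrate the timing.

```ruby
# config/schedule.rb (hypothetical, whenever gem syntax; task name illustrative)

# Old schedule: on the 10-minute marks. If the site is still publishing a reading
# at that moment, we re-read the previous value and the fresh one is missed.
every '0,10,20,30,40,50 * * * *' do
  rake 'scraper:run'
end

# Current schedule: on the 5s, giving the site a few minutes to publish each
# reading before we scrape it.
every '5,15,25,35,45,55 * * * *' do
  rake 'scraper:run'
end
```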
Here's an idea of the scale of duplicates:
```
[33] pry(main)> AqmRecord.where(location_name: 'Concord Oval AQM').count - AqmRecord.where(location_name: 'Concord Oval AQM').distinct(:latest_reading_recorded_at).count
=> 542
[34] pry(main)>
```
These are a problem because they'll affect the averaging (not by a huge amount but still worth sorting out).
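For the record, a quick way to see which reading timestamps have been scraped more than once for a station is to group on the reading time in the console. This is just a sketch using the same model and column names as the count above:

```ruby
# Count rows per reading timestamp and keep only the duplicated ones
AqmRecord.where(location_name: 'Concord Oval AQM')
         .group(:latest_reading_recorded_at)
         .having('COUNT(*) > 1')
         .count
# => { reading_recorded_at => number_of_rows, ... }
```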
> I know it's now set to every 10 minutes on the 5s (:05, :15, :25, ...), so maybe @equivalentideas can confirm whether this was changed?
My recollection is that it was originally a few minutes past the hour, but I do remember noticing that there were only five distinct recordings an hour.
I think March 20 might have been when I upgraded the database, which required switching the job off for a few minutes. That would explain the shift.
Does it appear that, consistently once an hour, they were slow to publish their results? Or, as I thought at the time, that they just weren't adding a fresh recording? Hard to know.
Is that helpful @henare?
I just watched what happens with the runs around 50 minutes past, and on the hour.
> Does it appear that, consistently once an hour, they were slow to publish their results?
It looks to me like this is the case.
```
40079,Haberfield Public School AQM,2018-05-01 21:41:30 +1000,2018-05-01 21:30:00 +1000
40085,Haberfield Public School AQM,2018-05-01 21:51:31 +1000,2018-05-01 21:40:00 +1000
40091,Haberfield Public School AQM,2018-05-01 22:03:28 +1000,2018-05-01 12:00:00 +1000
```
I checked http://airodis.ecotech.com.au/westconnex/index.html?site=0&station=0 and they did display a reading for 21:50:00, but it didn't appear until about 21:53 or 21:54.
Here we ran a bit later than usual after the hour (22:03) because I hit edit on the dyno :S but this might actually be a good thing. We're now running at :05, :15, :25, which might give them the time they need to publish that last reading before we scrape? I don't think this will be a problem, so I'm going to leave it as is even though that was an unexpected change.
> We're now running at :05, :15, :25, which might give them the time they need to publish that last reading before we scrape?
Looks like it. This morning we're collecting:
```
40490,Powells Creek AQM,2018-05-01 23:05:53,2018-05-01 13:00:00
40484,Powells Creek AQM,2018-05-01 22:55:56,2018-05-01 12:50:00
40478,Powells Creek AQM,2018-05-01 22:45:25,2018-05-01 12:40:00
40472,Powells Creek AQM,2018-05-01 22:35:37,2018-05-01 12:30:00
40466,Powells Creek AQM,2018-05-01 22:25:29,2018-05-01 12:20:00
40460,Powells Creek AQM,2018-05-01 22:15:46,2018-05-01 12:10:00
40454,Powells Creek AQM,2018-05-01 22:05:30,2018-05-01 12:00:00
40448,Powells Creek AQM,2018-05-01 21:55:29,2018-05-01 11:50:00
40442,Powells Creek AQM,2018-05-01 21:45:55,2018-05-01 11:40:00
40436,Powells Creek AQM,2018-05-01 21:35:26,2018-05-01 11:30:00
40430,Powells Creek AQM,2018-05-01 21:25:38,2018-05-01 11:20:00
40424,Powells Creek AQM,2018-05-01 21:15:33,2018-05-01 11:10:00
40418,Powells Creek AQM,2018-05-01 21:05:54,2018-05-01 11:00:00
40412,Powells Creek AQM,2018-05-01 20:55:36,2018-05-01 10:50:00
40406,Powells Creek AQM,2018-05-01 20:45:45,2018-05-01 10:40:00
40400,Powells Creek AQM,2018-05-01 20:35:27,2018-05-01 10:30:00
40394,Powells Creek AQM,2018-05-01 20:25:41,2018-05-01 10:20:00
40388,Powells Creek AQM,2018-05-01 20:15:51,2018-05-01 10:10:00
40382,Powells Creek AQM,2018-05-01 20:05:43,2018-05-01 10:00:00
```
I've been thinking about this lately and I've got a couple of proposals to deal with both of the problems this issue describes. I'm not certain they're the right way to go, so I'm putting them here for discussion first.
For duplicate readings, one option would be to simply not save them, i.e. do something like Rails' #find_or_create_by on all of the measurement values and latest_reading_recorded_at. It's not really useful for us to keep recording the same reading over and over, is it?
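Something like this is what I have in mind. It's a rough sketch only: the attribute hash and the measurement column names are made up, not the scraper's actual code.

```ruby
# Instead of AqmRecord.create(attrs), look for an identical reading first.
# find_or_create_by only creates a new row when no record matches all of
# these values, so a re-scraped reading is silently skipped.
AqmRecord.find_or_create_by(
  location_name: attrs[:location_name],
  latest_reading_recorded_at: attrs[:latest_reading_recorded_at],
  # ...plus every measurement value, e.g. (illustrative column names):
  pm10: attrs[:pm10],
  pm2_5: attrs[:pm2_5]
)
```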
For missing readings we could increase the scraping frequency to every 5 minutes. This should mean we never miss a reading, but it would require some handling of duplicates (such as the proposal above). It's even more feasible now that we're scraping JSON files instead of using PhantomJS, as the scraper is now almost 10 times quicker (less than 2 seconds versus about 12 seconds before).
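If we did go to every 5 minutes, the hypothetical schedule sketch earlier in this thread would collapse to a single entry (again whenever-style syntax with a made-up task name):

```ruby
# Scrape every 5 minutes; duplicates would be handled by find_or_create_by above
every 5.minutes do
  rake 'scraper:run'
end
```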
In answering #24 I noticed that each hour has a duplicate scraped reading on the hour and is missing the reading at 50 minutes past the hour.