equivalentideas / westconnex_M4_East_Air_Quality_Monitoring

WestConnex M4 East - Air Quality Monitoring data

Work out what to do with duplicate and missing readings #41

Open henare opened 6 years ago

henare commented 6 years ago

In answering #24 I noticed that each hour has a duplicate scraped reading on the hour and is missing the reading at 50 minutes past the hour.

henare commented 6 years ago

This isn't happening on recent scrapes. Looking at the scraped-at times on this data, my guess is that the scheduled job was set to run every 10 minutes on the 10s (:00, :10, :20, ...), which meant it sometimes missed the reading that was only just being published at that moment and got a duplicate of the previous one instead. I know it's now set to run every 10 minutes on the 5s (:05, :15, :25, ...), so maybe @equivalentideas can confirm whether this was changed?

I think it could have been changed around March 20 10:00 am local time.
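
For reference, here is a rough sketch of the two schedules as cron-style rules. This is an assumption about how the job is configured (whenever-gem syntax here, with a hypothetical scrape task name); the real setup may be different:

# Assumed old schedule: every 10 minutes on the 10s, i.e. :00, :10, :20, ...
every '0-50/10 * * * *' do
  rake 'scrape'  # hypothetical task name
end

# Assumed current schedule: every 10 minutes on the 5s, i.e. :05, :15, :25, ...
# The 5-minute offset gives the site time to publish the previous 10-minute
# reading before we scrape it.
every '5-55/10 * * * *' do
  rake 'scrape'
end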

henare commented 6 years ago

Here's an idea of the scale of duplicates:

[33] pry(main)> AqmRecord.where(location_name: 'Concord Oval AQM').count - AqmRecord.where(location_name: 'Concord Oval AQM').distinct(:latest_reading_recorded_at).count
=> 542
[34] pry(main)> 

These are a problem because they'll affect the averaging (not by a huge amount but still worth sorting out).
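
For what it's worth, here's a sketch of one way to see exactly which readings are duplicated, assuming the AqmRecord model and column names used above: group on the reading timestamp and keep only the groups with more than one row.

AqmRecord.where(location_name: 'Concord Oval AQM')
         .group(:latest_reading_recorded_at)
         .having('COUNT(*) > 1')
         .count
# Returns a hash of latest_reading_recorded_at => row count, so every key here is a
# reading that was scraped and saved more than once.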

equivalentideas commented 6 years ago

I know it's now set every 10 minutes on the 5 minutes so maybe @equivalentideas can confirm if this was changed?

My memory is that it was originally a few minutes past the hour, but I do remember noticing that there were only five distinct recordings an hour.

I think March 20 might have been when I upgraded the database, which required switching the job off for a few minutes. That would explain the shift.

Does it appear that consistently, once an hour, they were slow to publish their results? Or it could be, as I thought at the time, that they just weren't adding a fresh recording. Hard to know.

Is that helpful @henare?

equivalentideas commented 6 years ago

I just watched what happens with the runs around 50min past, and on the hour.

Does it appear that consistently, once an hour, they were slow to publish their results?

It looks to me like this is the case.

40079,Haberfield Public School AQM,2018-05-01 21:41:30 +1000,2018-05-01 21:30:00 +1000
40085,Haberfield Public School AQM,2018-05-01 21:51:31 +1000,2018-05-01 21:40:00 +1000
40091,Haberfield Public School AQM,2018-05-01 22:03:28 +1000,2018-05-01 12:00:00 +1000

I checked http://airodis.ecotech.com.au/westconnex/index.html?site=0&station=0 and they did display a reading for 21:50:00, but not until about 21:53 or 21:54.

Here we ran a bit later than usual after the hour (22:03) because I hit edit on the dyno :S but this might actually be a good thing. We're now running at :05, :15, :25, which might give them enough time to publish so we catch that last reading? I don't think this will be a problem, so I'm going to leave it for now even though it was an unexpected change.

equivalentideas commented 6 years ago

We're now running at :05, :15, :25, which might give them enough time to publish so we catch that last reading?

Looks like it. This morning we're collecting:

40490,Powells Creek AQM,2018-05-01 23:05:53,2018-05-01 13:00:00
40484,Powells Creek AQM,2018-05-01 22:55:56,2018-05-01 12:50:00
40478,Powells Creek AQM,2018-05-01 22:45:25,2018-05-01 12:40:00
40472,Powells Creek AQM,2018-05-01 22:35:37,2018-05-01 12:30:00
40466,Powells Creek AQM,2018-05-01 22:25:29,2018-05-01 12:20:00
40460,Powells Creek AQM,2018-05-01 22:15:46,2018-05-01 12:10:00
40454,Powells Creek AQM,2018-05-01 22:05:30,2018-05-01 12:00:00
40448,Powells Creek AQM,2018-05-01 21:55:29,2018-05-01 11:50:00
40442,Powells Creek AQM,2018-05-01 21:45:55,2018-05-01 11:40:00
40436,Powells Creek AQM,2018-05-01 21:35:26,2018-05-01 11:30:00
40430,Powells Creek AQM,2018-05-01 21:25:38,2018-05-01 11:20:00
40424,Powells Creek AQM,2018-05-01 21:15:33,2018-05-01 11:10:00
40418,Powells Creek AQM,2018-05-01 21:05:54,2018-05-01 11:00:00
40412,Powells Creek AQM,2018-05-01 20:55:36,2018-05-01 10:50:00
40406,Powells Creek AQM,2018-05-01 20:45:45,2018-05-01 10:40:00
40400,Powells Creek AQM,2018-05-01 20:35:27,2018-05-01 10:30:00
40394,Powells Creek AQM,2018-05-01 20:25:41,2018-05-01 10:20:00
40388,Powells Creek AQM,2018-05-01 20:15:51,2018-05-01 10:10:00
40382,Powells Creek AQM,2018-05-01 20:05:43,2018-05-01 10:00:00
henare commented 6 years ago

I've been thinking about this lately and I've got a proposal to deal with both of the problems this issue describes. I'm not certain these are the right way to go, so I'm putting them here for discussion first.

An option for duplicate readings would be to not save them, i.e. do something like Rails' #find_or_create_by on all of the measurement and latest_reading_recorded_at values. It's not really useful for us to keep recording the same reading over and over, is it?
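
A rough sketch of that idea (the measurement attribute names and the shape of the scraped reading hash are assumptions, not the real schema):

# Only create a row if we haven't already saved this exact reading for this station.
AqmRecord.find_or_create_by(
  location_name: reading[:location_name],
  latest_reading_recorded_at: reading[:recorded_at],  # hypothetical key names
  pm10: reading[:pm10],
  pm2_5: reading[:pm2_5]                              # ...plus the other measurement columns
)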

For missing readings we could increase the scraping frequency to every 5 minutes. This should mean we never miss a reading, but it would require some handling of duplicates (such as the proposal above). It's even more feasible now that we're scraping JSON files instead of using PhantomJS: the scraper is almost 10 times quicker (less than 2 seconds versus about 12 seconds before).
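
Under the same scheduling assumptions as the sketch earlier in this issue, that would just be:

# Run every 5 minutes; the find_or_create_by guard above means the extra runs only
# ever add genuinely new readings rather than duplicates.
every 5.minutes do
  rake 'scrape'  # hypothetical task name
end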