ODM2 / ODM2DataSharingPortal

A Python-Django web application enabling users to upload, share, and display data from their environmental monitoring sites via the app's ODM2 database. Data can either be automatically streamed from Internet of Things (IoT) devices, manually uploaded via CSV files, or manually entered into forms.
BSD 3-Clause "New" or "Revised" License

Lost MMW readings - system wide? #685

Closed neilh10 closed 2 weeks ago

neilh10 commented 9 months ago

While looking at some readings from MMW, I noticed a time gap in the readings. It appears that approximately 5 days' worth of data is missing across a number of devices.

For https://monitormywatershed.org/sites/SpokaneR-SpokaneValley/ I saw a gap between 2023-09-26 8:45AM PST and 2023-09-30 16:15. Investigating a further 6 nodes, it seems they had all lost data during this period:

TUCA_PO03 PST 9/26 7:15am to 9/30 16:00 missing 418
TUCA_Sa01 PST 9/25 23:30 9/30 16:15 missing 450 ~ https://monitormywatershed.org/sites/TUCA_Sa01/
TUCA_GV08 9/26 5:30am to 9/30 16:15 missing 418 (3487-3069)
TUCA_MW01 PST 9/26 8:15am to 9/30 16:15
TUCA_MW12 PST 9/26 8:15am 9/30 16:15
nh_LCC45 Missing 9/26 7am to 9/30 16:15 439
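
The "missing" counts above can be cross-checked from the gap boundaries alone. The following is a minimal sketch, assuming the loggers report on a fixed 15-minute interval (an assumption, not stated in the thread); `missing_between` and `find_gaps` are hypothetical helper names. Under that assumption, the boundaries listed for TUCA_PO03 and TUCA_Sa01 give 418 and 450 missing readings, matching the counts reported for those two nodes.

```python
from datetime import datetime, timedelta

# Assumption: loggers report on a fixed 15-minute interval.
INTERVAL = timedelta(minutes=15)


def missing_between(last_before, first_after, interval=INTERVAL):
    """Count the readings expected strictly between two received readings."""
    steps = (first_after - last_before) // interval
    return max(steps - 1, 0)


def find_gaps(timestamps, interval=INTERVAL):
    """Return (last_before, first_after, missing) for every gap in a series."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        missing = missing_between(earlier, later, interval)
        if missing > 0:
            gaps.append((earlier, later, missing))
    return gaps


# Gap boundaries as listed above for TUCA_PO03 and TUCA_Sa01.
po03 = missing_between(datetime(2023, 9, 26, 7, 15), datetime(2023, 9, 30, 16, 0))
sa01 = missing_between(datetime(2023, 9, 25, 23, 30), datetime(2023, 9, 30, 16, 15))
print(po03, sa01)
```

`find_gaps` applies the same count to a full list of reading timestamps, for example one parsed from a downloaded CSV, and returns only the intervals where at least one reading is absent.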

I have some field system logs for some of the devices that I can check for a verified 201 handshake if that would be useful.

A visual of the lost data is described here: https://www.envirodiy.org/topic/monitoring-power-consumption/#post-18197

Might be related to #605

SRGDamia1 commented 8 months ago

I'm also seeing a gap there for my sites in the online data. But there's no gap in my local record of data that I scrape from the website every few hours. So, the data did make it to the server, was available for download, and was later lost.
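
The comparison described here, a local archive checked against the online record, amounts to a set difference on reading timestamps. A minimal sketch follows; `lost_after_upload` is a hypothetical helper, and the timestamps are made-up illustrative values, not data from any of the sites in this issue.

```python
from datetime import datetime


def lost_after_upload(local_timestamps, online_timestamps):
    """Timestamps present in the local archive but absent from the online record."""
    return sorted(set(local_timestamps) - set(online_timestamps))


# Illustrative values only: three readings archived locally, one still online.
local_archive = [
    datetime(2023, 9, 26, 8, 30),
    datetime(2023, 9, 26, 8, 45),
    datetime(2023, 9, 26, 9, 0),
]
online_record = [datetime(2023, 9, 26, 8, 30)]

lost = lost_after_upload(local_archive, online_record)
print(len(lost), "readings were downloadable earlier but are no longer online")
```

Any timestamp returned by this check was accepted by the server at some point, which is what separates data lost after arrival from data that never arrived.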

@neilh10 Your loggers were likely getting 201s, so there's no reason to dig through their logs.

@ptomasula can you please look into this? I suspect the gap must be caused by a database migration related to the StreamWatch Schools update or the hotfix on 9/26.

SRGDamia1 commented 8 months ago

I do not think this is related to #605.

neilh10 commented 8 months ago

@SRGDamia1 thanks for the insights. It looks like we both have ways of verifying the life cycle of the data. Just an FYI to whoever looks into it: I have had some devices (NA13 and GV01) with connectivity issues that were delayed in delivering their data and are now delivering it, so there are likely to be some devices with data in that window.

I mentioned https://github.com/ODM2/ODM2DataSharingPortal/issues/605 as it seemed there was never any investigation done into it, but maybe it's just too old to consider now.

aufdenkampe commented 7 months ago

@ptomasula and I will look into this. It's definitely related to our database migration process.

Please note that in every one of our releases in the last few years, we've been trying to eliminate legacy code and database issues, and that always requires migrating databases. We've been increasingly automating the process; it used to take us many hours and lately we've done it in 15-30 minutes. In the next release, we hope to reduce that time even further to a couple of minutes. To date, when we do this, we don't have a database available to receive the data, so all data arriving during that time is lost. That will no longer be an issue once we implement Simple Queue Service with #688, sometime this spring.
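
The idea of queueing arrivals while the database is offline can be sketched as below. This is an in-memory stand-in for illustration only, not the Simple Queue Service integration planned in #688; `BufferedWriter`, `write_to_db`, and `db_available` are hypothetical names, and a production queue would need to be durable so that readings survive a restart.

```python
from collections import deque


class BufferedWriter:
    """Hold incoming readings while the database is unavailable (hypothetical sketch)."""

    def __init__(self, write_to_db):
        self._write_to_db = write_to_db  # callable that stores one reading
        self._pending = deque()
        self.db_available = True

    def receive(self, reading):
        """Accept a reading at any time; store it now or keep it for later."""
        self._pending.append(reading)
        if self.db_available:
            self.flush()

    def flush(self):
        """Write queued readings in arrival order."""
        while self._pending:
            # Write first, remove second: a failed write leaves the reading queued.
            self._write_to_db(self._pending[0])
            self._pending.popleft()


stored = []
writer = BufferedWriter(stored.append)
writer.receive("reading-1")

writer.db_available = False      # migration starts
writer.receive("reading-2")
writer.receive("reading-3")
during_migration = list(stored)  # only reading-1 has been stored so far

writer.db_available = True       # migration finished
writer.flush()
print(stored)
```

With a buffer of this kind in front of the database, readings that arrive during a migration are written once it finishes rather than being dropped.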

With all that said, the data loss in this issue must have resulted from a hotfix we made, which is a separate issue from our normal release workflow. We can look at it a bit closer to figure out if we can restore that data.

neilh10 commented 2 weeks ago

I'm closing this as I believe the reasons for the lost data have been identified, and there is better active monitoring. It probably can't be guaranteed not to happen again; however, there is a lot more discussion about possible failure modes.