ODM2 / ODM2DataSharingPortal

A Python-Django web application enabling users to upload, share, and display data from their environmental monitoring sites via the app's ODM2 database. Data can either be automatically streamed from Internet of Things (IoT) devices, manually uploaded via CSV files, or manually entered into forms.
BSD 3-Clause "New" or "Revised" License

Characterization of POSTs from 2023 Aug wrt MMW #673

Open neilh10 opened 1 year ago

neilh10 commented 1 year ago

I'm wondering if the MMW characterization data is being released, as has been previously discussed. :)

I'm seeing one site that was doing well suddenly degrade since Aug 11th, and I'm wondering if there are any MMW indications as to what may be happening: https://monitormywatershed.org/sites/TUCA_GV01/

Up to 2023-08-11 10:45:00 PM the hourly POSTs were all up to date. Since 2023-08-11 10:45:00 PM there has been only sporadic data delivery, and it appears to be falling behind: 96 data records a day are being generated, but as of this morning only 86 have been received out of an estimated 1240 readings generated for the period. The site is set to deliver every hour, sending up to 100 messages if any are outstanding; however, I have never seen MMW, even under the best of circumstances, able to accept 100 messages. The site is in a riparian zone by a stream, up a winding channel in the hills, over Verizon wireless. When connected it indicates a good signal strength of -81 dB, but the TCP/IP connection attempt to MMW often fails even after the connection to the cellular network succeeds.

This is similar to what seemed to happen to https://monitormywatershed.org/sites/TUCA_PO03/ in January this year. An annotated download file is attached: TUCA_GV01_TimeSeriesResults230824_0908annotated.xlsx

Many thanks for any insights

ptomasula commented 1 year ago

@neilh10 I can provide the uptime monitoring summary for the last month, though I suspect it will be slightly less helpful in answering questions on API performance. This one is monitoring general site uptime and not the specific API endpoint used to ingest data. We do have a monitor set up for the specific API endpoint, though it looks to have been suspended and not gathering data for the last few weeks (though I just reenabled it). There are no notable latency spikes around the 11th indicated by our monitoring service, but we did have a degraded performance indication on the 14th. We do look to have a sharp improvement in latency around the 8th, but I would need to cross check some other logs to see if we could attribute that to something specific.

image
ptomasula commented 1 year ago

I would attribute the response time decrease on the 8th to a database restart. Looks like we got some resource alert notifications on the 6th and 7th, consistent with the yellow performance warnings on the uptime monitoring and rebooted the database service on the 8th at 10:53 EST.

neilh10 commented 1 year ago

@ptomasula many thanks for the quick response. I much appreciate you sharing the data you collected; the uptime performance suggests that the site is healthy. Oh well on the API performance monitor; good to hear it has been re-enabled, so maybe we'll get to see if it shows anything.

What I'm inferring from the download .csv of TUCA_GV01 is that the POST is hitting its 15-second timeout fairly quickly after it connects. Of the 7 visible connections that succeeded, 6 had all 4 POSTed readings make it to the database; 1 had only 3 POSTs accepted.

Since I started manually monitoring it two days ago, I have taken two download .csv snapshots. In the latest successful connection, 4 POSTs were delivered, the last with a reading of 2023-08-23 1:00:00 AM PST, and in that connection it appears to have also sent catch-up POSTs of 21 readings back to 2023-08-11 2:45:00 PM before it hit an error. The Mayfly device's ModularSensors would have capped sending at 100 readings, but instead it only sent 21.
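As a rough way to quantify the gap described above, a short script can tally received readings per day from a portal download .csv and compare against the 96/day expected at a 15-minute cadence. This is only a sketch: the `DateTime` column name and timestamp format are assumptions and may not match the actual download file's headers.

```python
import csv
from collections import Counter
from datetime import datetime

EXPECTED_PER_DAY = 96  # one reading every 15 minutes

def received_per_day(csv_path, timestamp_col="DateTime",
                     fmt="%Y-%m-%d %H:%M:%S"):
    """Count readings received per calendar day from a download CSV.

    Returns {date: (received, missing)}. Column name and timestamp
    format are assumptions; adjust to the real file.
    """
    counts = Counter()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            ts = datetime.strptime(row[timestamp_col], fmt)
            counts[ts.date()] += 1
    return {day: (n, EXPECTED_PER_DAY - n)
            for day, n in sorted(counts.items())}
```

Running this over a snapshot before and after an outage window makes the "96 generated vs. 86 received" kind of gap directly visible per day.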

The data that is not visible is how often it tries to connect to the cell tower and fails, versus connects to the cell tower and then fails to get a TCP/IP connection.

I can take an MMW download snapshot of the sensor data .csv tomorrow, and if it's not good, possibly drive to the site and extract the DBGyymm.log file.

I had another local test system, turc_test08, running over Verizon until Aug 16, and it has all been healthy: 201s, with a few 504s and 400s. And yes, I can visibly see the response time decrease after the 8th: image

Some other interesting data, though: the 1st POST gets a fast response (2.7 seconds), while subsequent POSTs take 6+ seconds! Also, the connect times start stretching out after the 8th: image

turc_test08_DBG2308_log.xlsx

aufdenkampe commented 1 year ago

@neilh10, do you use Hologram for your cellular SIMs? If so, they have a great dashboard that can help you understand how often each modem connects and how much data is transmitted per connection. I imagine other cellular IoT providers have that info too. It could be an important set of data for you to better understand what is happening for these deployed stations.

neilh10 commented 1 year ago

@aufdenkampe in the geographical area I'm looking at, Verizon largely has coverage, and the VAR dataplans.digikey.com has come through with data plans.
I agree it's valuable to have a good dashboard, and I started out using Hologram, but they still list Verizon as "a custom plan", so I've had to go with an option that supports Verizon. The dataplans.digikey.com dashboard is pretty simple and only estimates data usage: image

I've just checked a number of sites I have, and it doesn't appear that Mi06 has sent a "Site Alert" on not receiving data. Oh well; make what you can work with what's available. I'm going to try to visit the sites to collect their DBG logs.

neilh10 commented 1 year ago

Hi @ptomasula, just wondering if there is any data on the performance. (Sorry to only make contact when there are issues; it would be nice to be able to just step into a link to get a perspective.) TU MW12 was in a Verizon shadow with the initially installed modem, so I changed it to ATT on Tue 12: https://monitormywatershed.org/sites/TUCA_MW12/

While I was monitoring, it started to upload readings perfectly, with 100 uploads getting 201s on the first pass. Since then, however, it appears to have gone downhill. The data point: it had 9021 records from 2023-06-05 9:10:30 AM to 2023-09-07 8:00:00 AM. Last night from 11pm to 7am, connecting every hour at 47 minutes past the hour (8 connections), it only uploaded 32 queued readings; if the connection and server were working perfectly, it could have uploaded 800 readings. Some 8600 readings are queued. It could be the ATT connection, but the RF strength readings are -69 dB, which is very strong.

A test system that I am monitoring, uploading two records at a time over ATT/SIM7080G, is also getting a few 504s. Thanks.

ptomasula commented 1 year ago

@neilh10 Here is the raw API monitoring data from the last 24 hours. There is a notable spike in the rolling response time around 11:20 pm, and another around 4 am, but it then looks to have settled back into our typical pattern. If I aggregate to hourly, I can compare against the last 7 days. The current peak rolling hourly average for today is 16.2 seconds. That's a little higher than it was for periods of time yesterday, but still not the highest it has been in the last 7 days.
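For reference, the hourly aggregation described above can be sketched with the standard library. The (timestamp, response-seconds) pair format is an assumption standing in for whatever the monitoring service exports:

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def hourly_average(samples):
    """Aggregate raw (datetime, response_seconds) samples to hourly means.

    Truncates each timestamp to the top of its hour and averages the
    samples that fall into each hourly bucket.
    """
    buckets = defaultdict(list)
    for ts, rt in samples:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(rt)
    return {hour: mean(vals) for hour, vals in sorted(buckets.items())}
```

The resulting per-hour means are what a "peak rolling hourly average" comparison across days would be computed from.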

Last 24 hours raw response

2023-09-08_monitor_last24_raw

Last 7 days hourly

2023-09-08_monitorapi

I can look into to what degree I can make the monitoring publicly available, but it will take some time to work out how/if I can set that up.

neilh10 commented 1 year ago

Hi @ptomasula many thanks for the overview and reassuring to know the server loading is looking good.

I ran a higher-rate test with the same 9000 queued readings I had taken from TU_MW12, using a new software module over ATT/SIM7080, uploading from my outside test bed with good ATT connections, and it did very well, verifying the module integration and the other layers. It started 2023-09-08 15:51, uploading every 10 minutes, attempting 100 records per pass, sending one every 0.5 seconds, taking some 7 minutes, and coincidentally completed 2023-09-09 15:18. Most of the 100-record uploads completed successfully, so excellent news on pacing. So for reliable communications there are arguably 5 areas that can result in higher failure rates: payload size (invariant in this case), sender retry algorithm errors, the cell-tower connection, the end-to-end communication channel including wireless channel noise errors, and receiving-server algorithm errors.
For TU MW12, which was still slow over this period, it looks like it must be communication channel errors that are slowing it down. It retries every hour, and then terminates on the first non-201 that is generated.
It does make me think that an enhanced sender retry algorithm would now be possible - https://github.com/ODM2/ODM2DataSharingPortal/issues/485
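One possible shape for such an enhanced retry algorithm, sketched here in Python rather than the Mayfly's C++ and using a hypothetical `post_fn` callable, is to back off and retry transient failures (e.g. 504) instead of ending the pass on the first non-201, while dropping readings the server has permanently rejected (e.g. 400):

```python
import time

def drain_queue(post_fn, queue, max_batch=100, max_retries=3, base_delay=1.0):
    """Upload queued readings, retrying transient failures with backoff
    instead of aborting on the first non-201.

    `post_fn(reading)` is a hypothetical callable that POSTs one reading
    and returns its HTTP status code. Returns the number accepted (201).
    """
    sent = 0
    while queue and sent < max_batch:
        reading = queue[0]
        for attempt in range(max_retries):
            status = post_fn(reading)
            if status == 201:               # accepted: dequeue and count
                queue.pop(0)
                sent += 1
                break
            if status == 400:               # permanent rejection: drop it
                queue.pop(0)
                break
            time.sleep(base_delay * 2 ** attempt)  # transient: back off
        else:
            break  # retries exhausted; leave the rest queued for next pass
    return sent
```

This keeps the hourly-pass structure described above but lets a pass survive an isolated 504 instead of abandoning the remaining queued readings.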