ODM2 / ODM2DataSharingPortal

A Python-Django web application enabling users to upload, share, and display data from their environmental monitoring sites via the app's ODM2 database. Data can either be automatically streamed from Internet of Things (IoT) devices, manually uploaded via CSV files, or manually entered into forms.
BSD 3-Clause "New" or "Revised" License

Characterization POSTs for 2024 Jan-May #724

Closed neilh10 closed 1 month ago

neilh10 commented 2 months ago

I'm doing integration testing from my desk and seeing a high rate of POST timeout failures.

I'm wondering if the MMW characterization data is being released, as has been previously discussed. :) #673

This is a follow on from #667 #673 #661

I pulled the .LOG I keep from my LCC45/WiFi node, and it is showing a large number of POST failures.

As an overview, there are three main areas of failure with ModularSensors/ODM2: a) ModularSensors running on the Mayfly, b) the wireless links, and c) the host, MonitorMyWatershed.org, running ODM2.

a) I have an enhanced reliableDelivery ModularSensors fork that very solidly repeats messages that aren't acknowledged. Very standard communications theory 101.

b) Wireless is inherently unreliable, and depends on a host of unpredictable factors including geography, fog, and wind direction. ModularSensors uses large packets with redundant UUIDs that make it even more unreliable - whoever designed it didn't understand the extra challenges in riparian zones with vegetation. A better architecture for wireless is slimmed-down MQTT.

c) The host system. This would be expected to be "reliable", and this report focuses on its detected inability to process messages, seen as a "504" timeout.
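The resend-until-acknowledged idea in (a) can be sketched roughly as below. This is a generic illustration, not the code in the reliableDelivery fork: the `ReliableQueue` class, the `send` callable, and `max_attempts_per_pass` are hypothetical names, and the only values taken from this thread are the "201" success and "504" failure codes.

```python
from collections import deque
from typing import Callable, Deque, List

# From the thread: a "201" response means the host accepted the POST.
HTTP_CREATED = 201


class ReliableQueue:
    """Hold readings until the host acknowledges them (hypothetical helper)."""

    def __init__(self, send: Callable[[str], int], max_attempts_per_pass: int = 3):
        self._send = send  # assumed to return an HTTP status code
        self._pending: Deque[str] = deque()
        self._max_attempts = max_attempts_per_pass

    def enqueue(self, reading: str) -> None:
        self._pending.append(reading)

    def pending(self) -> int:
        return len(self._pending)

    def flush(self) -> List[str]:
        """Deliver queued readings in order; stop at the first one that fails."""
        delivered: List[str] = []
        while self._pending:
            reading = self._pending[0]
            status = None
            for _ in range(self._max_attempts):
                status = self._send(reading)
                if status == HTTP_CREATED:
                    break
            if status != HTTP_CREATED:
                break  # keep this and all later readings for the next pass
            delivered.append(self._pending.popleft())
        return delivered
```

A reading is only removed from the queue after an acknowledgement, so a run of "504" responses delays data rather than losing it.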

The LCC45 system delivers over WiFi, and here is the graph of HTTP responses (number of POSTs on the left). Since it's WiFi and the communication medium is good, ideally I would expect the number of "504" failures to be 0.

LOGGING_INTERVAL_MINUTES=15
SEND_OFFSET_MIN=2 - POSTs at 2 minutes past the 15-minute interval
TIMER_POST_TOUT_MS=13000 - 13 second timeout

When there is a "201", it is usually returned in under 3 seconds.

image

Full data attached 240528_Lcc45_responseAnalysis.xlsx
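For anyone wanting to reproduce this kind of analysis from their own log, a minimal sketch of the tally might look like the following. It assumes the log has already been parsed into `(http_status, elapsed_ms)` pairs; the `summarize` function and the sample rows are made up for illustration and are not taken from the attached spreadsheet.

```python
from collections import Counter
from typing import Iterable, Tuple

TIMER_POST_TOUT_MS = 13000  # from the thread: 13 second client timeout


def summarize(responses: Iterable[Tuple[int, int]]) -> dict:
    """Count responses by status code and report the share that were not 201."""
    rows = list(responses)
    counts = Counter(status for status, _ in rows)
    total = len(rows)
    ok_times = [ms for status, ms in rows if status == 201]
    return {
        "total": total,
        "counts": dict(counts),
        "failure_rate": (total - counts[201]) / total if total else 0.0,
        "max_ok_ms": max(ok_times) if ok_times else None,
    }


# Illustrative rows only (not real measurements).
sample = [(201, 2500), (504, TIMER_POST_TOUT_MS), (201, 1800), (201, 2900)]
print(summarize(sample))
```

Grouping the same tally by week or month would give the per-period counts that a graph like the one above needs.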

ptomasula commented 1 month ago

@neilh10 Thanks for providing this characterization data! As requested, we have finally pulled together a dashboard from our uptime monitoring service. It has been made available at bit.ly/mmw-uptime

Our monitoring graph seems to track pretty well with this characterization. We see the same bump in performance during March, followed by some rough spikes in April and May. I have marked a couple of notable maintenance activities on our graph below. We are looking forward to getting the rest of the performance improvements implemented to further improve stability.

image

neilh10 commented 1 month ago

Great to see. Thank you very much. Looks like it's a good tool for regression testing, and it enables some quantification of the effect of changes. !! :)

Interesting. For the database, it does show a challenging side of managing "big data": it needs to be monitored.

Is the "DataStream" the response that a virtual end-point/device sees? - that is, from the internet. Setting the top right to raw, there are a lot of 15 second response times.

image

ptomasula commented 1 month ago

Correct, "DataStream" is a virtual device which makes a POST request to the same /api/data-stream endpoint used by the physical devices.
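A virtual device of this kind could be approximated with a small timing probe like the sketch below. This is not the uptime service's implementation: `timed_post` is a hypothetical helper, the request body and headers are left to the caller because the payload format is not described in this thread, and the 13 second default simply mirrors the TIMER_POST_TOUT_MS value quoted earlier.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Tuple


def timed_post(url: str, body: bytes, headers: dict, timeout_s: float = 13.0,
               opener: Callable = urllib.request.urlopen) -> Tuple[int, float]:
    """POST body to url and return (status, elapsed_seconds).

    Status 0 means no HTTP response arrived (timeout or connection error).
    The opener argument exists so the probe can be exercised without a network.
    """
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    start = time.monotonic()
    try:
        with opener(request, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # e.g. a 504 returned by the server or a proxy
    except OSError:
        status = 0  # URLError, socket timeouts and similar failures
    return status, time.monotonic() - start
```

Called on a schedule against the /api/data-stream endpoint with a valid payload, the returned pairs could be logged and compared with the response times seen by physical devices.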