ODM2 / ODM2DataSharingPortal

A Python-Django web application enabling users to upload, share, and display data from their environmental monitoring sites via the app's ODM2 database. Data can be automatically streamed from Internet of Things (IoT) devices, manually uploaded via CSV files, or manually entered into forms.
BSD 3-Clause "New" or "Revised" License

Reliable Delivery model algorithm #485

Open neilh10 opened 4 years ago

neilh10 commented 4 years ago

I'm just wondering what the "reliable delivery" algorithm parameters could be from the server interface and the client interface. So putting this out for discussion.

For software systems with a core communication element, communications reliability is a system-level characterization. It is often a target, and is best quantified and agreed upon, with the system characterized over time and across repeated software releases.

For a wireless network, which is inherently unreliable due to the complex nature of wireless signals, of the wireless footprint, and of the environmental conditions that affect wireless connectivity, a "reliable delivery" algorithm is key to confidence in the value of a remote device.

A "reliable delivery" algorithm is that set of parameters that a client Mayfly should execute and responses, for the successfully guaranteed delivery of that data over a network to a level of specified reliability. The reliability testing could be high enough to make it meaningful, and low enough to be a reasonable test case. Reliability is typically defined as the message that could be lost and still meet that bar. 99% one message in 100, 99.9% one message in 1000 .... To be able to say you have met a standard of reliability (eg 2 9's or 99%), twice the number of messages need to be transmitted, with only 1 lost.

As a straw poll, for the client Mayfly I would suggest the following: on a client Mayfly POSTing, if a response isn't received within 2 seconds, the client should consider the gateway to be in timeout. This is generally in line with the characterization data I have seen to date. The timeout directly affects the power draw, so there is a tradeoff: if the wireless network is unavailable, there is a benefit from shorter timeouts until it comes back. For characterization purposes a client may set a gateway timeout of 5 seconds to determine what range of response the specific network gives it. The gateway timing starts after the data has been transmitted to the modem, which is currently a slow 9600-baud link, while the network itself is typically much faster (100 megabaud).

The reliable delivery algorithm is such that if a POST returns a SUCCESS (HTTP STATUS 201 CREATED) the reading is considered delivered. If any other response is received, including none, the POST will be repeated at a later time until a SUCCESS is received. A client shall not POST more than 60 messages on any single connection attempt in any one-hour period, and on any one connection shall not POST more than 10 messages after the first unsuccessful indication. The reliability target for releases of client and server software should be 99%.
Longer-term field reliability can attempt to characterize a system to 99.9%.
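
A minimal sketch of these client-side rules (in Python for readability; the real client is Mayfly firmware). The post_reading() call and its timeout parameter are hypothetical stand-ins for the actual transport:

```python
# Minimal sketch of the proposed client-side rules; post_reading() is a
# hypothetical transport call that returns the HTTP status (or None on timeout).
GATEWAY_TIMEOUT_S = 2     # consider the gateway in timeout after 2 s with no response
MAX_PER_HOUR = 60         # no more than 60 messages on one connection attempt per hour
MAX_AFTER_FAILURE = 10    # no more than 10 POSTs after the first unsuccessful indication

def deliver(queue, post_reading):
    """POST queued readings; anything not answered with 201 stays queued for later."""
    sent_this_hour = 0
    posts_after_failure = 0
    failure_seen = False

    while queue and sent_this_hour < MAX_PER_HOUR:
        if failure_seen and posts_after_failure >= MAX_AFTER_FAILURE:
            break                          # back off; retry on a later connection
        status = post_reading(queue[0], timeout_s=GATEWAY_TIMEOUT_S)
        sent_this_hour += 1
        if status == 201:                  # SUCCESS: considered delivered
            queue.pop(0)
        else:                              # any other response, or none: keep for later
            failure_seen = True
            posts_after_failure += 1
    return queue                           # unsent readings remain queued
```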

For comparison purposes: i) if a Mayfly unit is being characterized in the field on a wireless network of variable reliability, with a sample time of every 15 minutes and an objective of 99.9% reliability, it needs to send 2000 messages, or 21 days ~ three weeks, with only one message being lost. There could be a significant number of retries, and it takes reliable software to be able to characterize a potentially lossy network. ii) if Mayfly software is UNDER TEST on a reliable network, and it generates a message every 2 minutes and uploads every 2 messages (that is, every 4 minutes), it will be able to reach an objective reliability of 99% with 200 messages in 400 minutes, or 6 hours 40 minutes - comfortably an overnight test that can verify the combined software under good network conditions.
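
The arithmetic behind these two cases, as a quick Python check using the numbers above:

```python
# Worked arithmetic for the two cases above (numbers from the text).
# Case i: field unit, 15-minute samples, 99.9% objective -> 2 * 1000 messages
messages_i = 2 * 1000
days_i = messages_i * 15 / (60 * 24)        # ~20.8 days, i.e. about three weeks

# Case ii: bench test, a message every 2 minutes, 99% objective -> 2 * 100 messages
messages_ii = 2 * 100
minutes_ii = messages_ii * 2                # 400 minutes, i.e. 6 h 40 min

print(round(days_i, 1), minutes_ii)
```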

SRGDamia1 commented 4 years ago

I love the idea of having a standard for this, but I've never done the work you've done to test it.

neilh10 commented 4 years ago

Yes, I've got lots of gray hairs from having to meet targets of 99.999% or 5 9's reliability.

Generate 200,000 messages or telephone calls and only have one fail for the test to pass. Ahhh! Sometimes there was a chain of 5 or 6 little Mayflys (or tiny 8051s) for all the messages to pass through - something like I2C to the Modem Controller, then the Modem software itself, the network .......

The reliability was a very visible characterization, and sometimes took some custom test beds and setups to achieve.

What worked, was to be able to easily start the test going at the end of the day, and then the next morning check the numbers easily, and spot any potential issues that might creep in.

I believe it also makes it easier on the server side, as it gets a defined server loading burst that keeps it ahead of actual real data, and makes it possible to see potential problems earlier. Of course for MMW, I would think it's nice to delete the test target periodically and remove the test readings from the database, so as not to clutter it with test data.

I was thinking of proposing a standard test setup for a Mayfly that maybe could go under ModularSensors/tools/tests - that would be easy to build and deploy? For easy free bandwidth, WiFi works for me, using an XBEE WiFi S6, but it could also include other reference radios as needed :)

neilh10 commented 3 years ago

Since there is no comment on this from the server side, I propose any significant delivery errors are flagged.

aufdenkampe commented 3 years ago

@neilh10, I also love this idea. On the server side, we'll unfortunately need to wait for funding. The good news is that we have an excellent proposal pending, so hopefully that starts sooner than later.

neilh10 commented 3 years ago

Great. I guess this is a marker for anybody thinking about it as to what "reliable delivery" looks like from the server side. Possibly the difficulty is defining the test.
I have a version of Reliable Delivery working on the Mayfly side in my private branch, and would be happy to build a test system/build for anybody looking at this: https://github.com/neilh10/ModularSensors/releases/tag/v0.25.0.release1_200812b - and happy to submit it back to enviroDIY https://github.com/EnviroDIY/ModularSensors/issues/194 when there are resources available.

So, a data point: in recent tests last week, when testing a "low battery" condition, I ended up with about 1000 outstanding readings, and was seeing the number of POSTs in one session vary between about 8 and 30 before the server side timed out. They all got delivered successfully, which is really nice (https://github.com/ODM2/ODM2DataSharingPortal/issues/489).

neilh10 commented 2 years ago

An observation - the Mayfly Xbee S6B WiFi became unstable (https://github.com/EnviroDIY/ModularSensors/issues/347) for my testing in Jan 2021 after working exceptionally well earlier. It appears that, generally, the TCP/IP link setup/teardown that had been working before stopped working after the first sleep event. It's complicated to see what is happening on the physical line.

The issue for all "clients" POSTing to ODM2/MonitorMyWatershed is what the model for POSTing is, and then being able to characterize it on the device side. Investigating the WiFi "communications driver" TinyGsmClientXbee, it is interwoven with the other TinyGsm clients. It seems likely that a change elsewhere, either in the TinyGsm code or in the ODM2 timeouts, caused a problem on WiFi.

The Xbee WiFi S6B device has a limited method of setting up and tearing down TCP/IP links. A number of hacks are tried to manage the TCP/IP links to MMW. The problem is that this may beat on the server a lot. The TinyGsmClient model is to have one TCP/IP link set up and torn down per POST and RESPONSE. For multiple messages this may not be the most efficient, and may not make reliable data delivery from all the Mayfly clients scalable. It could be that the beating on the server caused the ODM2 front end to be starved for resources, and that some fine tuning changed the way the TCP/IP links behave.

So this is just a plea, based on hours of debugging, for documenting what the ODM2 model for accessing the server is, with all the internal network-facing timeouts identified.

neilh10 commented 2 years ago

In this update I'm identifying what I understand to be the state of different systems to support a "reliable delivery algorithm".

The target I suggest is that data recorded on ModularSensors/Mayfly for 1 year at 15-minute intervals will be successfully delivered to MMW. One year of data is 96 records * 365 days, or 35,040 records. Statistically, if twice the data is delivered and only one record is dropped, the target has been met. That is, it needs to be tested for delivery of 70,080 messages with one failure (undelivered reading) permitted, or 99.997% reliability.
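
A quick check of that arithmetic:

```python
# Arithmetic behind the one-year target.
records_per_year = 96 * 365                 # 15-minute readings -> 35,040 records
test_messages = 2 * records_per_year        # 70,080 messages, at most one lost
reliability = 1 - 1 / records_per_year      # ~0.99997, i.e. ~99.997%
print(records_per_year, test_messages, round(reliability * 100, 3))
```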

Release v0.12.x is now in production on the AWS servers and live for POSTs to data.enviroDIY.org. Visualization of these readings is through https://monitormywatershed.org/ (MMW).

A sensor node, typically running ModularSensors/Mayfly, takes readings and periodically POSTs these results to the server, and they are then accessed and downloaded through MMW.

All POSTs that are accepted into the database or already in the database receive an HTTP 201. (per resolution in #538)

The term "reliable delivery" is from functionality discussed in these issues https://github.com/EnviroDIY/ModularSensors/issues/201 https://github.com/EnviroDIY/ModularSensors/issues/198 https://github.com/EnviroDIY/ModularSensors/issues/194

The released https://github.com/EnviroDIY/ModularSensors/releases/tag/v0.32.2 is best effort - that is, POST and if everything works well the data should be there. However there are many potential situations where this may not result in data arriving. Communication channels and servers cannot be guaranteed to operate perfectly.

I've been working on a forked version of ModularSensors that includes reliable delivery - I designate this forked version azModularSensors - and initial functionality for reliable delivery was added in https://github.com/neilh10/ModularSensors/releases/tag/v0.25.0.release1_200906

The reliable delivery algorithm is implemented in azModularSensors and, provided the server is up and there is a reasonable transmission medium, the readings are transferred to the server. There can be transmission failures, but provided the server guarantees that a 201 response means the reading has been inserted into the database, or already exists in the database, all readings should be transferred.

This reliable delivery is configured through a local ms_cfg.ini file that is read on startup by azModularSensors. The following reliable delivery parameters can be changed without having to rebuild the firmware load:

[COMMON]
LOGGING_INTERVAL_MINUTES=2 ; aggressive testing, typically every 15 minutes

[NETWORK]
COLLECT_READINGS=5 ; Number of readings to collect before sending (0 to 30)
POST_MAX_NUM=100 ; Max number of queued readings to POST
SEND_OFFSET_MIN=0 ; Minutes to wait after collection complete to send

[PROVIDER_MMW]
CLOUD_ID=data.enviroDIY.org
TIMER_POST_TOUT_MS=7000 ; Gateway Timeout (ms), depending on medium
TIMER_POST_PACE_MS=3000 ; Pace between POSTs (ms)
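
Purely illustrative (the real firmware parses ms_cfg.ini on the Mayfly itself): a Python configparser sketch of how these parameters might be read and how they relate - the connection cadence is the product of COLLECT_READINGS and LOGGING_INTERVAL_MINUTES:

```python
# Illustrative only: the real firmware parses ms_cfg.ini on the Mayfly itself.
import configparser

cfg = configparser.ConfigParser()
cfg.read("ms_cfg.ini")

def get_int(section, key):
    # tolerate inline ';' comments in the values
    return int(cfg.get(section, key).split(";")[0].strip())

interval_min = get_int("COMMON", "LOGGING_INTERVAL_MINUTES")
collect      = get_int("NETWORK", "COLLECT_READINGS")
pace_ms      = get_int("PROVIDER_MMW", "TIMER_POST_PACE_MS")

# Connection attempts happen roughly every COLLECT_READINGS * LOGGING_INTERVAL_MINUTES
print("connection cadence (min):", collect * interval_min, "| pace (ms):", pace_ms)
```
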
aufdenkampe commented 2 years ago

@neilh10, I think it is worth changing your host to monitormywatershed.org, and also removing any references to our old IP address, as it seems that maybe even your newer sites are somehow being affected by maintenance on LimnoTech servers and not fully benefiting from AWS 99.999% uptime.

See my explanation here. https://github.com/ODM2/ODM2DataSharingPortal/issues/542#issuecomment-998074122

aufdenkampe commented 2 years ago

@neilh10, now that we have completed release 0.12 to AWS, we're finally in a position to start working with you closely on your proposed Reliable Delivery Model Algorithm.

That said, there is still a lot of tech debt for us to work on for release 0.13, all of which will substantially help toward our collective goals for reliable data delivery. So we may not be able to fully address your proposal with 0.13, even as we are working with you toward that goal. We very much appreciate your detailed suggestions and error reporting, and appreciate your patience with the amount of time it has taken us to get to a position where we might start working toward your proposal.

neilh10 commented 2 years ago

Based on #543, a reiteration: the intent of the reliable delivery algorithm is to throttle if the server isn't responding, and to follow normal comms-industry practice of making it configurable for each IoT device and potential server.

Partly the purpose of this discussion is so that a (future) large number of ModularSensors/Mayflys, properly configured, will work gracefully with the server as traffic scales.

Happy to have any comments/feedback on what works better for the server side - and absolutely the intent is for individual ModularSensors nodes, through easy statistical configuration, to reduce future costs on the server technology as it is scaled.

Under a "standard" data delivery, with local power savings, my fork of the ModularSensors is configured thru ms_cfg.ini to take a reading every 15 minutes (could be faster), and queues them for delivery every 4 reading period (4x15min=one hour) in a local file RDELAY.TXT.
Then when the defined number of readings has been collected it waits a further 3minutes/SEND_OFFSET_MIN before attempting to send the delayed readings to the server. This spreads the arrival time for the server, thus possibly also saving the Mayfly power. When connected to the server and sending, it POSTs each of the 4 readings irrespective of the http status, based on past experience that the server often records the reading, and allows a user to see the latest successful POST .
The sending algorithm paces each attempt between readings by 3000mS/TIMER_POST_PACE_MS.

For every readings that doesn't receive an HTTP 201, it then writes it to a file QUE0.TXT

For the last reading (RDELAY.TXT), if it receives an HTTP 201, it then checks the QUE0.TXT file for any queued readings. If there are any, it reads them from QUE0.TXT and POSTs it, with the 3000mS/TIMER_POST_PACE_MS delay. It continues reading/POSTing so longer as each message receives a HTTP 201, until the QUE0.txt is empty OR 100/POST_MAX_NUM is reached, in which case it ends the connection, and rewrites QUE0.TXT with unsent messages These are configurable on each Mayfly with an ms_cfg.ini:

[COMMON]
LOGGING_INTERVAL_MINUTES=15 ; Collect a reading every 15 minutes

[NETWORK]
COLLECT_READINGS=4 ; Collect 4 readings and then deliver
SEND_OFFSET_MIN=3 ; Delay after collection complete before sending
POST_MAX_NUM=100 ; Max number of queued readings to POST

[PROVIDER_MMW]
TIMER_POST_TOUT_MS=25000 ; Gateway Timeout (ms)
TIMER_POST_PACE_MS=3000 ; Pace between readings (ms)
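
A minimal sketch of the RDELAY.TXT / QUE0.TXT behaviour described above, in Python for readability (the real implementation is azModularSensors firmware); the one-reading-per-line file format and the post() helper returning an HTTP status are assumptions for illustration:

```python
# Sketch of one send session; pacing and caps use the ms_cfg.ini values above.
import time

POST_MAX_NUM = 100
TIMER_POST_PACE_MS = 3000

def read_lines(path):
    try:
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        return []

def append_line(path, line):
    with open(path, "a") as f:
        f.write(line + "\n")

def write_lines(path, lines):
    with open(path, "w") as f:
        f.writelines(line + "\n" for line in lines)

def send_session(post):
    # 1. POST every delayed reading regardless of status; non-201s go to QUE0.TXT
    last_status = None
    for reading in read_lines("RDELAY.TXT"):
        last_status = post(reading)
        if last_status != 201:
            append_line("QUE0.TXT", reading)
        time.sleep(TIMER_POST_PACE_MS / 1000)

    # 2. If the last delayed reading got a 201, drain QUE0.TXT while 201s continue
    if last_status == 201:
        queued = read_lines("QUE0.TXT")
        sent = 0
        while queued and sent < POST_MAX_NUM:
            if post(queued[0]) != 201:
                break                        # end the connection; keep the rest queued
            queued.pop(0)
            sent += 1
            time.sleep(TIMER_POST_PACE_MS / 1000)
        write_lines("QUE0.TXT", queued)      # rewrite the queue with unsent readings
```
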
neilh10 commented 1 year ago

Update on https://github.com/EnviroDIY/ModularSensors/issues/194

So these notes I have gleaned are for any recipes that anybody is stewing on. Any opinions are only mine, no QA implied.

So it seems there has been monitoring of the server for the last 9 months.

Processing of the POSTs is serialized and takes time - essentially the validation of the message UUIDs. There is no advantage to multiple POSTs on the same TCP/IP link from ModularSensors/Mayfly. With the batched queue algorithm the ModularSensors code does POSTs on the same connection to MMW, but processes each POST the same old way, which I think establishes the TCP/IP link, POSTs, then tears it down. It seems unlikely that tinkering with the algorithm would provide any benefit to the server.

POST time offsets: suggestion between 3 min and 8 min (this seems to assume a 15-minute window). POST offsets seem to offer the best method of distributing the loading across the hour, though obviously the Mayflys need to do a statistical distribution. Possible further work on the Mayfly devices.
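
As an illustration only of such a statistical distribution, each node could pick a stable pseudo-random offset in the suggested 3-8 minute window, e.g. seeded from its serial number (the seeding scheme here is just an assumption):

```python
# Illustration: a per-node send offset drawn from the suggested 3-8 minute window,
# seeded from the device serial number so each node keeps a stable, distinct offset.
import random

def send_offset_minutes(serial_number, lo=3.0, hi=8.0):
    rng = random.Random(serial_number)       # deterministic per device
    return lo + rng.random() * (hi - lo)

print(round(send_offset_minutes("MAYFLY-1234"), 1))   # a stable value between 3.0 and 8.0
```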

Pacing between POSTs on the same connection: not likely to save much time - depends on server loading (see serialization). IMHO it needs characterizing between 100 ms and 500 ms. I have tried 1 s and 2 s to try to be friendly, however saw no difference.

POST timeout: define it in the device settings, based on characterizing the server's response. The server-internal setting is currently 5 minutes, to provide an internal optimal guarantee of insertion into the database. Not possible to set sensibly for the device.

It seems plans for improved response times to device POSTs are only likely to be realized with AWS Simple Queue Service (but maybe MQTT could be a better long-term solution).

Max number of POSTs on any one connection: any number greater than 0 has the same effect and is serialized, so it needs some characterization. It does of course impact power.

One other parameter not figured out, and needing feedback, is the timedb "compression" algorithm. A component is that "older" readings can be compressed: https://www.timescale.com/features/compression

The Mayfly of course uses less power if it connects to MMW less often - the connection interval is the product of the sample time (e.g. 15 min) and the number of samples to collect (e.g. 4), for a connection attempt every 60 min.

However for some edge cases, where readings have failed on the first POST or conditions delayed the connection (e.g. #658 & #661), the timedb may have compressed the endpoint data. When new data is POSTed it has to be uncompressed, which of course could take a long time, and then the Mayfly times out.

The algorithm will of course try in the next time window - which could be 1 hour or more - but will the timedb compression algorithms have kicked in? (Can ChatGPT arbitrate this condition - oops, still need people to do that.)

Does this describe what was seen with this field system?

[attached screenshot of the field system's readings]

So, for an open source project, the integration testing I've been doing provides an effective characterization source of published data ~ latest at https://github.com/ODM2/ODM2DataSharingPortal/issues/661

I've gleaned that there is some serious noodling on how data flows on MMW, and thanks so much for the detailed effort to improve response times.

Whew! Note to self: well done for proposing this sensible network-layer "Orderly Data delivery" in 2018 when reviewing this wonderful effort of ModularSensors open source code. The value of checking out the architecture of open source code.
Of course I was a bit of a laggard, taking two years to flesh out the impacts on the server with this issue #485, Aug 7, 2020.

neilh10 commented 7 months ago

Refreshing on the issue of the effects of database compression.

For a normal type of field issue, where a field system gets behind in sending readings to the cloud and may not have a person do a field visit for some months: when it starts POSTing again, the data may not be received, depending on the algorithm that is utilized for the ODM2 database compression. Described here: https://github.com/ODM2/ODM2DataSharingPortal/issues/665

aufdenkampe commented 7 months ago

Answered here: https://github.com/ODM2/ODM2DataSharingPortal/issues/665#issuecomment-1930324572

neilh10 commented 7 months ago

@aufdenkampe many thanks for the quick response.

As this is an open source discussion, this thread has been a way of phrasing "edge conditions" - conditions that aren't thought about in the beginning and then have consequences later, when a software hero has to be found to decode and solve them.

As an engineer, I'm just saying that part of the discipline taught to students is to identify likely edge conditions ("Requirements") and then also what needs to be done to test for them.

Restating with data from #665, the architectural challenge for the server with field systems that get behind and then start posting to the server : "The effect of a post of data into a compressed time chunk is that the server needs to use substantially more resources to insert the data, because the entire chunk needs to get decompressed, appended, then compressed again. "

The architectural solution on the server: "#688 spreads out the work load. We don't mind our server doing the work. The issue is just getting too many posts at once leads to the server getting overloaded in the minute when they all arrive. We have a ton of time where the server CPU is idle. The point of SQS is spread the work into that idle time."

This has an uncalibrated assumption built into it.

aufdenkampe commented 7 months ago

@neilh10, SQS, API Gateway, and their equivalents are current best practice in Cloud DevOps. There are no untested assumptions that it will provide substantial benefits.

We understand software architecture and test driven development and are doing an excellent job refactoring and improving the legacy code stack we received 3 years ago while enabling the system to grow from receiving 6 million data points per month to the current 16 million per month.

neilh10 commented 1 month ago

It seems to me, with all the work that has gone on in characterizing the servers, and in line with standard communications stack architecture, that the API response for a successful ACK is a response code of 2XX, as per https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

For all other response codes, in line with standard communications architecture, the devices POSTing need to retry until they receive a successful ACK.
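
A tiny sketch of that acceptance rule - any 2XX counts as a successful ACK; anything else, or no response, means queue the reading and retry later:

```python
# Acceptance rule: any 2XX is a successful ACK; otherwise the device retries later.
def is_acked(status_code):
    return status_code is not None and 200 <= status_code < 300

assert is_acked(201)        # 201 Created -> delivered
assert not is_acked(504)    # 504 Gateway Timeout -> queue and retry
assert not is_acked(None)   # no response -> queue and retry
```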

I hope that, after the amazing work that has happened over four years on understanding the code, this won't cause any unexpected problems.

If there are any suspicions of potential turds to slip on - be great to hear about them.