Open aufdenkampe opened 10 months ago
@aufdenkampe thanks for identifying the component upgrade. I'm curious as to a) what throughput target this might have, and b) whether or not there is likely to be a loss of MMW readings in the upgrade process?
a) Throughput target: the SQS architectural component is driven by current architectural limitations, and while the method would be expected to increase throughput, SQS is, as I understand it, a costly component solving a valid problem. As it scales, will those increased $ costs be accommodated in the current end-user cost model?
For a defined throughput target, the system can be characterized for that target and a technical discussion had.
The benefit of characterizing is that the development process can be simplified. With past MMW rollouts, when I've done some simple testing with limited resources after an upgrade (which the user base has had no visibility on), I've found issues.
From past experience it's probably been "expensive" to bring the resources back to fix the issues.
All software components need testing.
From long experience, a software architectural component can be aspirational on paper and declared good with minimal testing - or fail against unrealistic characterization expectations.
As is also noted the database server has a threshold of processing, and so balancing the two becomes part of the art of characterizing each component and tuning them together.
For purely comparison purposes - Thingsboard using MQTT have a defined 3M msg/s (on a defined hw) - and they are dealing with a more complex publish/subscribe - however no expensive UUID processing - so 3M msgs/s is an amazing number that likely would be hard to reach - but it is their proud claim of what they are achieving (somehow) at a low cost. https://thingsboard.io/docs/mqtt-broker/reference/3m-throughput-single-node-performance-test/
b) Will the upgrade process to MMW always cause some loss of readings? Is there an architectural reason that there is always going to be a loss of readings when rolling out MMW upgrades - whether that is a few hours or 5 days as in the last loss #685 and previous #685. I have a couple of other situations where I have lost readings, and I just haven't chased them down.
Of course if I'm asking unreasonable questions :) - please feel free to ignore this.
I'm not sure implementing SQS right now is the right way to go. As I put forth in various meetings and my PR, I think there's a lot of easy wins in the database code.
Mixing in more AWS services like SQS increases cost, reduces control, and might make development and alternate instances needlessly difficult.
I don't know your internal timetables, but I think the code should be cleaned up and tested before this is added. It's hard to tell how you all can monitor performance or test with a real load but I've done synthetic tests internally to come to this conclusion.
I see that there's a staging site but is it possible to do any development work locally? Is that a strategy your team uses?
I'm also concerned if SQS will require any changes in the devices? Will they access a different endpoint or use a different data format?
Implementing SQS shouldn't require any device-side changes.
@tpwrules and @neilh10, thanks for your thoughts on SQS.
First, as @SRGDamia1 mentioned, the end-user device won't need to be changed at all. The endpoint will be identical and the payload doesn't need to change. The only difference is that devices will get a near-instantaneous 202 Accepted status code from SQS.
Second, we first plan to implement PR #674 Batch upload & other performance enhancements before SQS. That's a critical improvement that makes everything work better.
Therefore, implementing SQS really amounts to an optional additional layer of security (to protect from a massive set of requests bringing our server down) and a potential opportunity to aggregate and bundle single-value requests into the kind of batch requests that PR #674 enables. SQS also gives us additional diagnostic monitoring, logging, and routing options (i.e. if we ever get to the point of having multiple database servers for redundancy, etc.)
In summary, SQS provides additional scaling capabilities for future-proofing.
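To make the "accept now, write later" idea above concrete, here is a minimal sketch using an in-memory queue as a stand-in for SQS. It is not the MMW implementation: the function names (`accept`, `drain_in_batches`), the payload fields, and the batch size are all hypothetical, chosen only to illustrate how single-value posts could be acknowledged with a 202 and later bundled into batch writes.

```python
from collections import deque
from typing import Deque, Dict, Iterator, List

# Hypothetical payload shape, for illustration only.
Message = Dict[str, object]


def accept(queue: Deque[Message], payload: Message) -> int:
    """Enqueue the payload and acknowledge immediately.

    The database is not touched here, so the device gets a fast
    202 Accepted even if the backend is busy or mid-release.
    """
    queue.append(payload)
    return 202


def drain_in_batches(queue: Deque[Message], max_batch: int) -> Iterator[List[Message]]:
    """Yield queued messages in groups of at most max_batch.

    Each yielded list stands in for one batch write to the database.
    """
    while queue:
        size = min(max_batch, len(queue))
        yield [queue.popleft() for _ in range(size)]


# Five single-value posts arrive; the backend later writes them in batches of 2.
pending: Deque[Message] = deque()
for reading in range(5):
    accept(pending, {"sensor": "hypothetical-id", "value": reading})

for batch in drain_in_batches(pending, max_batch=2):
    print(len(batch), "readings written in one batch")
```

In this sketch the queue decouples how fast devices post from how fast the database can write, which is the property the comments above are describing.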
@neilh10 and @tpwrules, to answer your questions about SQS costs, they are negligible. The first million messages per month are free, and then it costs $0.40 per million messages per month. See https://aws.amazon.com/sqs/pricing/
In October we processed 16M messages, so that would add $6.00 to our monthly bill.
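The $6.00 figure follows from the rates quoted above (first million messages free, then $0.40 per million). A small sketch of that arithmetic, with the rates as parameters since pricing may change; the function name is made up for this example:

```python
def sqs_monthly_cost(messages: int,
                     free_messages: int = 1_000_000,
                     usd_per_million: float = 0.40) -> float:
    """Estimate the monthly cost from the rates quoted in this thread."""
    billable = max(0, messages - free_messages)
    return billable / 1_000_000 * usd_per_million


# 16M messages: 15M billable at $0.40 per million.
print(f"${sqs_monthly_cost(16_000_000):.2f}")  # prints "$6.00"
```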
@neilh10, to answer your question B about lost readings during a release, SQS will completely eliminate those, as the posts will always be in the queue until they can get processed. That's another advantage of SQS that I forgot to mention.
Has thought been put in to how SQS will tie in? Unless devices can hit the SQS endpoint directly then I'm not sure this will help. If devices still need to go through the application code that will be a problem.
@tpwrules, yes, the whole point of SQS is that the devices hit it first.
A benefit with our AWS hosting (which is optional and transparent to the user) is that we can assign our "Elastic IP" address to any service on AWS. So switching to SQS will not require us to change our IP address and it won't affect DNS routing at all. We do this all the time when we issue releases. We actually swap servers by reassigning the elastic IP, and the change is instantaneous.
@neilh10 will remember that when we first switched from on-premises hosting to AWS, we did have to change our IP address, and many devices did not automatically reroute their traffic, because they wouldn't refresh their DNS cache until rebooted (which didn't happen for more than a year for many devices). We had to instruct users to visit all their sites to reboot. This problem will never occur again, because our reserved Elastic IP address will never change.
@aufdenkampe thanks for the data points, good to hear the $ is reasonable.
Well I had to do some digging in my code & comparison with (main) - since I'm implementing a reliable delivery mechanism - first suggested in Aug 2020 #485.
I'm mentioning the date, as I have attempted to be proactive with issues that from other projects have been architectural challenges for a distributed system.
The (main) code POSTs, and if it gets a timeout, it generates an internal 504. If it gets a response from the server, it parses it and prints it for any humans monitoring debug.
My reliable delivery implementation adds a layer that records the HTTP status in a local debug.log file - then it checks for a 201 and if it doesn't receive it, it queues for retry. Of course, accessing the debug log (DBGYYMM.txt) can at present only be done with a site visit. Field data has shown that lack of a 201 can be caused by no cell phone link (cell tower not responding), call accepted but can't get a TCP/IP connection, timeout from the POST (which can be the channel or the server), or other error responses from the server.
So as suggested, implementing SQS causes no change in the POST mechanism, but will result in a 202 response.
For my ModularSensors device code, processing a 202 is no big deal in code - except it would have been nice if it could have been visible - https://github.com/ODM2/ODM2DataSharingPortal/issues/485#issuecomment-718793773
From an architectural view, I wonder if there could be other 20x that might represent the server taking ownership of the record. https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
So appreciate the heads up.
If it's possible, I would like to get visibility when it's introduced to a staging server so I can test against it.
You never know what else might show up.
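As a rough illustration of the device-side logic described above (log the status, treat the record as delivered on an accepted status, otherwise queue it for retry), here is a sketch in Python. The real ModularSensors code is not Python and this is not taken from it; `handle_response`, `DELIVERED_STATUSES`, and the record strings are hypothetical. The only change assumed for SQS is that 202 joins 201 in the set of statuses meaning the server has taken ownership of the record.

```python
from collections import deque
from typing import Deque, List, Tuple

# Statuses treated as "server has taken ownership of the record".
# 201 is the current behaviour; 202 is what the thread says SQS will return.
DELIVERED_STATUSES = {201, 202}


def handle_response(status: int,
                    record: str,
                    retry_queue: Deque[str],
                    debug_log: List[Tuple[str, int]]) -> bool:
    """Log the HTTP status, then either accept delivery or queue a retry."""
    debug_log.append((record, status))
    if status in DELIVERED_STATUSES:
        return True
    retry_queue.append(record)  # try again on the next connection
    return False


retries: Deque[str] = deque()
log: List[Tuple[str, int]] = []

handle_response(202, "reading-1", retries, log)  # accepted by the queue
handle_response(504, "reading-2", retries, log)  # internal timeout code
print(list(retries))  # prints "['reading-2']"
```

Keeping the accepted statuses in one set means any other 20x that turns out to signal ownership could be added in a single place.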
It seems I have to plan to upgrade the ModularSensors device software of all the 10 systems I have in the field - 2 of them run over wifi and 8 over LTE . For some systems, on private land, I have to fit in with regular maintenance visits by the team.
Just wondering if there is a rough schedule of how this might be deployed, and can it be made available on a developer instance first. Many thanks for considering how I can test the update early, so I can prepare for an upgrade to my 10 systems.
Our rough timeline is April to May for SQS implementation, but we are first prioritizing the integration of the performance improvements and batch support (#649) that @tpwrules put together.
Regarding testing, yes, we can make this available on our staging.monitormywatershed.org instance (presently offline following our most recent release) prior to this new release. We will also be testing against that same instance. One important note is that we do NOT back sync any data from the staging instance to production. So any data sent to the staging server is wiped out when we actually issue the release.
@ptomasula - many thanks for the rough timeline. I just need to plan around it for the upgrades of my systems.
Absolutely, the staging server is about testing the functionality, and any data generated is throwaway. I'm also assuming that any "crowd sourced testing" against staging doesn't count on the Subscription model.
However, the subscription model for the main server doesn't allow for "crowd stability testing" once it is merged, since only one free subscription is allowed - I guess on the assumption that the user is only testing their code https://shop.monitormywatershed.org/product/subscriptions/
It's well known in software testing that the lack of effective functional regression testing typically means that any bugs slowly dribble out and require heroic efforts to fix. Software's fitness for use is usually calibrated against bugs found in a full regression test. So the challenge becomes planning how many regression tests are performed and the cost of testing.
I've planned large system integration updates in the past, and if it were me I would stage and sequence the performance improvements separately from the batch JSON and the SQS, preferably allowing a soak time for each set of changes. In my experience the "big bang" approach of throwing all the changes in at once is very expensive to manage (lots of bugs). It was a management philosophy of the '80s, and it required a lot of effort to recover system stability, i.e. it failed. It was used by the FBI in their system upgrade, and failed completely. System reliability is a very challenging metric to maintain. I've had a lot of good experience with a team (of paid) developers all coming together to effectively sequence the regression with known functional changes, but it required good visibility of the timelines in advance.
From my point of view as an independent developer and "crowd tester" the three stages are relatively easy to apply a regression load to, if they are visible.
Though as I've said elsewhere, the majority of a cell's range has low signal strength. In urban areas more cells are installed, so overall cellular coverage and signal strength are better. In rural areas cell coverage is less likely. Larger JSON packets over cellular wireless have less chance of delivery at low signal strength, but a better chance of being processed by the server if it is loaded. SQS reduces the chance it is loaded. I've had some great success optimizing a few systems to deal with what we found to be low signal strength at some geographic locations. These are in deeply wooded incised riparian zones, and it appeared that signal strength varies across the seasons. https://www.envirodiy.org/n-ca-mayflys-through-the-winter-storms/ Paper on packet error rate v length: https://onlinelibrary.wiley.com/doi/full/10.1002/dac.4115
Happy to help where I can.
Update: due to budget constraints, we're unfortunately not going to be able to fit this into our upcoming v0.18 release. We're all sad about this, but we'll get there eventually.
It seems to me, with "white box testing" (ie smarter) this is a benefit.
IMHO it seems #649 and this item are two changes that are trying to solve the same architectural problem - better data ingestion, or the inverse, an anomalous software architecture resulting in an overloaded server. https://github.com/ODM2/ODM2DataSharingPortal/issues/649#issuecomment-2386714593
Practically, it seems to me the issue is how to transfer data from a node for a low cost. The POST architecture with massively redundant UUIDs seems to be a super slippery banana skin, with many unintended consequences, from poor wireless transmission performance of large payloads (made worse with #649), to problematic data ingestion (throughput improved with #649), and only works one way.
I wonder if it's time to reconsider a scalable, well-tested method that is an industry standard, MQTT - with bidirectional capability. I did see somewhere that MQTT was considered for ODM2 in the early days. I've never seen a rationale for using the massive overheads of UUIDs.
MQTT would present an interface version/discontinuity - however it would be extensible. I even have an MQTT server running on a Raspberry Pi Zero W.
Just my 2cents in to the wishing pond - please ignore if its a dumb idea.
@neilh10, we love the ideas of creating a lighter weight POST request or implementing MQTT! We put these on the roadmap in 2017 and 2018!
Since then, implementing these approaches has gotten even easier with the development of AWS IoT Core.
Due to the need to support existing deployments, however, we've prioritized strengthening the existing POST implementation and backend infrastructure before adding any new approaches. Unfortunately, due to limited funding, we're moving much slower than we would like through these initial strengthening steps.
Implement AWS Simple Queue Service (SQS) to receive post requests with data from EnviroDIY devices, return a 202 Accepted (or other 200-level) Status code, and then push that data to the backend database as it is ready.
This would have three major benefits to users:
This also opens the door for us to bundle POST requests into more efficient write commands to the database.