ODM2 / ODM2DataSharingPortal

A Python-Django web application enabling users to upload, share, and display data from their environmental monitoring sites via the app's ODM2 database. Data can be automatically streamed from Internet of Things (IoT) devices, manually uploaded via CSV files, or manually entered into forms.
BSD 3-Clause "New" or "Revised" License

Batch upload protocol extension #649

Open tpwrules opened 1 year ago

tpwrules commented 1 year ago

The JSON REST request data upload should be extended to support multiple data points per request and thus reduce data usage by the dataloggers.

This can be accomplished by replacing the string value of the timestamp key and number values of the UUID keys with arrays of each. Some more details:

Example of the old format sending one data point (which is also a valid example of the new format):

{
    "sampling_feature": "f319af6a-3091-4070-b3ad-a606a7fdbed4",
    "timestamp": "2016-12-08T14:45:01-07:00",
    "f8fbf90e-f59d-4736-af66-91fbee455433": 8,
    "52e6d5ce-eca1-4545-9b01-607a487cbfc0": 10
}

Example of the new format, sending two data points at once collected 5 minutes apart:

{
    "sampling_feature": "f319af6a-3091-4070-b3ad-a606a7fdbed4",
    "timestamp": ["2016-12-08T14:45:01-07:00","2016-12-08T14:50:01-07:00"],
    "f8fbf90e-f59d-4736-af66-91fbee455433": [8,3],
    "52e6d5ce-eca1-4545-9b01-607a487cbfc0": [10,9],
}
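
A minimal sketch of how the server could accept both forms (illustrative only, not the portal's actual code; the function name is made up). It treats a scalar timestamp as a one-element batch, so the old format keeps working unchanged:

def normalize_payload(data):
    """Yield one (timestamp, {uuid: value}) mapping per data point,
    accepting either the old scalar format or the new array format."""
    timestamps = data["timestamp"]
    if not isinstance(timestamps, list):
        timestamps = [timestamps]
    # every key other than these two names a result series UUID
    uuid_keys = [k for k in data if k not in ("sampling_feature", "timestamp")]
    for i, ts in enumerate(timestamps):
        point = {}
        for uuid in uuid_keys:
            value = data[uuid]
            point[uuid] = value[i] if isinstance(value, list) else value
        yield ts, point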

The advantages of this extension are ease of implementation on the server, backwards compatibility with existing users, and the dramatic efficiency improvements gained from batching even a few data points. The only possible disadvantage I see is that the server will need to do more work for each request because each request will contain more data (but keep in mind there will be fewer total requests). I did some calculations on using other protocols (e.g. CoAP over UDP) and did not find that they offered significant improvements compared to the proposed change.

I have a not-yet-public prototype of this for the server, including some easy performance improvements, and a working implementation for the EnviroDIY Mayfly 1.1 hardware. We have been testing the hardware for over a month using a conversion proxy to submit data to the official site and have seen really great results. The sooner we get this into the official site, the sooner we can send the code changes for that, and the sooner users can benefit.

Please let me know the timeline for implementing this on your end. If it can be done quickly as a prototype, then efficiency can be improved in the backend later. Otherwise, I can clean up my work and submit a PR, but I don't know how it will interact with existing features.

neilh10 commented 1 year ago

As this was mentioned in #658, I would be intrigued to have this issue explained in terms of the architecture of the server.

Where are the server's inefficiencies and architectural bottlenecks? Is looking up the UUIDs heavy on server resources? How large could the number of batched readings "N" be?

As stated, with the current end-to-end reliability of the server (v0.15.0), I can't see that this has a lot of value to the device.

Based on current characterization of the Mayfly, the reliability of the server is key to improving the predictability of power usage and reducing power consumption on the device (https://github.com/ODM2/ODM2DataSharingPortal/issues/95). It's typical for a software system that this is characterized, and that it is monitored as loading increases.

I believe a key engineering consideration in the device software is how to design for reliability, both in data collection from the sensors and in ensuring delivery to the internet database. The device needs an atomic handshake confirming the data has been received by the server, and the ability to mark that internally.

What effect does it have to just serialize the two requests, with an atomic 201 to indicate both received (or none received), and characterize the server's reliability?

{
    "sampling_feature": "f319af6a-3091-4070-b3ad-a606a7fdbed4",
    "timestamp": "2016-12-08T14:45:01-07:00",
    "f8fbf90e-f59d-4736-af66-91fbee455433": 8,
    "52e6d5ce-eca1-4545-9b01-607a487cbfc0": 10
}

{
    "sampling_feature": "f319af6a-3091-4070-b3ad-a606a7fdbed4",
    "timestamp": "2016-12-08T14:50:01-07:00",
    "f8fbf90e-f59d-4736-af66-91fbee455433": 3,
    "52e6d5ce-eca1-4545-9b01-607a487cbfc0": 9
}

Or, for a comparison of server overheads (sorry for any mangled JSON):

{
    "sampling_feature": "f319af6a-3091-4070-b3ad-a606a7fdbed4",
    "records": [
        {
            "timestamp": "2016-12-08T14:45:01-07:00",
            "f8fbf90e-f59d-4736-af66-91fbee455433": 8.2,
            "52e6d5ce-eca1-4545-9b01-607a487cbfc0": 10.2
        },
        {
            "timestamp": "2016-12-08T14:50:01-07:00",
            "f8fbf90e-f59d-4736-af66-91fbee455433": 3,
            "52e6d5ce-eca1-4545-9b01-607a487cbfc0": 9
        }
    ]
}

tpwrules commented 1 year ago

The main intent of this improvement is to reduce the amount of traffic between the devices and servers. By batching the data, we can amortize TCP overhead, HTTP overhead, and UUID overhead, and achieve a dramatic reduction in data consumption, because one data point is small in terms of bytes relative to these other factors. Reduced data consumption leads to reduced costs for the cell service (we see at least an order-of-magnitude cost reduction) and reduced power consumption for transmission (some 2-3x). A more sophisticated protocol (e.g. the mentioned CoAP over UDP) would not provide much improvement over this (in fact it might not even provide as much if batching is not used) and would complicate implementation, but it could be explored in the future.
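
For a rough, illustrative sense of scale (approximate figures, not measurements from our deployment): the single-point example payload above is on the order of 200 bytes of JSON, while HTTP request headers and TCP/TLS setup typically add several hundred bytes more per request. In a batch of ten points, those fixed per-request costs and the 36-character UUID keys are paid once, and each additional point costs only its timestamp and values, a few tens of bytes, which is where the order-of-magnitude savings come from.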

Regarding atomicity, this proposal does not need truly atomic operations because duplicate data points are safely ignored. The client knows that all points submitted have been inserted if (and only if) the server responds with a 201, and can then drop them from its buffer. Otherwise the client can retry submission at a later time, potentially with a larger or different set of points. If the points have already been submitted but the response did not make it back (which is unlikely), then some time/data are wasted, but nothing is lost.
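
To make the retry rule concrete, here is a sketch in Python (the real loggers run C++ on the Mayfly; the endpoint URL, the "TOKEN" header, and these function names are placeholders, not the portal's confirmed API):

import requests

def build_batch_payload(sampling_feature, points):
    """points: list of (timestamp, {uuid: value}) pairs; returns the batched JSON body."""
    payload = {"sampling_feature": sampling_feature,
               "timestamp": [ts for ts, _ in points]}
    for _, values in points:
        for uuid, value in values.items():
            payload.setdefault(uuid, []).append(value)
    return payload

def flush_buffer(url, token, sampling_feature, points):
    """Try to submit every buffered point in one batch; keep them all unless a 201 comes back."""
    payload = build_batch_payload(sampling_feature, points)
    try:
        response = requests.post(url, json=payload,
                                 headers={"TOKEN": token}, timeout=30)
    except requests.RequestException:
        return points        # network error: keep everything and retry later
    if response.status_code == 201:
        return []            # server confirmed insertion; safe to drop the points
    return points            # anything else: keep and retry (duplicates are ignored server-side)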

Regarding server reliability, I did considerable investigation and made improvements here too, but I have not seen much movement on this issue, so I have not yet spent the effort to clean up my changes and file a PR. I have pushed my prototype changes here for the curious. I can tell you the two variations you have posted would not improve either data consumption or server reliability much, if at all.

The primary bottleneck with the server in its official incarnation is actually inserting data records into the database, due to inefficient use of the ORM and transactions and the subsequent timeouts from the lengthy processing. Improving this is pretty simple and yields several times the speed for a single point. The speed difference increases as batches are used, because more fixed costs can be amortized.
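
To illustrate the kind of ORM change meant here (a sketch only; the model is passed in as a parameter and the field names value_datetime/data_value are stand-ins, not necessarily the portal's actual schema), inserting a whole batch with bulk_create inside one transaction avoids a separate statement and commit per value:

from django.db import transaction

def insert_points(value_model, result, points):
    """Insert many (timestamp, value) pairs for one result series in a single statement.
    value_model is the Django model class for result values."""
    rows = [
        value_model(result=result, value_datetime=ts, data_value=value)
        for ts, value in points
    ]
    with transaction.atomic():
        # one bulk INSERT instead of one save() and commit per value;
        # ignore_conflicts makes re-submitted duplicates harmless
        value_model.objects.bulk_create(rows, ignore_conflicts=True)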

My experiments have reached the point where the bottleneck is the database itself. This can be improved even further, but would require changes to the schema to reduce processing load. However, even at this point, speeds are orders of magnitude higher than before, so the effort might not be worth it.

neilh10 commented 1 year ago

@tpwrules thanks for the insights and for pushing your prototype changes (though the server isn't my area). I hear you that this method reduces the total amount of data posted - I personally find the original extensive usage of UUIDs for data readings a bizarre overhead - oh well, historical momentum. Thanks also for the reference to power savings from reducing transmitted characters.

Thanks for talking about the database's inefficient usage - that has always been a big challenge with IoT device scaling. A deterministic ACK/201 is required for low power usage on the device/Mayfly.

All I see is the responses of the MonitorMyWatershed.org/ODM implementation, and the biggest inefficiency I see there is the large number of timeouts. It's something I've done some characterization on, and I would hope the efficiencies you are talking about can be looked at.

With the current design, as the number of database rows grows, is the insertion time going to remain linear, or is it likely to increase? (I've been following the flat-database discussions aimed at keeping it linear.)

One of my Mayfly devices, using WiFi in a shady location, is showing the challenge of managing power - the power available (proxy battery voltage on 2 x 4.4 Ahr batteries) while it was having timeouts since the Apr 12th MMW upgrade: https://monitormywatershed.org/tsv/TUCA_Sa01/5979/ It's still getting a lot of timeouts from the data queued since Apr 12.

tpwrules commented 1 year ago

Timeouts might be something you can increase on the device side, if you are okay with more power consumption. I don't actually know what the timeout for processing on the official MMW setup is, but requests have the potential to queue indefinitely if the client is patient enough.

Theoretically the insertion time does increase slightly with the number of database rows, but I don't think the activity on MMW is enough for that to really have an effect. I was able to get hundreds of data points per second on my system IIRC.

Even so, I think in the future a different and smaller program connected to a smaller database could return the affirmative response, then migrate the data to the main database at a later time. That's effectively what my system does because of the protocol difference. But I believe the current level of activity is still orders of magnitude away from needing this type of solution. Some simple improvements should massively improve things, and I hope they can be made.
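
Roughly, the "accept fast, migrate later" pattern looks like this (an illustrative sketch using SQLite as the small front-end store; the names are made up and this is not the actual prototype code):

import sqlite3

def accept_post(raw_body, queue_db="ingest_queue.sqlite"):
    """Durably queue the raw JSON payload and answer immediately;
    a separate worker would later migrate queued rows into the main ODM2 database."""
    con = sqlite3.connect(queue_db)
    try:
        con.execute("CREATE TABLE IF NOT EXISTS queue (id INTEGER PRIMARY KEY, body TEXT)")
        con.execute("INSERT INTO queue (body) VALUES (?)", (raw_body,))
        con.commit()
    finally:
        con.close()
    return 201  # the device can drop its buffer as soon as this is acknowledged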

neilh10 commented 1 year ago

Thanks for sharing the details of a potential speedup - fantastic to have this open for discussion. I'm not deeply embedded in the ODM functionality, but I have a perspective from "white box" testing: https://en.wikipedia.org/wiki/White-box_testing

https://github.com/ODM2/ODM2DataSharingPortal/pull/674#issuecomment-1724293020

From a high-level view, it seems to me that an enhancement of this nature on the server would also warrant a corresponding internet-based test. I've spent a lot of time in commercial companies planning and executing test integrations. I haven't seen any objectives for testing or for how to characterize server response, so I'm raising a red flag. I should point out that I've been on the "bleeding edge" of some of the server challenges, and have just reflected what I see using a simple Mayfly device with limited bandwidth, so I do believe I've earned the brownie points to be able to make the following comments. 😊

The critical characteristic for any server-side implementation is the rate of newly arriving POSTs - and whether the server degrades gracefully.

The reality of large software packages with many dependencies is that changes are a learning process. They can result in knock-on effects, and verification testing is set up to characterize the system in a controlled way. (Scrum methodologies are often used for cross-functional teams. To quote https://www.atlassian.com/agile/scrum: "It acknowledges that the team doesn't know everything at the start of a project and will evolve through experience.")

Creating a test server would allow dummy load testing from an internet location, with the rate of new POSTs defined and managed. That allows a better understanding of the internal monitoring. Engineering often uses an impulse test to define the characteristics of a system: define a set of high-speed POSTs close to the limit of the receiving system for a short period of time, and then characterize its response. Alternatively, define the tests so that the rate of POSTs can be increased to find out where it hits the limit. This is a short-term aggressive test that can simplify overall testing, and once defined it can also be used for automated regression testing. Doing it this way also allows the server's implementation to be verified independently of the client device. From an engineering perspective this reduces the change to two separate, simpler changes, which is likely to have a lower overall implementation cost (human hours), as the focus is on one problem at a time.
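
A minimal sketch of such an impulse test (illustrative only; the staging URL, "TOKEN" header, and the source of the payloads are placeholders):

import statistics
import time
import requests

def impulse_test(url, token, payloads):
    """Fire a short burst of POSTs as fast as possible and report latencies and failures."""
    latencies, failures = [], 0
    for payload in payloads:
        start = time.monotonic()
        try:
            response = requests.post(url, json=payload,
                                     headers={"TOKEN": token}, timeout=60)
            if response.status_code != 201:
                failures += 1
        except requests.RequestException:
            failures += 1
        latencies.append(time.monotonic() - start)
    print(f"median {statistics.median(latencies):.2f}s, "
          f"max {max(latencies):.2f}s, failures {failures}/{len(payloads)}")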

It should be noted that testing from the device side, which I have been doing, has resulted in a series of painful discoveries, and has probably been expensive in server support time in fixing them (which I greatly appreciate). This is a well-known software development phenomenon, and the argument for upfront repeatable testing (https://en.wikipedia.org/wiki/Capability_Maturity_Model).

Once the server is in a stable, tested configuration, there is then the benefit of integration testing with the Mayfly device. This should be a simple process, since by then the server will be purring and there should be nothing a simple, bandwidth-limited device can do to the server, other than present unforeseen real-world conditions from the wireless transmission. https://github.com/EnviroDIY/ModularSensors/issues/454

When both changes are tackled together, there is also the wireless system in between. On wireless links with marginal connections, larger packets are more likely to be corrupted or undelivered.

A related observation: since this uses wireless transmission, and wireless is inherently error-prone, as the RSSI goes down and the packet size goes up, the packet is less likely to be received. So it's likely that Mayfly devices on the edge of current reception will see no reception with a larger packet of JSON data. An adaptive transmission method could adjust for this, but it will be up to the Mayfly software to try adapting.
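
As a sketch of what "adaptive" could mean on the device side (the RSSI thresholds here are invented for illustration, not measured values), the logger could simply cap the batch size from the current signal quality:

def pick_batch_size(rssi_dbm, max_batch=16):
    """Shrink the batch (and therefore the packet) as the signal weakens."""
    if rssi_dbm > -70:      # strong signal: full batch
        return max_batch
    if rssi_dbm > -85:      # middling signal: smaller packets
        return max(1, max_batch // 4)
    return 1                # marginal link: fall back to single-point posts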

tpwrules commented 1 year ago

I have some testing scripts that I wrote to test my copy of the server, which I used to target and guide my optimization. I did test these changes and ideas without the devices. My perspective was that there were a few simple, low-risk changes which would give a large benefit. Hopefully they can be implemented soon.

Unfortunately, it seems from conversations that there is simply no funding available right now to increase maturity in these ways. I recognize those things are needed too, but that stuff isn't free. I also can't donate infinite time to fix them.

neilh10 commented 1 year ago

I wonder if you could share the testing scripts. I have thought of checking out Node to write some; however, the response of the system has improved since I identified bottlenecks: https://github.com/ODM2/ODM2DataSharingPortal/issues/641

I guess my suggestion is to discuss the planning for the integration cycle that this JSON extension implies, should it ever be implemented.
To reiterate, from an engineering evaluation, I would see this extension being tested for completeness in a standalone manner. From past experience with software systems, I would suggest this is the lowest cost, in core cloud-support human hours, of ensuring the extension is implemented on a stable system with few surprises down the road. Though this is just my 2 cents.

It seems to me that for a cloud-based server it's very typical to discuss the throughput - a hard subject, and one that typically needs characterizing. That throughput number then becomes part of the discussion on the evolution of the software system.

I'm taking an educated guess that the proposed JSON extension could improve the throughput by a factor of X (say 4 if there are 4 sampling records per batch) - so it could be very valuable for the discussion on throughput.

For a business, the throughput would be a business secret tied to managing the hoped-for growth in devices.

For an open-source system with a number of people contributing, it seems like this should be a visible, confidence-boosting number that is discussed, with targets identified in the project management - at the least so that it doesn't slip backwards with any change.

aufdenkampe commented 4 weeks ago

We're moving forward on deploying these features with this PR:

Which was cherry-picked from the original:

neilh10 commented 4 weeks ago

Just wondering if this will be released on a staging server, or if it's a big bang.

aufdenkampe commented 4 weeks ago

@neilh10, yes, #732 was just deployed to the staging server today! See: