ODM2 / ODM2DataSharingPortal

A Python-Django web application enabling users to upload, share, and display data from their environmental monitoring sites via the app's ODM2 database. Data can be automatically streamed from Internet of Things (IoT) devices, manually uploaded via CSV files, or manually entered into forms.
BSD 3-Clause "New" or "Revised" License

potential db compression settings? #665

Closed neilh10 closed 7 months ago

neilh10 commented 1 year ago

It's likely that systems in the field stop transmitting readings and then reconnect at a later date and upload their stored readings. #485

It would be helpful to have a model of what historical timeframe is supported.

I'm just wondering what the current internal TimescaleDB compression settings are, if any. https://www.timescale.com/features/compression

Some examples:

#658 - three systems were interrupted Mar 12, reconnected May 23rd, and uploaded their stored data when restored.

Salmon Creek had its solar power cable pulled May 15 and restored June 27th, and on reconnection transmitted its backlog of data: https://monitormywatershed.org/sites/TUCA_Sa01/
Mill Creek was installed June 12 and stopped a couple of days later: https://monitormywatershed.org/sites/TUCA_Mi06/
The very remote Navarro Creek stopped connecting June 2: https://monitormywatershed.org/sites/TUCA-Na13/

As I understand it, TimescaleDB compresses data older than some historical timeframe; if a POST later arrives for that period, it must decompress the rows associated with the device, then insert the new record. That decompression will take time, and could exceed the 10-second timeout the device waits for an HTTP 201.
In that case the device will retry on its next connection, and continue retrying until an HTTP 201 is received.
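The retry behavior described above lives in the device firmware, not in this repo; as a minimal sketch (where `post_fn` is a hypothetical stand-in for the device's HTTP client, which either returns a status code or raises `TimeoutError`):

```python
def post_with_retry(post_fn, payload, timeout_s=10, max_tries=5):
    """Retry a data POST until the server acknowledges with HTTP 201.

    `post_fn` is a hypothetical callable standing in for the device's
    HTTP client. A timeout (e.g. the server still decompressing a
    chunk) is treated as "try again on the next connection cycle".
    """
    for _attempt in range(max_tries):
        try:
            status = post_fn(payload, timeout=timeout_s)
        except TimeoutError:
            continue  # no HTTP 201 within the window; retry later
        if status == 201:
            return True  # server stored the record
    return False  # give up for now; data stays queued on the device
```

In the real logger the retries are spread across connection cycles rather than a tight loop, but the invariant is the same: data is only dropped from the device queue once a 201 arrives.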

I'm guessing that on some schedule TimescaleDB will then recompress the data.

neilh10 commented 1 year ago

I guess I'm not clear on whether MMW / ODM2 is using TimescaleDB or InfluxDB. InfluxDB is referenced in the architecture diagram from 4 years ago: https://github.com/ODM2/ODM2DataSharingPortal/blob/main/doc/ArchitectureDiagram/Data%20Sharing%20Portal%20Architecture%20with%20Logos%20-%20Copy.png

aufdenkampe commented 7 months ago

@neilh10, we migrated from InfluxDB to the PostgreSQL Timescale extension (i.e., TimescaleDB) with our v0.12 release in December 2021. We've unfortunately gotten behind on updating all of our documentation to reflect the cumulative changes since then (outside of our release notes).

To answer the original question: we're compressing old data into 90-day chunks that are auto-created every 90 days, with a 1-month buffer, so there is always at least 1 month of uncompressed data. See https://github.com/ODM2/ODM2DataSharingPortal/issues/502#issuecomment-942649040
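For readers unfamiliar with how such a policy is expressed, a hedged sketch of a 90-day-chunk / 1-month-buffer setup using TimescaleDB's standard DDL (the table and column names here are illustrative, not the portal's actual schema):

```sql
-- Hypothetical hypertable with 90-day chunks
SELECT create_hypertable(
    'sensor_values', 'value_datetime',
    chunk_time_interval => INTERVAL '90 days'
);

-- Enable compression, segmenting by the series identifier
ALTER TABLE sensor_values SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'result_id'
);

-- Compress chunks once their data is older than 1 month,
-- so at least 1 month of data is always uncompressed
SELECT add_compression_policy('sensor_values', INTERVAL '1 month');
```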

Regardless, the server load required to insert historical data should improve substantially when we implement #674 with our planned v0.18 Milestone.

aufdenkampe commented 7 months ago

Closing to track work on this under:

neilh10 commented 7 months ago

@aufdenkampe thanks for the update. I'm just wondering what the impact is when a POST is made with a timestamp that falls in a range that is already compressed.
Of course this issue will probably still exist with #688, and I wonder what the effect of it being queued will be.

aufdenkampe commented 7 months ago

The effect of a post of data into a compressed time chunk is that the server needs to use substantially more resources to insert the data, because the entire chunk needs to get decompressed, appended, then compressed again.

This only applies to inserts and updates. Reading from a compressed chunk is fast.
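The decompress-append-recompress cycle can also be driven manually with TimescaleDB's chunk-level functions, which is roughly what the server pays for implicitly on a late insert (chunk and table names below are hypothetical):

```sql
-- List compressed chunks covering old data
SELECT show_chunks('sensor_values', older_than => INTERVAL '3 months');

-- Decompress the chunk that the late-arriving row falls into
SELECT decompress_chunk('_timescaledb_internal._hyper_1_2_chunk');

-- ...insert the late-arriving rows here...

-- Recompress once the backfill is done
SELECT compress_chunk('_timescaledb_internal._hyper_1_2_chunk');
```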

#688 does alleviate the problem, because it spreads out the workload. We don't mind our server doing the work. The issue is just that too many POSTs at once leads to the server getting overloaded in the minute they all arrive. We have plenty of time when the server CPU is idle; the point of SQS is to spread the work into that idle time.
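The load-smoothing idea can be sketched in a few lines of Python: bursty POSTs are acknowledged immediately and enqueued, and a single worker drains the queue at its own pace during otherwise idle time. This is a toy in-memory stand-in for the SQS-based design, not the portal's actual implementation:

```python
import queue
import threading

def smooth_inserts(insert_fn):
    """Accept bursty payloads instantly; insert them at the worker's pace.

    `insert_fn` is a hypothetical callable that does the (expensive)
    database insert. Returns the queue to put payloads on and the
    worker thread. Put `None` on the queue to stop the worker.
    """
    q = queue.Queue()

    def worker():
        while True:
            payload = q.get()
            if payload is None:  # sentinel: shut down cleanly
                q.task_done()
                break
            insert_fn(payload)  # heavy work happens off the request path
            q.task_done()
            # A real worker could also throttle here (e.g. sleep between
            # batches) to keep CPU headroom for interactive requests.

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return q, t
```

With SQS the queue lives outside the web server, so a burst of device reconnections fills the queue in seconds while inserts proceed steadily for as long as they need to.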