get_sensor_data
data_query = f"""
SELECT *
FROM `{DATASET_NAME}.sensor_data_{sensor_type_reference}`
WHERE datetime BETWEEN @start AND @finish
AND installation_reference = @installation_reference
AND node_id = @node_id
ORDER BY datetime
"""
get_sensor_data_at_datetime
query = f"""
SELECT *
FROM `{DATASET_NAME}.sensor_data_{sensor_type_reference}`
WHERE datetime >= @start_datetime
AND datetime < @finish_datetime
AND installation_reference = @installation_reference
AND node_id = @node_id
ORDER BY datetime
"""
get_aggregated_connection_statistics
query = f"""
SELECT datetime, filtered_rssi, raw_rssi, tx_power, allocated_heap_memory
FROM `{DATASET_NAME}.connection_statistics_agg`
WHERE datetime BETWEEN @start AND @finish
AND installation_reference = @installation_reference
AND node_id = @node_id
ORDER BY datetime
"""
get_microphone_metadata
query = f"""
SELECT *
FROM `{DATASET_NAME}.microphone_data`
WHERE datetime BETWEEN @start AND @finish
AND installation_reference = @installation_reference
AND node_id = @node_id
"""
get_installations
query = f"""
SELECT reference, turbine_id, location
FROM `{DATASET_NAME}.installation`
ORDER BY reference
"""
get_sensor_types
query = f"""
SELECT name, metadata
FROM `{DATASET_NAME}.sensor_type`
ORDER BY name
"""
get_nodes
query = f"""
SELECT node_id FROM `{DATASET_NAME}.sensor_data`
WHERE installation_reference = @installation_reference
GROUP BY node_id
ORDER BY node_id
"""
Note: an additional query to get the row count is run before the get_sensor_data query. We should consider folding this into the original query (for improved performance, avoiding a duplicate query roundtrip); see the sketch after the count query below. In terms of clustering it applies the same conditions, so it shouldn't affect us here.
count_query = f"""
SELECT COUNT(datetime)
FROM `{DATASET_NAME}.sensor_data_{sensor_type_reference}`
WHERE datetime BETWEEN @start AND @finish
AND installation_reference = @installation_reference
AND node_id = @node_id
"""
connection_statistics_agg
SELECT
    datetime,
    sensor_value[ORDINAL(1)] AS filtered_rssi,
    sensor_value[ORDINAL(2)] AS raw_rssi,
    sensor_value[ORDINAL(3)] AS tx_power,
    sensor_value[ORDINAL(4)] AS allocated_heap_memory,
    configuration_id,
    installation_reference,
    node_id
FROM `aerosense-twined.greta.sensor_data`
WHERE sensor_type_reference = 'connection_statistics'
AND IS_NAN(sensor_value[ORDINAL(1)]) IS FALSE
AND IS_NAN(sensor_value[ORDINAL(2)]) IS FALSE
AND IS_NAN(sensor_value[ORDINAL(3)]) IS FALSE
AND IS_NAN(sensor_value[ORDINAL(4)]) IS FALSE;
sensor_data_accelerometer
SELECT
    datetime,
    sensor_value[SAFE_ORDINAL(1)],
    sensor_value[SAFE_ORDINAL(2)],
    sensor_value[SAFE_ORDINAL(3)],
    configuration_id,
    installation_reference,
    node_id
FROM `aerosense-twined.greta.sensor_data`
WHERE sensor_type_reference = "accelerometer"
Comparing queries against the non-clustered and clustered sensor data tables, the clustered table comes out much better! The data processed is reduced by a factor of almost 2000 😮
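For reference, a rough sketch of how that comparison can be reproduced with the google-cloud-bigquery Python client. The `sensor_data_clustered` table name is hypothetical, and the jobs are actually run (rather than dry-run) because a dry-run estimate doesn't account for cluster pruning:

```python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(use_query_cache=False)  # Don't let cached results hide the scan cost.

# "sensor_data_clustered" is a hypothetical name for the clustered copy of the table.
for table in ("sensor_data", "sensor_data_clustered"):
    job = client.query(
        f"""
        SELECT datetime, sensor_value
        FROM `aerosense-twined.greta.{table}`
        WHERE sensor_type_reference = "accelerometer"
        """,
        job_config=job_config,
    )
    job.result()  # Wait for completion; cluster pruning only shows up in the final billed bytes.
    print(f"{table}: {job.total_bytes_billed / 1e9:.2f} GB billed")
```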
After looking into it more, it seems that clustering on "datetime" is about performance optimisation rather than fine-grained control over how much data is processed. So in future, if needed, to optimise costs further for a more scalable system we should probably use a hybrid approach:

- partitioning daily, plus
- clustering the partitioned tables with this order:

In this case there is no need to cluster on measurement_campaign_reference, as measurement campaigns usually last more than one day.

Alternatively, if we choose not to partition, the order can be:
Having given it more thought: each sensor attachment we produce could have its own id, like "kraken-of-the-abyss-barometers", "bloom-of-the-summer-barometers", "submarine-of-the-zürichsee-differential-pressure", etc., with those ids included in the "installation" table. The sensor data table could then ditch three columns entirely ("installation_reference", "node_id" and "sensor_type") and be clustered just by these unique ids. This, plus daily partitioning, would be fully scalable.
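A rough DDL sketch of that idea, in the same f-string style as above. The `sensor_data_v2` table name, the `attachment_reference` column name and all the field types are assumptions for illustration, not decisions:

```python
create_table_query = f"""
CREATE TABLE `{DATASET_NAME}.sensor_data_v2` (
    datetime TIMESTAMP,
    attachment_reference STRING,  -- e.g. "kraken-of-the-abyss-barometers"; replaces installation_reference, node_id and sensor_type
    configuration_id STRING,
    sensor_value ARRAY<FLOAT64>
)
PARTITION BY DATE(datetime)      -- daily partitioning
CLUSTER BY attachment_reference  -- cluster just by the unique attachment id
"""
```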
Feature request
We need to reduce spiralling table costs. I thought I’d look into the best strategies for sorting this beforehand, and figured that rather than barfing out suggestions in the meeting it might be better to put my thoughts down on paper.
Background:
BigQuery is columnar, so a query reads every value in each referenced column across the whole table, even when the WHERE clause filters most rows out. This means that if you query for data from the last 24 hours, all the values from before then still get scanned (and billed) while processing the query. We hadn't anticipated this; it was pointed out by Yuriy late last year, so we've known about it for a while.
Our research back then suggested that clustering and partitioning might be good strategies for reducing this (they both have pros/cons) but we felt it wasn’t worth it at that stage, as the DB was costing on average less than a hundred bucks a month. There’s no point spending a few thousand pounds to save a few hundred.
Today’s research:
I’ve spent today planning and researching more deeply into clustering and partitioning. I think that:
I think we should:
Cluster the sensor_data table and possibly the microphone_data table while we’re at it. Cluster indexes should be in one of these orders:
To optimise costs on current queries (reviewed below):
To optimise costs for more scalable system in future
But we can’t apply clustering to the existing table(s) meaning we need a table migration
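A minimal sketch of one way to do that migration: BigQuery can create a new clustered (and partitioned) table directly from a SELECT over the existing one, after which the tools can be repointed and the old table dropped. The `sensor_data_clustered` name and the clustering column order shown here are illustrative only, not the decided order:

```python
migration_query = f"""
CREATE TABLE `{DATASET_NAME}.sensor_data_clustered`
PARTITION BY DATE(datetime)
CLUSTER BY installation_reference, node_id, sensor_type_reference
AS
SELECT *
FROM `{DATASET_NAME}.sensor_data`
"""
```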
Plan of Action
- make sure connection_statistics_agg points to connection_statistics
- (meanwhile) update aerosense-tools, the data-gateway ingress and the dashboard to point to the new tables
- deploy the updated set of tools

I think if we all do this as a group, we could block out a whole day and have it just done. It'll be inefficient on our time doing it that way, but it'll get it over with, avoiding asynchronous communication delays and the work spanning many days.