aerosense-ai / data-gateway

Data influx for Aerosense.
https://www.aerosense.ai/

Reduce BigQuery costs #126

Closed thclark closed 1 year ago

thclark commented 1 year ago

Feature request

We need to reduce spiralling table costs. I thought I’d look into the best strategies for sorting this beforehand, and figured that rather than barfing out suggestions in the meeting it might be better to put my thoughts down on paper.

Background:

BigQuery is columnar and, by default, a query scans the full contents of every column it references: the WHERE clause filters rows out of the result, but it doesn't reduce the bytes read or billed. So a query for the last 24 hours of data still reads every historical value in the columns it touches. We hadn't anticipated this; Yuriy pointed it out late last year, so we've known about it for a while.

Our research back then suggested that clustering and partitioning might be good strategies for reducing this (they both have pros/cons) but we felt it wasn’t worth it at that stage, as the DB was costing on average less than a hundred bucks a month. There’s no point spending a few thousand pounds to save a few hundred.
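For concreteness, the sort of change we were weighing up looks roughly like this (a sketch only: the new table name is hypothetical, the column names come from our existing queries, and the clustering order is illustrative rather than decided):

```sql
-- Sketch: rebuild a sensor data table with daily partitioning plus clustering.
-- The target table name is hypothetical; the clustering order is illustrative.
CREATE TABLE `aerosense-twined.greta.sensor_data_partitioned`
PARTITION BY DATE(datetime)
CLUSTER BY installation_reference, node_id
AS
SELECT *
FROM `aerosense-twined.greta.sensor_data`;
```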

Today’s research:

I’ve spent today planning and researching clustering and partitioning in more depth. I think that:

I think we should:

Plan of Action

I think if we do this together as a group, we could block out a whole day and just get it done. That's less efficient with our time, but it gets the work over with in one go and avoids the delays of asynchronous communication spread across many days.

thclark commented 1 year ago

Queries

aerosense-tools

get_sensor_data

        data_query = f"""
        SELECT *
        FROM `{DATASET_NAME}.sensor_data_{sensor_type_reference}`
        WHERE datetime BETWEEN @start AND @finish
        AND installation_reference = @installation_reference
        AND node_id = @node_id
        ORDER BY datetime
        """

get_sensor_data_at_datetime

        query = f"""
        SELECT *
        FROM `{DATASET_NAME}.sensor_data_{sensor_type_reference}`
        WHERE datetime >= @start_datetime
        AND datetime < @finish_datetime
        AND installation_reference = @installation_reference
        AND node_id = @node_id
        ORDER BY datetime
        """

get_aggregated_connection_statistics

        query = f"""
        SELECT datetime, filtered_rssi, raw_rssi, tx_power, allocated_heap_memory
        FROM `{DATASET_NAME}.connection_statistics_agg`
        WHERE datetime BETWEEN @start AND @finish
        AND installation_reference = @installation_reference
        AND node_id = @node_id
        ORDER BY datetime
        """

get_microphone_metadata

        query = f"""
        SELECT *
        FROM `{DATASET_NAME}.microphone_data`
        WHERE datetime BETWEEN @start AND @finish
        AND installation_reference = @installation_reference
        AND node_id = @node_id
        """

get_installations


        query = f"""
        SELECT reference, turbine_id, location
        FROM `{DATASET_NAME}.installation`
        ORDER BY reference
        """

get_sensor_types

        query = f"""
        SELECT name, metadata
        FROM `{DATASET_NAME}.sensor_type`
        ORDER BY name
        """

get_nodes

        query = f"""
        SELECT node_id FROM `{DATASET_NAME}.sensor_data`
        WHERE installation_reference = @installation_reference
        GROUP BY node_id
        ORDER BY node_id
        """
thclark commented 1 year ago

Note: an additional count query is run before the get_sensor_data query. We should consider folding it into the original query (for improved performance, avoiding a duplicate round trip). In clustering terms it applies the same conditions, so it shouldn't affect us here.

    count_query = f"""
        SELECT COUNT(datetime)
        FROM `{DATASET_NAME}.sensor_data_{sensor_type_reference}`
        WHERE datetime BETWEEN @start AND @finish
        AND installation_reference = @installation_reference
        AND node_id = @node_id
    """
thclark commented 1 year ago

Materialized view queries

        SELECT
            datetime,
            sensor_value[ORDINAL(1)] AS filtered_rssi,
            sensor_value[ORDINAL(2)] AS raw_rssi,
            sensor_value[ORDINAL(3)] AS tx_power,
            sensor_value[ORDINAL(4)] AS allocated_heap_memory,
            configuration_id,
            installation_reference,
            node_id
        FROM `aerosense-twined.greta.sensor_data`
        WHERE sensor_type_reference = 'connection_statistics'
        AND IS_NAN(sensor_value[ORDINAL(1)]) IS FALSE
        AND IS_NAN(sensor_value[ORDINAL(2)]) IS FALSE
        AND IS_NAN(sensor_value[ORDINAL(3)]) IS FALSE
        AND IS_NAN(sensor_value[ORDINAL(4)]) IS FALSE;

sensor_data_accelerometer

        SELECT
            datetime,
            sensor_value[SAFE_ORDINAL(1)],
            sensor_value[SAFE_ORDINAL(2)],
            sensor_value[SAFE_ORDINAL(3)],
            configuration_id,
            installation_reference,
            node_id
        FROM `aerosense-twined.greta.sensor_data`
        WHERE sensor_type_reference = "accelerometer"
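For completeness, turning one of these into an actual materialized view would look something like the sketch below. The column aliases are hypothetical, and BigQuery restricts which query shapes a materialized view may use, so this would need checking against those limits:

```sql
-- Sketch: the accelerometer query wrapped as a materialized view.
-- The x/y/z aliases are hypothetical names for the three array elements.
CREATE MATERIALIZED VIEW `aerosense-twined.greta.sensor_data_accelerometer` AS
SELECT
    datetime,
    sensor_value[SAFE_ORDINAL(1)] AS x,
    sensor_value[SAFE_ORDINAL(2)] AS y,
    sensor_value[SAFE_ORDINAL(3)] AS z,
    configuration_id,
    installation_reference,
    node_id
FROM `aerosense-twined.greta.sensor_data`
WHERE sensor_type_reference = "accelerometer";
```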
cortadocodes commented 1 year ago

Comparing queries against the non-clustered (left) and clustered (right) sensor data tables: clustered comes out much better! The data processed is reduced by a factor of almost 2000 😮

(Screenshot: 2023-06-21 at 12:43:46)

time-trader commented 1 year ago

After looking into it more, it seems that clustering on "datetime" is about query performance rather than fine-grained control over the amount of data processed. So in future, if we need to optimise costs further for a more scalable system, we should probably use a hybrid approach:

Daily partitioning, plus

Clustering of the partitioned tables in this order:

In this case there is no need to cluster on measurement_campaign_reference, as a campaign usually lasts more than one day, so within any single day's partition it is effectively constant.
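As a concrete sketch of that hybrid approach (the table name is hypothetical and the clustering order is illustrative only, omitting measurement_campaign_reference for the reason above):

```sql
-- Sketch of the hybrid approach: daily partitioning plus clustering.
-- Table name is hypothetical; clustering order is illustrative, not a decision.
CREATE TABLE `aerosense-twined.greta.sensor_data_hybrid`
PARTITION BY DATE(datetime)
CLUSTER BY installation_reference, node_id, sensor_type_reference
AS
SELECT * FROM `aerosense-twined.greta.sensor_data`;
```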

Alternatively, if we choose not to partition, the order can be:

time-trader commented 1 year ago

Having given it more thought: each sensor attachment we produce can have its own id, like "kraken-of-the-abyss-barometers", "bloom-of-the-summer-barometers" or "submarine-of-the-zürichsee-differential-pressure", with those ids included in the "installation" table. The sensor data table can then entirely ditch three columns ("installation_reference", "node_id" and "sensor_type") and be clustered just on these unique ids. This, plus daily partitioning, would be fully scalable.
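A sketch of that scheme (table name, column names and types are all hypothetical): the single attachment id stands in for the three dropped columns, with the "installation" table mapping ids back to installation, node and sensor type:

```sql
-- Sketch of the attachment-id scheme; all names and types here are hypothetical.
-- attachment_reference replaces installation_reference, node_id and sensor_type;
-- the "installation" table maps each id back to those three.
CREATE TABLE `aerosense-twined.greta.sensor_data_v2` (
    attachment_reference STRING,   -- e.g. "kraken-of-the-abyss-barometers"
    datetime TIMESTAMP,
    configuration_id STRING,
    sensor_value ARRAY<FLOAT64>
)
PARTITION BY DATE(datetime)
CLUSTER BY attachment_reference;
```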