get_sensor_data
data_query = f"""
SELECT *
FROM `{DATASET_NAME}.sensor_data_{sensor_type_reference}`
WHERE datetime BETWEEN @start AND @finish
AND installation_reference = @installation_reference
AND node_id = @node_id
ORDER BY datetime
"""
get_sensor_data_at_datetime
query = f"""
SELECT *
FROM `{DATASET_NAME}.sensor_data_{sensor_type_reference}`
WHERE datetime >= @start_datetime
AND datetime < @finish_datetime
AND installation_reference = @installation_reference
AND node_id = @node_id
ORDER BY datetime
"""
get_aggregated_connection_statistics
query = f"""
SELECT datetime, filtered_rssi, raw_rssi, tx_power, allocated_heap_memory
FROM `{DATASET_NAME}.connection_statistics_agg`
WHERE datetime BETWEEN @start AND @finish
AND installation_reference = @installation_reference
AND node_id = @node_id
ORDER BY datetime
"""
get_microphone_metadata
query = f"""
SELECT *
FROM `{DATASET_NAME}.microphone_data`
WHERE datetime BETWEEN @start AND @finish
AND installation_reference = @installation_reference
AND node_id = @node_id
"""
get_installations
query = f"""
SELECT reference, turbine_id, location
FROM `{DATASET_NAME}.installation`
ORDER BY reference
"""
get_sensor_types
query = f"""
SELECT name, metadata
FROM `{DATASET_NAME}.sensor_type`
ORDER BY name
"""
get_nodes
query = f"""
SELECT node_id FROM `{DATASET_NAME}.sensor_data`
WHERE installation_reference = @installation_reference
GROUP BY node_id
ORDER BY node_id
"""
Note: an additional query to get the row count is run before the get_sensor_data query. We should consider folding this into the original query (for improved performance, avoiding a duplicate query roundtrip); see the sketch after the count query below. In terms of clustering it applies the same conditions, so it shouldn't affect us here.
count_query = f"""
SELECT COUNT(datetime)
FROM `{DATASET_NAME}.sensor_data_{sensor_type_reference}`
WHERE datetime BETWEEN @start AND @finish
AND installation_reference = @installation_reference
AND node_id = @node_id
"""
connection_statistics_agg
SELECT
    datetime,
    sensor_value[ORDINAL(1)] AS filtered_rssi,
    sensor_value[ORDINAL(2)] AS raw_rssi,
    sensor_value[ORDINAL(3)] AS tx_power,
    sensor_value[ORDINAL(4)] AS allocated_heap_memory,
    configuration_id,
    installation_reference,
    node_id
FROM `aerosense-twined.greta.sensor_data`
WHERE sensor_type_reference = 'connection_statistics'
AND IS_NAN(sensor_value[ORDINAL(1)]) IS FALSE
AND IS_NAN(sensor_value[ORDINAL(2)]) IS FALSE
AND IS_NAN(sensor_value[ORDINAL(3)]) IS FALSE
AND IS_NAN(sensor_value[ORDINAL(4)]) IS FALSE;
sensor_data_accelerometer
SELECT
    datetime,
    sensor_value[SAFE_ORDINAL(1)],
    sensor_value[SAFE_ORDINAL(2)],
    sensor_value[SAFE_ORDINAL(3)],
    configuration_id,
    installation_reference,
    node_id
FROM `aerosense-twined.greta.sensor_data`
WHERE sensor_type_reference = "accelerometer"
Comparing queries against the non-clustered and clustered sensor data tables, the clustered table comes out much better! The data processed is reduced by a factor of almost 2000 😮
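For reference, a rough sketch of how that comparison can be reproduced with the google-cloud-bigquery Python client. The `sensor_data_clustered` table name is hypothetical, and the jobs are actually run (rather than dry-run) because a dry-run estimate doesn't account for cluster pruning:

```python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(use_query_cache=False)  # Don't let cached results hide the scan cost.

# "sensor_data_clustered" is a hypothetical name for the clustered copy of the table.
for table in ("sensor_data", "sensor_data_clustered"):
    job = client.query(
        f"""
        SELECT datetime, sensor_value
        FROM `aerosense-twined.greta.{table}`
        WHERE sensor_type_reference = "accelerometer"
        """,
        job_config=job_config,
    )
    job.result()  # Wait for completion; cluster pruning only shows up in the final billed bytes.
    print(f"{table}: {job.total_bytes_billed / 1e9:.2f} GB billed")
```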
After looking into it more, it seems that clustering on "datetime" is about performance optimisation rather than fine-grained control over how much data is processed. So in future, if needed, to optimise costs further for a more scalable system we should probably use a hybrid approach:

- partitioning daily, plus
- clustering the partitioned tables with this order:

In this case there is no need to cluster on measurement_campaign_reference, as measurement campaigns usually last more than one day.

Alternatively, if we choose not to partition, the order can be:
Having given it more thought: each sensor attachment we produce could have its own id, like "kraken-of-the-abyss-barometers", "bloom-of-the-summer-barometers", "submarine-of-the-zürichsee-differential-pressure", etc., with those ids included in the "installation" table. The sensor data table could then ditch three columns entirely ("installation_reference", "node_id" and "sensor_type") and be clustered just by these unique ids. This, plus daily partitioning, would be fully scalable.
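A rough DDL sketch of that idea, in the same f-string style as above. The `sensor_data_v2` table name, the `attachment_reference` column name and all the field types are assumptions for illustration, not decisions:

```python
create_table_query = f"""
CREATE TABLE `{DATASET_NAME}.sensor_data_v2` (
    datetime TIMESTAMP,
    attachment_reference STRING,  -- e.g. "kraken-of-the-abyss-barometers"; replaces installation_reference, node_id and sensor_type
    configuration_id STRING,
    sensor_value ARRAY<FLOAT64>
)
PARTITION BY DATE(datetime)      -- daily partitioning
CLUSTER BY attachment_reference  -- cluster just by the unique attachment id
"""
```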
Feature request
We need to reduce spiralling table costs. I thought I’d look into the best strategies for sorting this beforehand, and figured that rather than barfing out suggestions in the meeting it might be better to put my thoughts down on paper.
Background:
BigQuery is columnar, so a query reads every value in each referenced column across the whole table, even when the WHERE clause filters most rows out. This means that if you query for data from the last 24 hours, all the values from before then still get scanned (and billed) while processing the query. We hadn't anticipated this; it was pointed out by Yuriy late last year, so we've known about it for a while.
Our research back then suggested that clustering and partitioning might be good strategies for reducing this (they both have pros/cons) but we felt it wasn’t worth it at that stage, as the DB was costing on average less than a hundred bucks a month. There’s no point spending a few thousand pounds to save a few hundred.
Today’s research:
I’ve spent today planning and researching more deeply into clustering and partitioning. I think that:
I think we should:
Cluster the sensor_data table and possibly the microphone_data table while we’re at it. Cluster indexes should be in one of these orders:
To optimise costs on current queries (reviewed below):
To optimise costs for more scalable system in future
But we can’t apply clustering to the existing table(s) meaning we need a table migration
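A minimal sketch of one way to do that migration: BigQuery can create a new clustered (and partitioned) table directly from a SELECT over the existing one, after which the tools can be repointed and the old table dropped. The `sensor_data_clustered` name and the clustering column order shown here are illustrative only, not the decided order:

```python
migration_query = f"""
CREATE TABLE `{DATASET_NAME}.sensor_data_clustered`
PARTITION BY DATE(datetime)
CLUSTER BY installation_reference, node_id, sensor_type_reference
AS
SELECT *
FROM `{DATASET_NAME}.sensor_data`
"""
```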
Plan of Action
- make sure connection_statistics_agg points to connection_statistics
- (meanwhile) update aerosense-tools, the data-gateway ingress and the dashboard to point to the new tables
- deploy the updated set of tools

I think if we all do this as a group, we could block out a whole day and have it just done. It'll be inefficient on our time doing it that way, but it'll get it over with, avoiding asynchronous communication delays and the work spanning many days.