Open freol35241 opened 3 years ago
Hey! Are you still looking for help? I did some experimentation with my HA instance.
Table "public.ltest"
Column | Type | Collation | Nullable | Default
---------------------+--------------------------+-----------+----------+---------
time | timestamp with time zone | | not null |
entity_id | text | | not null |
friendly_name | text | | |
device_class | text | | |
state | text | | not null |
state_class | text | | |
unit_of_measurement | text | | |
attributes | jsonb | | not null |
location | geometry(Point,4326) | | |
Compression settings: segmentby = 'device_class, entity_id', orderby = 'time'. The extra columns are extracted out of attributes since they are frequent.
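A minimal sketch of how compression settings like these could be applied (table name ltest and the 7-day policy interval are assumptions based on the schema shown above):

```sql
-- Enable native compression on the hypertable (sketch; names assumed).
ALTER TABLE ltest SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_class, entity_id',
    timescaledb.compress_orderby   = 'time'
);

-- Automatically compress chunks older than 7 days (interval is an assumption).
SELECT add_compression_policy('ltest', INTERVAL '7 days');
```

The segmentby columns become single-column indexes on the compressed chunks, which is why the choice of device_class and entity_id matters for query performance.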
Final database is 38 MB for 4.2M metrics, a 98% compression rate, ~6 MB/week. I don't expect to need a data retention policy for quite some time! I'm still experimenting with indexes on device_class, entity_id and device_class. Note that compressed chunks are only indexed on these three columns since the rest is compressed.
I also tried to add computed columns state_bool and state_float with a custom try_cast_float function (returns float or null), but it caused some problems with ltss compatibility, so for now I have a simple view that adds these columns.
@Yamakaky Very interesting, thanks for reporting back!
Can you please clarify:
I'm still experimenting with indexes on device_class, entity_id and device_class. Note that compressed chunks are only indexed on these three columns since the rest is compressed.
Did you mean device_class, entity_id and time rather than device_class, entity_id and device_class?
Also, for my understanding of your tests: since the compressed chunks are not indexed on the additional columns you extract from attributes (except device_class), what is the benefit of extracting these (except convenience)?
My main issue with the compression feature of timescaledb (when I first had a look at it) was that schema changes were not supported at all. Do you know if this has changed for the better?
No, I meant indexes with different combinations of these three columns + time. For now I haven't done much more on that subject since the indexes mostly depend on which queries are used.
Yes, the compressed chunks only have single-column indexes on the group_by columns. The other indexes are only valid for the uncompressed days. This is still useful; for example, I have a view with time > now() - interval '7 days' with the last value for every entity.
Some things are supported, but yes schema changes are limited: https://docs.timescale.com/timescaledb/latest/how-to-guides/compression/modify-a-schema/. However, I think the current schema is appropriate. I couldn't find another attribute that would qualify to be promoted as a column.
My current schema:
homeassistant=# \d ltest
Table "public.ltest"
Column | Type | Collation | Nullable | Default
---------------------+--------------------------+-----------+----------+---------
time | timestamp with time zone | | not null |
entity_id | text | | not null |
device_class | text | | |
state | text | | not null |
state_class | text | | |
attributes | jsonb | | not null |
location | geometry(Point,4326) | | |
friendly_name | text | | |
unit_of_measurement | text | | |
Indexes:
"ltest_device_class_time_idx1" btree (device_class, "time") WHERE device_class IS NOT NULL
"ltest_entity_id_time_idx" btree (entity_id, "time" DESC)
"ltest_time_idx" btree ("time")
homeassistant=# \d+ last_values
View "public.last_values"
Column | Type | Collation | Nullable | Default | Storage | Description
---------------------+--------------------------+-----------+----------+---------+----------+-------------
time | timestamp with time zone | | | | plain |
entity_id | text | | | | extended |
device_class | text | | | | extended |
state | text | | | | extended |
state_class | text | | | | extended |
attributes | jsonb | | | | extended |
location | geometry(Point,4326) | | | | main |
friendly_name | text | | | | extended |
unit_of_measurement | text | | | | extended |
View definition:
WITH mat AS MATERIALIZED (
SELECT DISTINCT ON (ltest.entity_id) ltest."time",
ltest.entity_id,
ltest.device_class,
ltest.state,
ltest.state_class,
ltest.attributes,
ltest.location,
ltest.friendly_name,
ltest.unit_of_measurement
FROM ltest
WHERE ltest."time" > (now() - '7 days'::interval)
ORDER BY ltest.entity_id, ltest."time" DESC
)
SELECT mat."time",
mat.entity_id,
mat.device_class,
mat.state,
mat.state_class,
mat.attributes,
mat.location,
mat.friendly_name,
mat.unit_of_measurement
FROM mat;
Stats:
Total size: 525 MB
Compression ratio: 98.2%
Metrics: 7380657
Size per month (compressed): 24.5 MB
Bytes per metric (compressed): 9.23 B
Bytes per metric (uncompressed): 263 B
With this query:
with table_size as (
SELECT hypertable_size
FROM hypertable_size('$table_name')
), waterline as (
select max(range_end) as range_end, min(range_start) as range_start
FROM timescaledb_information.chunks natural join chunk_compression_stats('$table_name')
where is_compressed
), metrics as (
select
count(*) filter (where time < range_end) as count_compressed,
count(*) filter (where time > range_end) as count_uncompressed
from $table_name, waterline
), size as (
select
sum(after_compression_total_bytes) as size_compressed
from chunk_compression_stats('$table_name')
)
select
count_uncompressed + count_compressed as "Metrics",
size_compressed::float / extract(epoch FROM range_end - range_start) * 86400 * 30 as "Size per month (compressed)",
size_compressed::float / count_compressed as "Bytes per metric (compressed)",
(hypertable_size::float - size_compressed) / count_uncompressed as "Bytes per metric (uncompressed)"
from table_size, metrics, size, waterline
I also have this to parse state as bool and float; it should be a proper column on ltss instead, but it was easier to work with without modifying this module:
homeassistant=# \d+ ltestv
View "public.ltestv"
Column | Type | Collation | Nullable | Default | Storage | Description
---------------------+--------------------------+-----------+----------+---------+----------+-------------
time | timestamp with time zone | | | | plain |
entity_id | text | | | | extended |
device_class | text | | | | extended |
state | text | | | | extended |
state_class | text | | | | extended |
attributes | jsonb | | | | extended |
location | geometry(Point,4326) | | | | main |
friendly_name | text | | | | extended |
unit_of_measurement | text | | | | extended |
state_float | double precision | | | | plain |
state_bool | boolean | | | | plain |
View definition:
SELECT ltest."time",
ltest.entity_id,
ltest.device_class,
ltest.state,
ltest.state_class,
ltest.attributes,
ltest.location,
ltest.friendly_name,
ltest.unit_of_measurement,
try_cast_float(ltest.state) AS state_float,
try_cast_boolean(ltest.state) AS state_bool
FROM ltest;
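The try_cast_float and try_cast_boolean helpers used by the view are not shown in the thread; a minimal sketch of what they could look like (PL/pgSQL, exact behavior is an assumption beyond "returns float or null"):

```sql
-- Returns the state as a float, or NULL if it does not parse (sketch).
CREATE OR REPLACE FUNCTION try_cast_float(v text) RETURNS double precision
LANGUAGE plpgsql IMMUTABLE AS $$
BEGIN
    RETURN v::double precision;
EXCEPTION WHEN others THEN
    RETURN NULL;
END;
$$;

-- Returns true/false for 'on'/'off' states, NULL otherwise (sketch).
CREATE OR REPLACE FUNCTION try_cast_boolean(v text) RETURNS boolean
LANGUAGE sql IMMUTABLE AS $$
    SELECT CASE lower(v) WHEN 'on' THEN true WHEN 'off' THEN false END;
$$;
```

Marking them IMMUTABLE lets the planner fold constants and also allows use inside indexes or continuous aggregates.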
I just tried to enable compression on mine, and I realized the ltss table has an auto-incrementing primary key, which makes compression kind of useless since it now has to segment on that.
Yes, see my schema just above. The timescaledb docs don't include primary keys, just time.
@Yamakaky Did you need to modify the integration?
No, I just ended up adding an insert trigger on ltss to copy data to my hypertable. It would be better to have native support.
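Such a redirect could look roughly like this (a sketch with assumed names; my_hypertable and its column list are placeholders, not from the thread):

```sql
-- Copy every row inserted into ltss over to a custom hypertable (sketch).
CREATE OR REPLACE FUNCTION copy_ltss() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO my_hypertable (time, entity_id, state, attributes)
    VALUES (NEW.time, NEW.entity_id, NEW.state, NEW.attributes);
    RETURN NEW;
END;
$$;

CREATE TRIGGER trg_copy_ltss
    AFTER INSERT ON ltss
    FOR EACH ROW
    EXECUTE FUNCTION copy_ltss();
```

The integration keeps writing to ltss unchanged; the trigger fans the data out to the optimized table.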
Hey folks, apart from working at Timescale, I'm really interested in this for my own HASS instance. After installing, I wanted to enable compression and realized that due to the compound primary key, the id column needs to be part of the segment_by clause which works against the way compression is implemented (columnar storage).
@Yamakaky is going in the right direction, in terms of removing the id column as well as probably implementing a column per data type. I just recently wrote a best-practices blog post on the topic (https://www.timescale.com/blog/best-practices-for-time-series-data-modeling-narrow-medium-or-wide-table-layout-2/) where this is represented as a medium-wide layout.
Happy to help get this integration to implement the best practices (totally unselfish, I know). I'm just not the best person when it comes to Python code.
Alright, I looked a bit into it and here's my current idea. It isn't necessarily the best option yet (not enough data collected to really see the impact), but in case somebody is interested:
begin transaction;
-- enable ltree extension if not yet available
create extension if not exists ltree;
-- new optimized table
create table ltss_hass (
time timestamptz,
entity_key text not null,
state_text text null,
state_float double precision null,
state_int int8 null,
state_bool bool null,
location geometry(Point, 4326) null,
attributes_key ltree null,
primary key (time, entity_key)
);
create index idx_ltss_hass_location
on ltss_hass using gist (location);
create index idx_ltss_hass_time_entity_key
on ltss_hass using btree (time, entity_key);
select create_hypertable('ltss_hass', 'time', chunk_time_interval := interval '7 days');
alter table ltss_hass set (timescaledb.compress, timescaledb.compress_orderby = 'time, entity_key');
select add_compression_policy('ltss_hass', interval '7 days');
-- extracted attributes table
create table ltss_hass_attributes (
attributes_key ltree,
attributes jsonb not null,
exclude using gist (attributes_key with =)
);
-- save old table
alter table ltss rename to ltss_old;
-- create view trigger to redirect inserts
create or replace function redirect_ltss() returns trigger
language plpgsql as $$
declare
attributes_key ltree = null;
begin
if NEW.attributes is not null then
-- convert_to() produces bytea for sha256(); plain text cannot be cast to bytea directly
attributes_key = text2ltree(NEW.entity_id || '.' || encode(sha256(convert_to(regexp_replace(NEW.attributes::text, '\\n', '', 'ng'), 'UTF8')), 'hex'));
end if;
if attributes_key is not null then
insert into ltss_hass_attributes (attributes_key, attributes) values (attributes_key, NEW.attributes) on conflict do nothing;
end if;
insert into ltss_hass (time, entity_key, state_text, state_float, state_int, state_bool, location, attributes_key)
values (
NEW.time,
NEW.entity_id,
case when NEW.state !~ '^[0-9]*[0-9]?\.[0-9]+$' and NEW.state !~ '^[0-9]+$' and NEW.state !~* 'off' and NEW.state !~* 'on' then NEW.state end,
case when NEW.state ~ '^[0-9]*[0-9]?\.[0-9]+$' then NEW.state::double precision end,
case when NEW.state ~ '^[0-9]+$' then NEW.state::int8 end,
case when NEW.state ~* 'on' then true when NEW.state ~* 'off' then false end,
NEW.location,
attributes_key
);
NEW.id = 1;
return NEW;
end;
$$;
-- recreate old table layout as a view
create or replace view ltss as
select
row_number() over (rows unbounded preceding) as id,
l.time,
l.entity_key as entity_id,
case
when l.state_text is not null then l.state_text
when l.state_float is not null then l.state_float::text
when l.state_int is not null then l.state_int::text
when l.state_bool is true then 'on'
when l.state_bool is false then 'off'
end as state,
l.location,
a.attributes
from ltss_hass l
left join ltss_hass_attributes a on l.attributes_key is not null and l.attributes_key = a.attributes_key;
alter view ltss owner to homeassistant;
-- enable redirect trigger
create trigger trg_redirect_ltss
instead of insert
on ltss for each row
execute function redirect_ltss();
-- migrate old data
insert into ltss (time, entity_id, state, location, attributes)
select time, entity_id, state, location, attributes
from ltss_old;
-- remove old ltss table
drop table ltss_old;
-- finished migration
end transaction;
I've extracted the actual state into a set of 4 separate (nullable) fields which will compress to almost nothing if not used and I also extracted the attributes into a separate table with an ltree key which uses a tree-like structure to find values. The actual key is a sha256 of the attributes content to make sure similar data fields reference the same (already stored) attributes value. It's a basic deduplication. At least for my sensors it seems to shrink the number of jsonb entries quite a bit (~67k rows in ltss_hass reference ~4.8k rows in ltss_hass_attributes).
Thank you all for showing such an interest in this integration!
Be aware, some personal opinions ahead!
My stand on this is: the id column should definitely be removed to support the compression features better. I would very much like to hear your thoughts on this, feel free to criticize.
Thank you for creating the initial integration!
No worries about personal opinions. They exist to be discussed and I'm all too happy to do it!
Deduplication savings using the ltree approach may vary by the available entities. For me opnsense creates a lot of duplicates (since the interface metadata almost never changes), but others do too. It may be my specific set of entities/sensors, but right now I'm at 194k to 15k, which is quite a substantial saving.
@noctarius Here's an example of an entity with a seemingly binary state having more than that: "on", "off", "unknown", "unavailable".
You can't quite model this as a boolean because then you'd only have 3 states, true, false and null.
That is actually a very obvious one. How did I miss this? However, I think those two are general HASS states that can happen for almost any entity, can't they? In that case, it should probably be its own column, along the lines of "was the entity unavailable or unknown? Then it won't have a real value at that time". WDYT?
Hmm. Here's another hypothetical situation. Some user configures a sensor (maybe template) that has 4 defined states, "On", "Off", "Running", "Error". Alongside the home assistant unknown and unavailable states. How would this be mapped?
Or perhaps, someone setting up a sensor with "On", "Off", "Open". What would this be mapped as? true, false, true?
What about other languages? Since state is ANY string value, I'm not entirely sure if
I think the biggest problem is that this would be making assumptions that might not always hold true for a given entity and installation, and potentially losing context encoded in the specific value of the state.
It seems to me like any type inference logic here would be potentially fragile. Especially since template entities / helper entities are a thing that can be configured freely by end users.
To add on my use case: in my instance, I added state_* columns since it quickly became a pain to have casting with handling of invalid values in every query. I mainly use floats (temperature etc...) where a NULL for unknown and unavailable is perfectly fine. For the few entities that have specific behaviors like @ruifung describes, having the raw state available in state_text seems fine. One argument could be made to always populate state_text, even if it matches another state_* column.
If the semantic of an entity changes, then the user will have to handle it anyway I think. I already have that for a few entities.
For me, this all comes down to maintainability of the codebase, and, as such, I don't want to go down the road of extra columns and extra logic (which quite possibly needs to be updated every time someone decides to add a new weird combo of state "types" in text format in some random custom component) without a really good reason. Therefore, I will keep the state as a single text column for now.
@Yamakaky If this is mainly a convenience issue with having to handle casting of invalid values in every query, my suggestion is to write an SQL function for your specific use case.
Sorry for the non-answer. Crazy couple of days.
I see the reason why you want to stay with state as a text column only. I still find it pretty horrible from the HASS side to not provide any metadata for the value, but that's fair; the reasoning is pretty sound. I'll adapt my WIP PR over the next few days, just rebasing it onto the 2.0 release (to be able to get it running again). Switching back from my current data set should be easy enough.
I understand your concern, however I don't think it's specific to my use case. I assume most if not all ltss instances will need to cast to floats at some point for processing and visualisation. Maybe then provide such functions with the extension? Alternatively, a view like in my original post with additional state_* columns would work. IIRC I moved to actual columns since it added a noticeable performance gain, though I'm not sure about the specifics.
I adjusted the draft to only have a single text column again. I wonder if a column with the guessed datatype of the value (text, bool, float) may be interesting? That way it may be easier to filter out valid values and ignore pure text states.
@Yamakaky I could imagine, that a Continuous Aggregate may be interesting for your use case, which collected the data pre-aggregated (depending on how you'd like to aggregate them).
That would work, at the cost of duplicate data between the aggregate and the main table. Data rate is pretty slow on my side so it should be fine.
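A continuous aggregate for that could be sketched roughly as follows (hourly averages are an assumed aggregation; ltest and try_cast_float as in the schema above, and the function must be IMMUTABLE to be usable here):

```sql
-- Hourly average of numeric states, kept up to date by a refresh policy (sketch).
CREATE MATERIALIZED VIEW ltest_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket(INTERVAL '1 hour', time) AS bucket,
    entity_id,
    avg(try_cast_float(state)) AS avg_state
FROM ltest
GROUP BY bucket, entity_id;

-- Refresh the last day of data every hour (offsets are assumptions).
SELECT add_continuous_aggregate_policy('ltest_hourly',
    start_offset      => INTERVAL '1 day',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
```

The aggregate does duplicate data, but only the pre-aggregated buckets, which are typically far smaller than the raw rows.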
I recently started using Home Assistant with TimescaleDB and LTSS and it's working great. However, my database is quickly growing. I tried compression, but with the id column there is nothing to be gained. Then I came across this thread and it looks like the perfect solution, so I was wondering whether there are any plans on incorporating this or whether I should create my own fork? Thanks in advance.
@nwerter I am happily accepting a PR which removes the id column (including migration code for existing databases, tests etc). I just haven't got around to it yet, lots of other things which requires attention...
The id column has now been removed starting from v2.1.0, thanks to #94.
@freol35241 It seems to break my setup with compression enabled. Decompressing all the historical data doesn't seem to be feasible. Any options to still move to 2.1.0 ?
Error during connection setup: (psycopg2.errors.FeatureNotSupported) operation not supported on hypertables that have compression enabled [SQL: ALTER TABLE ltss DROP CONSTRAINT ltss_pkey CASCADE, ADD PRIMARY KEY(time,entity_id);] (Background on this error at: https://sqlalche.me/e/20/tw8g) (retrying in 3 seconds)
@velaar I don't think there's any other way than decompressing and disabling compression on the existing hypertable. As an alternative, you could create a new one and slowly copy the old data over chunk by chunk.
@velaar compressed chunks in Timescale can't be updated or deleted, so updating to 2.1.0, which modifies the structure of the hypertable, can only be done on an uncompressed database. Afterwards you can enable compression again, which should also be much more effective without the id column.
In order to help with the alternative suggested by noctarius, Timescale provides some guidance on backfilling historical data: https://docs.timescale.com/use-timescale/latest/compression/backfill-historical-data/
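If decompressing turns out to be an option after all, it can be scripted per chunk; a sketch, assuming the default ltss table name:

```sql
-- Decompress every compressed chunk of the ltss hypertable (sketch).
SELECT decompress_chunk(c, if_compressed => true)
FROM show_chunks('ltss') AS c;

-- Remove the compression policy so chunks stay uncompressed during the migration.
SELECT remove_compression_policy('ltss');
```

Depending on data volume this can take a long time and temporarily needs disk space for the uncompressed chunks.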
@velaar (and all others) You can work around it by removing the id column manually; then the migration will not run. A primary key is not required after all. Just make sure you have a compound index on entity_id and time.
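The manual workaround could be sketched like this (the index name is an assumption; depending on your TimescaleDB version, dropping the column may still require decompressing affected chunks first):

```sql
-- Drop the auto-increment primary key and id column so the 2.1.0
-- migration is skipped (sketch).
ALTER TABLE ltss DROP CONSTRAINT IF EXISTS ltss_pkey;
ALTER TABLE ltss DROP COLUMN IF EXISTS id;

-- Keep a compound index for typical per-entity, time-range queries.
CREATE INDEX IF NOT EXISTS ltss_entity_id_time_idx
    ON ltss (entity_id, time DESC);
```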
Side discussion/question: I manually created a compound index on (entity_id, time). #94 creates an index on (time, entity_id). I'm wondering if there is a difference in performance. My motivation for choosing entity_id first is that basically every query filters on entity_id but not necessarily on time.
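One way to settle this for a given workload is to compare plans for a representative query under each index; a sketch (the entity name is a made-up example):

```sql
-- See which index the planner picks for a typical per-entity query (sketch).
EXPLAIN (ANALYZE, BUFFERS)
SELECT time, state
FROM ltss
WHERE entity_id = 'sensor.outdoor_temperature'  -- assumed example entity
  AND time > now() - INTERVAL '7 days'
ORDER BY time DESC;
```

In general, with (entity_id, time) the planner can jump straight to one entity and range-scan its recent rows, whereas with (time, entity_id) it has to walk the whole time range and filter by entity, so the entity-first order usually wins for equality-on-entity queries.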