Open freol35241 opened 3 years ago
Hey! Are you still looking for help? I did some experimentation with my HA instance.
Table "public.ltest"
Column | Type | Collation | Nullable | Default
---------------------+--------------------------+-----------+----------+---------
time | timestamp with time zone | | not null |
entity_id | text | | not null |
friendly_name | text | | |
device_class | text | | |
state | text | | not null |
state_class | text | | |
unit_of_measurement | text | | |
attributes | jsonb | | not null |
location | geometry(Point,4326) | | |
Compression settings: segmentby = 'device_class, entity_id', orderby = 'time'. The extra columns are extracted out of attributes since they are frequent.
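A minimal sketch of how compression settings like these could be applied (table name ltest and the 7-day policy interval are assumptions based on the schema shown above):

```sql
-- Enable native compression on the hypertable (sketch; names assumed).
ALTER TABLE ltest SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_class, entity_id',
    timescaledb.compress_orderby   = 'time'
);

-- Automatically compress chunks older than 7 days (interval is an assumption).
SELECT add_compression_policy('ltest', INTERVAL '7 days');
```

The segmentby columns become single-column indexes on the compressed chunks, which is why the choice of device_class and entity_id matters for query performance.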
Final database is 38 MB for 4.2M metrics, a 98% compression rate, ~6 MB/week. I don't expect to need a data retention policy for quite some time! I'm still experimenting with indexes on device_class, entity_id and device_class. Note that compressed chunks are only indexed on these three columns since the rest is compressed.
I also tried to add computed columns state_bool and state_float with a custom try_cast_float function (returns float or null), but it caused some problems with ltss compatibility, so for now I have a simple view that adds these columns.
@Yamakaky Very interesting, thanks for reporting back!
Can you please clarify:
I'm still experimenting with indexes on device_class, entity_id and device_class. Note that compressed chunks are only indexed on these three columns since the rest is compressed.
Did you mean device_class, entity_id and time rather than device_class, entity_id and device_class?
Also, for my understanding of your tests: since the compressed chunks are not indexed on the additional columns you extract from attributes (except device_class), what is the benefit of extracting these (except convenience)?
My main issue with the compression feature of timescaledb (when I first had a look at it) was that schema changes were not supported at all. Do you know if this has changed for the better?
No, I meant indexes with different combinations of these three columns + time. For now I haven't done much more on that subject since the indexes mostly depend on which queries are used.
Yes, the compressed chunks only have single-column indexes on the group_by columns. The other indexes are only valid for the uncompressed days. This is still useful; for example, I have a view with time > now() - interval '7 days' with the last value for every entity.
Some things are supported, but yes schema changes are limited: https://docs.timescale.com/timescaledb/latest/how-to-guides/compression/modify-a-schema/. However, I think the current schema is appropriate. I couldn't find another attribute that would qualify to be promoted as a column.
My current schema:
homeassistant=# \d ltest
Table "public.ltest"
Column | Type | Collation | Nullable | Default
---------------------+--------------------------+-----------+----------+---------
time | timestamp with time zone | | not null |
entity_id | text | | not null |
device_class | text | | |
state | text | | not null |
state_class | text | | |
attributes | jsonb | | not null |
location | geometry(Point,4326) | | |
friendly_name | text | | |
unit_of_measurement | text | | |
Indexes:
"ltest_device_class_time_idx1" btree (device_class, "time") WHERE device_class IS NOT NULL
"ltest_entity_id_time_idx" btree (entity_id, "time" DESC)
"ltest_time_idx" btree ("time")
homeassistant=# \d+ last_values
View "public.last_values"
Column | Type | Collation | Nullable | Default | Storage | Description
---------------------+--------------------------+-----------+----------+---------+----------+-------------
time | timestamp with time zone | | | | plain |
entity_id | text | | | | extended |
device_class | text | | | | extended |
state | text | | | | extended |
state_class | text | | | | extended |
attributes | jsonb | | | | extended |
location | geometry(Point,4326) | | | | main |
friendly_name | text | | | | extended |
unit_of_measurement | text | | | | extended |
View definition:
WITH mat AS MATERIALIZED (
SELECT DISTINCT ON (ltest.entity_id) ltest."time",
ltest.entity_id,
ltest.device_class,
ltest.state,
ltest.state_class,
ltest.attributes,
ltest.location,
ltest.friendly_name,
ltest.unit_of_measurement
FROM ltest
WHERE ltest."time" > (now() - '7 days'::interval)
ORDER BY ltest.entity_id, ltest."time" DESC
)
SELECT mat."time",
mat.entity_id,
mat.device_class,
mat.state,
mat.state_class,
mat.attributes,
mat.location,
mat.friendly_name,
mat.unit_of_measurement
FROM mat;
Stats:
Total size: 525 MB
Compression ratio: 98.2%
Metrics: 7380657
Size per month (compressed): 24.5 MB
Bytes per metric (compressed): 9.23 B
Bytes per metric (uncompressed): 263 B
With this query:
with table_size as (
SELECT hypertable_size
FROM hypertable_size('$table_name')
), waterline as (
select max(range_end) as range_end, min(range_start) as range_start
FROM timescaledb_information.chunks natural join chunk_compression_stats('$table_name')
where is_compressed
), metrics as (
select
count(*) filter (where time < range_end) as count_compressed,
count(*) filter (where time > range_end) as count_uncompressed
from $table_name, waterline
), size as (
select
sum(after_compression_total_bytes) as size_compressed
from chunk_compression_stats('$table_name')
)
select
count_uncompressed + count_compressed as "Metrics",
size_compressed::float / extract(epoch FROM range_end - range_start) * 86400 * 30 as "Size per month (compressed)",
size_compressed::float / count_compressed as "Bytes per metric (compressed)",
(hypertable_size::float - size_compressed) / count_uncompressed as "Bytes per metric (uncompressed)"
from table_size, metrics, size, waterline
I also have this to parse state as bool and float; it should be a proper column on ltss instead, but it was easier to work with without modifying this module:
homeassistant=# \d+ ltestv
View "public.ltestv"
Column | Type | Collation | Nullable | Default | Storage | Description
---------------------+--------------------------+-----------+----------+---------+----------+-------------
time | timestamp with time zone | | | | plain |
entity_id | text | | | | extended |
device_class | text | | | | extended |
state | text | | | | extended |
state_class | text | | | | extended |
attributes | jsonb | | | | extended |
location | geometry(Point,4326) | | | | main |
friendly_name | text | | | | extended |
unit_of_measurement | text | | | | extended |
state_float | double precision | | | | plain |
state_bool | boolean | | | | plain |
View definition:
SELECT ltest."time",
ltest.entity_id,
ltest.device_class,
ltest.state,
ltest.state_class,
ltest.attributes,
ltest.location,
ltest.friendly_name,
ltest.unit_of_measurement,
try_cast_float(ltest.state) AS state_float,
try_cast_boolean(ltest.state) AS state_bool
FROM ltest;
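The try_cast_float and try_cast_boolean helpers used by the view are not shown in the thread; a minimal sketch of what they could look like (PL/pgSQL, exact behavior is an assumption beyond "returns float or null"):

```sql
-- Returns the state as a float, or NULL if it does not parse (sketch).
CREATE OR REPLACE FUNCTION try_cast_float(v text) RETURNS double precision
LANGUAGE plpgsql IMMUTABLE AS $$
BEGIN
    RETURN v::double precision;
EXCEPTION WHEN others THEN
    RETURN NULL;
END;
$$;

-- Returns true/false for 'on'/'off' states, NULL otherwise (sketch).
CREATE OR REPLACE FUNCTION try_cast_boolean(v text) RETURNS boolean
LANGUAGE sql IMMUTABLE AS $$
    SELECT CASE lower(v) WHEN 'on' THEN true WHEN 'off' THEN false END;
$$;
```

Marking them IMMUTABLE lets the planner fold constants and also allows use inside indexes or continuous aggregates.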
I just tried to enable compression on mine, and I realized the ltss table has an auto-incrementing primary key, which makes compression kind of useless since it now has to segment on that.
Yes, see my schema just above. The timescaledb docs don't include primary keys, just time.
@Yamakaky Did you need to modify the integration?
No, I just ended up adding an insert trigger on ltss to copy data to my hypertable. It would be better to have native support.
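Such a redirect could look roughly like this (a sketch with assumed names; my_hypertable and its column list are placeholders, not from the thread):

```sql
-- Copy every row inserted into ltss over to a custom hypertable (sketch).
CREATE OR REPLACE FUNCTION copy_ltss() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO my_hypertable (time, entity_id, state, attributes)
    VALUES (NEW.time, NEW.entity_id, NEW.state, NEW.attributes);
    RETURN NEW;
END;
$$;

CREATE TRIGGER trg_copy_ltss
    AFTER INSERT ON ltss
    FOR EACH ROW
    EXECUTE FUNCTION copy_ltss();
```

The integration keeps writing to ltss unchanged; the trigger fans the data out to the optimized table.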
Hey folks, apart from working at Timescale, I'm really interested in this for my own HASS instance. After installing, I wanted to enable compression and realized that due to the compound primary key, the id column needs to be part of the segment_by clause which works against the way compression is implemented (columnar storage).
@Yamakaky is going in the right direction, in terms of removing the id column as well as probably implementing a column per data type. I just recently wrote a best-practices blog post on the topic (https://www.timescale.com/blog/best-practices-for-time-series-data-modeling-narrow-medium-or-wide-table-layout-2/) where this is represented as a medium-wide layout.
Happy to help get this integration to implement the best practices (totally unselfish, I know). I'm just not the best person when it comes to Python code.
Alright, I looked a bit into it and here's my current idea. It isn't necessarily the best option yet (not enough data collected to really see the impact), but in case somebody is interested:
begin transaction;
-- enable ltree extension if not yet available
create extension if not exists ltree;
-- new optimized table
create table ltss_hass (
time timestamptz,
entity_key text not null,
state_text text null,
state_float double precision null,
state_int int8 null,
state_bool bool null,
location geometry(Point, 4326) null,
attributes_key ltree null,
primary key (time, entity_key)
);
create index idx_ltss_hass_location
on ltss_hass using gist (location);
create index idx_ltss_hass_time_entity_key
on ltss_hass using btree (time, entity_key);
select create_hypertable('ltss_hass', 'time', chunk_time_interval := interval '7 days');
alter table ltss_hass set (timescaledb.compress, timescaledb.compress_orderby = 'time, entity_key');
select add_compression_policy('ltss_hass', interval '7 days');
-- extracted attributes table
create table ltss_hass_attributes (
attributes_key ltree,
attributes jsonb not null,
exclude using gist (attributes_key with =)
);
-- save old table
alter table ltss rename to ltss_old;
-- create view trigger to redirect inserts
create or replace function redirect_ltss() returns trigger
language plpgsql as $$
declare
attributes_key ltree = null;
begin
if NEW.attributes is not null then
-- convert_to() produces bytea for sha256(); plain text cannot be cast to bytea directly
attributes_key = text2ltree(NEW.entity_id || '.' || encode(sha256(convert_to(regexp_replace(NEW.attributes::text, '\\n', '', 'ng'), 'UTF8')), 'hex'));
end if;
if attributes_key is not null then
insert into ltss_hass_attributes (attributes_key, attributes) values (attributes_key, NEW.attributes) on conflict do nothing;
end if;
insert into ltss_hass (time, entity_key, state_text, state_float, state_int, state_bool, location, attributes_key)
values (
NEW.time,
NEW.entity_id,
case when NEW.state !~ '^[0-9]*[0-9]?\.[0-9]+$' and NEW.state !~ '^[0-9]+$' and NEW.state !~* 'off' and NEW.state !~* 'on' then NEW.state end,
case when NEW.state ~ '^[0-9]*[0-9]?\.[0-9]+$' then NEW.state::double precision end,
case when NEW.state ~ '^[0-9]+$' then NEW.state::int8 end,
case when NEW.state ~* 'on' then true when NEW.state ~* 'off' then false end,
NEW.location,
attributes_key
);
NEW.id = 1;
return NEW;
end;
$$;
-- recreate old table layout as a view
create or replace view ltss as
select
row_number() over (rows unbounded preceding) as id,
l.time,
l.entity_key as entity_id,
case
when l.state_text is not null then l.state_text
when l.state_float is not null then l.state_float::text
when l.state_int is not null then l.state_int::text
when l.state_bool is true then 'on'
when l.state_bool is false then 'off'
end as state,
l.location,
a.attributes
from ltss_hass l
left join ltss_hass_attributes a on l.attributes_key is not null and l.attributes_key = a.attributes_key;
alter view ltss owner to homeassistant;
-- enable redirect trigger
create trigger trg_redirect_ltss
instead of insert
on ltss for each row
execute function redirect_ltss();
-- migrate old data
insert into ltss (time, entity_id, state, location, attributes)
select time, entity_id, state, location, attributes
from ltss_old;
-- remove old ltss table
drop table ltss_old;
-- finished migration
end transaction;
I've extracted the actual state into a set of 4 separate (nullable) fields which will compress to almost nothing if not used and I also extracted the attributes into a separate table with an ltree key which uses a tree-like structure to find values. The actual key is a sha256 of the attributes content to make sure similar data fields reference the same (already stored) attributes value. It's a basic deduplication. At least for my sensors it seems to shrink the number of jsonb entries quite a bit (~67k rows in ltss_hass reference ~4.8k rows in ltss_hass_attributes).
Thank you all for showing such an interest in this integration!
Be aware, some personal opinions ahead!
My stand on this is: the id column should definitely be removed to support the compression features better. I would very much like to hear your thoughts on this, feel free to criticize.
Thank you for creating the initial integration!
No worries about personal opinions. They exist to be discussed and I'm all too happy to do it!
Deduplication savings using the ltree approach may vary by the available entities. For me opnsense creates a lot of duplicates (since the interface metadata almost never changes), but others do too. It may be my specific set of entities/sensors, but right now I'm at 194k to 15k, which is quite a substantial saving.
@noctarius Here's an example of an entity with a seemingly binary state having more than that: "on", "off", "unknown", "unavailable".
You can't quite model this as a boolean because then you'd only have 3 states, true, false and null.
That is actually a very obvious one. How did I miss this? However, I think those two are general HASS states that can happen for almost any entity, can't they? In that case, it should probably be its own column, along the lines of "was the entity unavailable or unknown? Then it won't have a real value at that time". WDYT?
Hmm. Here's another hypothetical situation. Some user configures a sensor (maybe template) that has 4 defined states, "On", "Off", "Running", "Error". Alongside the home assistant unknown and unavailable states. How would this be mapped?
Or perhaps, someone setting up a sensor with "On", "Off", "Open". What would this be mapped as? true, false, true?
What about other languages? Since state is ANY string value, I'm not entirely sure if
I think the biggest problem is that this would be making assumptions that might not always hold true for a given entity and installation, and potentially losing context encoded in the specific value of the state.
It seems to me like any type inference logic here would be potentially fragile. Especially since template entities / helper entities are a thing that can be configured freely by end users.
To add on my use case: in my instance, I added state_* columns since it quickly became a pain to have casting with handling of invalid values in every query. I mainly use floats (temperature etc...) where a NULL for unknown and unavailable is perfectly fine. For the few entities that have specific behaviors like @ruifung describes, having the raw state available in state_text seems fine. One argument could be made to always populate state_text, even if it matches another state_* column.
If the semantic of an entity changes, then the user will have to handle it anyway I think. I already have that for a few entities.
For me, this all comes down to maintainability of the codebase, and, as such, I don't want to go down the road of extra columns and extra logic (which quite possibly needs to be updated every time someone decides to add a new weird combo of state "types" in text format in some random custom component) without a really good reason. Therefore, I will keep the state as a single text column for now.
@Yamakaky If this is mainly a convenience issue with having to handle casting of invalid values in every query, my suggestion is to write an SQL function for your specific use case.
Sorry for the non-answer. Crazy couple of days.
I see the reason why you want to stay with state as a text column only. I still find it pretty horrible from the HASS side to not provide any metadata for the value, but that's fair; the reasoning is pretty sound. I'll adapt my WIP PR over the next few days, just rebasing it onto the 2.0 release (to be able to get it running again). Switching back from my current data set should be easy enough.
I understand your concern, however I don't think it's specific to my use case. I assume most if not all ltss instances will need to cast to floats at some point for processing and visualisation. Maybe then provide such functions with the extension? Alternatively, a view like in my original post with additional state_* columns would work. IIRC I moved to actual columns since it added a noticeable performance gain, though I'm not sure about the specifics.
I adjusted the draft to only have a single text column again. I wonder if a column with the guessed datatype of the value (text, bool, float) may be interesting? That way it may be easier to filter out valid values and ignore pure text states.
@Yamakaky I could imagine, that a Continuous Aggregate may be interesting for your use case, which collected the data pre-aggregated (depending on how you'd like to aggregate them).
That would work, at the cost of duplicate data between the aggregate and the main table. Data rate is pretty slow on my side so it should be fine.
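A continuous aggregate for that could be sketched roughly as follows (hourly averages are an assumed aggregation; ltest and try_cast_float as in the schema above, and the function must be IMMUTABLE to be usable here):

```sql
-- Hourly average of numeric states, kept up to date by a refresh policy (sketch).
CREATE MATERIALIZED VIEW ltest_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket(INTERVAL '1 hour', time) AS bucket,
    entity_id,
    avg(try_cast_float(state)) AS avg_state
FROM ltest
GROUP BY bucket, entity_id;

-- Refresh the last day of data every hour (offsets are assumptions).
SELECT add_continuous_aggregate_policy('ltest_hourly',
    start_offset      => INTERVAL '1 day',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
```

The aggregate does duplicate data, but only the pre-aggregated buckets, which are typically far smaller than the raw rows.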
I recently started using Home Assistant with TimescaleDB and LTSS and it's working great. However, my database is quickly growing. I tried compression, but with the id column there is nothing to be gained. Then I came across this thread and it looks like the perfect solution, so I was wondering whether there are any plans on incorporating this or whether I should create my own fork? Thanks in advance.
@nwerter I am happily accepting a PR which removes the id column (including migration code for existing databases, tests etc). I just haven't got around to it yet, lots of other things which requires attention...
The id column has now been removed starting from v2.1.0, thanks to #94.
@freol35241 It seems to break my setup with compression enabled. Decompressing all the historical data doesn't seem to be feasible. Any options to still move to 2.1.0 ?
Error during connection setup: (psycopg2.errors.FeatureNotSupported) operation not supported on hypertables that have compression enabled [SQL: ALTER TABLE ltss DROP CONSTRAINT ltss_pkey CASCADE, ADD PRIMARY KEY(time,entity_id);] (Background on this error at: https://sqlalche.me/e/20/tw8g) (retrying in 3 seconds)
@velaar I don't think there's any other way than decompressing and disabling compression on the existing hypertable. As an alternative, you could create a new one and slowly copy the old data over chunk by chunk.
@velaar compressed chunks in Timescale can't be updated or deleted, so updating to 2.1.0, which modifies the structure of the hypertable, can only be done on an uncompressed database. Afterwards you can enable compression again, which should also be much more effective without the id column.
In order to help with the alternative suggested by noctarius, Timescale provides some guidance on backfilling historical data: https://docs.timescale.com/use-timescale/latest/compression/backfill-historical-data/
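If decompressing turns out to be an option after all, it can be scripted per chunk; a sketch, assuming the default ltss table name:

```sql
-- Decompress every compressed chunk of the ltss hypertable (sketch).
SELECT decompress_chunk(c, if_compressed => true)
FROM show_chunks('ltss') AS c;

-- Remove the compression policy so chunks stay uncompressed during the migration.
SELECT remove_compression_policy('ltss');
```

Depending on data volume this can take a long time and temporarily needs disk space for the uncompressed chunks.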
@velaar (and all others) You can work around it by removing the id column manually; then the migration will not run. A primary key is not required after all. Just make sure you have a compound index on entity_id and time.
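The manual workaround could be sketched like this (the index name is an assumption; depending on your TimescaleDB version, dropping the column may still require decompressing affected chunks first):

```sql
-- Drop the auto-increment primary key and id column so the 2.1.0
-- migration is skipped (sketch).
ALTER TABLE ltss DROP CONSTRAINT IF EXISTS ltss_pkey;
ALTER TABLE ltss DROP COLUMN IF EXISTS id;

-- Keep a compound index for typical per-entity, time-range queries.
CREATE INDEX IF NOT EXISTS ltss_entity_id_time_idx
    ON ltss (entity_id, time DESC);
```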
Side discussion/question: I manually created a compound index on (entity_id, time). #94 creates an index on (time, entity_id). I'm wondering if there is a difference in performance. My motivation for choosing entity_id first is that basically every query filters on entity_id but not necessarily on time.
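One way to settle this for a given workload is to compare plans for a representative query under each index; a sketch (the entity name is a made-up example):

```sql
-- See which index the planner picks for a typical per-entity query (sketch).
EXPLAIN (ANALYZE, BUFFERS)
SELECT time, state
FROM ltss
WHERE entity_id = 'sensor.outdoor_temperature'  -- assumed example entity
  AND time > now() - INTERVAL '7 days'
ORDER BY time DESC;
```

In general, with (entity_id, time) the planner can jump straight to one entity and range-scan its recent rows, whereas with (time, entity_id) it has to walk the whole time range and filter by entity, so the entity-first order usually wins for equality-on-entity queries.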