Expaso / hassos-addons

Home Assistant Add-ons
Apache License 2.0

Pinned: Additional Documentation for Timescaledb Integration #1

Open MACscr opened 4 years ago

MACscr commented 4 years ago

Would really appreciate some additional documentation on actually integrating it with Home Assistant. At its default installation state, it simply installs the server. I'd love additional information on getting it fully working with Home Assistant (I guess LTSS is required too?) and on replacing the recorder if possible (does it currently go to both?). Some examples of actually using that data within HA would be great too.

Please don't take these as complaints, as I definitely appreciate what was done here. I just need a bit more information on how to use this special integration after being sold on how good it is. =)

Expaso commented 3 years ago

Hello MACscr!

Thank you for your suggestion!

Basic Setup

The TimescaleDB addon is basically a PostgreSQL database server with the TimescaleDB extensions on top of it.

Depending on what you would like to achieve, you will indeed need extra components. It's a bit like mixing and matching clothes: depending on how you want to look, you pick your shirt and shoes. This makes it a little difficult to write a tutorial on how you should dress yourself, because it's in good part personal taste.

What I can do is describe my own setup.

First, you need this addon. It will install a TimescaleDB-enabled SQL database for you. Second, you need to feed this thing with data. Third, you need ways to retrieve that data and do cool stuff with it.

I imagine you already did step 1. So now you need some data flowing in...

What I did was:

1) Instruct the recorder to send all its data to Timescale, so we gain performance in parts of Home Assistant like the history browser and other charts. For this, you simply need to point your recorder connection string at TimescaleDB. This is done by using this connection string for the recorder: postgresql://homeassistant:yourpasswordhere@77b2833f-timescaledb/homeassistant

This will store all data of Home Assistant in your TimescaleDB instance. [screenshot]
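If the homeassistant database and user do not exist yet on the server, you may need to create them first. A minimal sketch (run as the postgres superuser; the names and password are simply the ones used in the connection string above):

CREATE USER homeassistant WITH PASSWORD 'yourpasswordhere';
CREATE DATABASE homeassistant OWNER homeassistant;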

Home Assistant will use these tables, but you can also query that data yourself.
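For example, a quick sanity check against the recorder's tables could look like this (a sketch; the exact recorder schema depends on your Home Assistant version, newer versions move entity_id into a separate states_meta table):

-- Show the 10 most recently recorded states (older recorder schema,
-- where the states table still carries entity_id and last_updated).
SELECT entity_id, state, last_updated
  FROM states
 ORDER BY last_updated DESC
 LIMIT 10;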

2) Use LTSS for long-term storage of statistics data. For this, please see the LTSS documentation. You can use the exact same connection string as in the previous step. This will feed data about sensor changes into the LTSS table:

[screenshot]

Now that you have data, you can use it to create charts, for example about the charging state of my car. I use Grafana for this.

My dashboard looks like this: [screenshot]

The query behind the top-left chart, for example:

[screenshot]
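As an illustration, a time-series query behind such a panel could look roughly like this (the entity name is hypothetical; state_numeric is the helper column described below):

-- Plot a numeric sensor over the last 24 hours.
SELECT "time", state_numeric AS value
  FROM ltss
 WHERE entity_id = 'sensor.car_battery_level'
   AND "time" > now() - INTERVAL '24 hours'
 ORDER BY "time";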

Please use the Grafana documentation on how to create charts, and you need some knowledge of SQL of course. The extra column 'state_numeric' is a computed column I created to aid charts that work on numeric states. You can create this computed column using this script:

-- Returns TRUE when the given text can be cast to NUMERIC, FALSE otherwise.
CREATE OR REPLACE FUNCTION public.isnumeric(
    text)
    RETURNS boolean
    LANGUAGE 'plpgsql'
    COST 100
    IMMUTABLE STRICT PARALLEL UNSAFE
AS $BODY$
DECLARE x NUMERIC;
BEGIN
    x = $1::NUMERIC;
    RETURN TRUE;
EXCEPTION WHEN others THEN
    RETURN FALSE;
END;
$BODY$;

ALTER FUNCTION public.isnumeric(text)
    OWNER TO postgres;

and then

ALTER TABLE public.ltss
    ADD COLUMN state_numeric double precision GENERATED ALWAYS AS (
CASE
    WHEN isnumeric((state)::text) THEN (state)::double precision
    ELSE NULL::double precision
END) STORED;

I coupled Grafana to Timescale with the following config:

[screenshot]

Please note that I have created a special grafana user here, for security reasons. You have to create users yourself, following the PostgreSQL documentation.
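A minimal sketch of what creating such a read-only Grafana user could look like (name and password are examples):

-- Create a read-only user for Grafana.
CREATE USER grafana WITH PASSWORD 'choose-a-strong-password';
GRANT CONNECT ON DATABASE homeassistant TO grafana;
GRANT USAGE ON SCHEMA public TO grafana;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO grafana;
-- Also cover tables created later:
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO grafana;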

This is only one of many use cases for TimescaleDB, but it's a real-world example. It should give you a starting point from which you can dig deeper into your data and create nice dashboards or alerting.

Does this at least partially answer your question?

Reducing the size of the database

To save storage space, you basically have 3 options (SQL sketches for the first two follow after this list):

1) Compress the data. You have just proven that it can even beat Influx.

Pros:

- Very easy
- Instant space savings

Cons:

- Compressed data cannot be updated. Normally not a problem (because Influx cannot update at all!), but should you have the need, you must first decompress the chunk, update it, and re-compress it afterwards.
- A little bit of overhead when querying compressed data (depending on the type of query; some are even faster!).

2) Remove data you don't need (like wiping the attributes column to NULL).

Pros:

- Space savings
- Easy to do

Cons:

- Should you need that data in the future, bad luck.

3) Downsample your data, using Continuous Aggregates.

Pros:

- Saves a LOT of space.
- Will speed up queries over large timespans tremendously.

Cons:

- No more access to all measurements, only the chosen aggregate values, like min/max/avg/mean within the chosen timeframe (like 5 minutes/hour/day/month/etc).
- You have to think through your data needs upfront, because you are basically replacing a large set of old data with 1 record containing aggregated values. There is no way to get your original data back if you deleted it to save space.

Now, you could even combine all 3 methods in a mix-and-match style if you like, depending on your hunger for data. Maybe start with compression first, and create some continuous aggregates to speed up queries spanning thousands of records. That way, you have a nice balance between performance and not losing any data.

With that in place, you could start dropping old chunks of data to save space if needed. Clearing the attributes column is more difficult with compression in place, but still possible.
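To make options 1 and 2 concrete, here are sketches of the corresponding SQL (assuming the TimescaleDB 2.x API; the intervals are only examples):

-- Option 1: enable native compression on the ltss hypertable,
-- segmented by entity, and compress chunks older than 30 days.
ALTER TABLE ltss SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'entity_id');
SELECT add_compression_policy('ltss', INTERVAL '30 days');

-- Option 2: wipe the attributes column for old rows
-- (do this before those rows are compressed).
UPDATE ltss SET attributes = NULL
 WHERE "time" < now() - INTERVAL '90 days';

-- Dropping entire chunks of old data, should you ever want to:
SELECT drop_chunks('ltss', INTERVAL '2 years');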

The good news: lots of options, freedom everywhere :slight_smile:

MACscr commented 3 years ago

Thanks so much for the detailed response! Ah, I really have no interest in using Grafana at the moment (that could easily change), but I do like the ability to do queries from Home Assistant, and https://www.home-assistant.io/integrations/sql/ does allow me to create some custom stuff. I really wish the awesome mini graph cards allowed us to supply exactly what we want them to graph, versus relying on entity histories, etc. LTSS is really only useful for fetching from something like Grafana or manual queries, right?

Expaso commented 3 years ago

LTSS is almost the same as the recorder. In fact LTSS was created by looking at the recorder's code.

The recorder is used for ALL state that has to be saved, and its history, including events in the system, switch states, and so forth. LTSS can be (but does not have to be) used alongside the recorder to focus a little bit more on long-term sensor trends, like temperature charts or voltage measurements. It's created to be more in line with Timescale.

A simple 'get me all outside temperatures for today' query can be satisfied by both. But... the recorder cleans up after x days; LTSS won't (because you don't want it to delete your precious data).

" I really wish the awesome mini graph cards allowed us to supply exactly what we wanted it to graph versus relying on entity histories" -Me too! A simple SQL-Graph card.

"LTSS is really only useful for fetching from something like grafana or manual queries, right?" Correct. It's just a bridge between HA and the Database, nothing more. It will get your data in, but not out. You have query it out yourself.

yoheines commented 3 years ago

This is really useful, thank you! Consider adding it to the README.md.

bingo6 commented 3 years ago

@Expaso Thank you for the detailed description, which has given me some food for thought. However, I still have a few questions :)

What is the advantage of the additional numeric column? I see that this is needed to use "Format as Time Series" in Grafana. But I don't understand yet what the advantage is compared to "Format as Table".

Doesn't the additional column also lead to more needed memory? (Are the numeric states stored twice? Once in state and once in state_numeric.)

And: does your script transfer the past states or only the future states into the new column state_numeric? (Sorry, I am not so good at SQL yet.)

Expaso commented 3 years ago

Hi @bingo6 !

You're very welcome! I will try to answer your questions one by one:

"What is the advantage of the additional numeric column? I see that this is needed to use "Format as Time Series" in Grafana. But I don't understand yet what is the advantage compared to "Format as Table"?"

Home Assistant works with states. A state can be anything: a temperature, the day of the week, the location of your phone, etc. Because of this, the data is stored in a text column, so any value can be inserted. Now, when you come to the point that you want to create charts for a specific sensor, say you want to plot the average temperature of your living room, you first need to convert the state to a number so you can do math on it (calculating the average). You could convert this column to a number on the fly in your query, but this comes at a cost in the form of CPU.

Because I'm running Timescale on a Raspberry Pi 4 with an SSD attached, I chose to convert the state to a numeric value upfront, during insert of the records, using this calculated column. This introduces a little bit of CPU overhead during insert, but enables queries to be much, much faster. Also, the queries themselves can be simpler because the extra conversion is no longer needed.
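For comparison, this is roughly the on-the-fly conversion that the calculated column avoids (entity name hypothetical):

-- Cast the text state to a number at query time; the regex guards
-- against non-numeric states, and the cast costs CPU on every row.
SELECT "time", state::double precision AS value
  FROM ltss
 WHERE entity_id = 'sensor.livingroom_temperature'
   AND state ~ '^[-+]?[0-9]*\.?[0-9]+$'
 ORDER BY "time";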

_"Doesn't the additional column also lead to more needed memory? (Are the numeric states stored twice? Once in state and once in statenummeric)."

Yes, this is indeed true. The calculated column is an extra column which has to be stored, so you need more storage space. Since I'm running off an SSD, I have plenty of space, so no problem there. Storage is cheap; I'd rather have a snappy dashboard with near-instant performance. That said, the extra storage overhead is not large; it's only a few bytes extra per state. In fact, if you were to shorten the names of your entities by 1 character in Home Assistant, you would already make up for the extra space needed, because of the savings in the attributes column 👍 If you are really concerned about space, you could also choose to wipe the attributes column altogether (after insert, with a trigger for example, if you are not using it); a sketch follows below. It will save you more than 50% of the total size!
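A minimal sketch of such a trigger (only do this if you really never need the attributes, because they are gone for good):

-- Blank out the attributes column on every insert into ltss.
CREATE OR REPLACE FUNCTION public.wipe_ltss_attributes()
RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    NEW.attributes := NULL;
    RETURN NEW;
END;
$$;

CREATE TRIGGER ltss_wipe_attributes
    BEFORE INSERT ON public.ltss
    FOR EACH ROW EXECUTE FUNCTION public.wipe_ltss_attributes();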

So, big gain, little loss.

_"And: Does your script transfer the past states or only the future state into the new column statenummeric? (Sorry I am not so good in SQL yet). "

All states, past and future. The generated column is computed from the 'state' field. If you update or insert that field (old or new data), state_numeric will follow that change immediately.

bingo6 commented 3 years ago

Thank you for the detailed answer. It convinced me completely, so I also added the column :)

Just by compressing the data, storage space is no longer a problem either, and as you already pointed out, there are other options to reduce it.

I also think that adding the hints to the README.md would be useful.

jli113 commented 2 years ago

@Expaso Can you please point out where to change this timeframe? I tried to find it, but couldn't. "No more access to all measurements, but only the chosen aggregates values, like min/max/avg/mean within the chosen timeframe (like 5 minute/hour/day/month/etc)."

Expaso commented 2 years ago

Hi @jli113 ,

The timeframes are chosen by you; it's how you set up your continuous aggregates. I have created 2 myself:

ltss_hourly and ltss_daily.

With both I have enough data to balance detail vs performance.

For these in-depth usage questions, it's best that I refer you to the excellent documentation of TimescaleDB itself:

https://docs.timescale.com/timescaledb/latest/overview/core-concepts/continuous-aggregates/#quick-start

It's key to understand these principles to get the most out of timescale.

jli113 commented 2 years ago

Thanks~

Davincible commented 2 years ago

To actually get the TimescaleDB advantage, do we have to manually convert the events, ltss, recorder_runs and states tables to hypertables?

Expaso commented 1 year ago

Hi @Davincible,

Home Assistant tables like recorder_runs and states are just normal tables. I would suggest leaving them like that, because the HA recorder also deletes data from those tables.

That's why the LTSS integration is there: to store data for a longer time.

The LTSS table is therefore created as a hypertable automatically by the LTSS integration, so you don't have to convert it manually.
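You can verify this yourself; on TimescaleDB 2.x, something like this lists all hypertables:

-- The ltss table should show up here once the LTSS integration has run.
SELECT hypertable_name FROM timescaledb_information.hypertables;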

edouardkleinhans commented 1 year ago

@Expaso , can you share your ltss_hourly and ltss_daily materialized views?

Expaso commented 1 year ago

@Expaso , can you share your ltss_hourly and ltss_daily materialized views?

Sure! Please read also: https://docs.timescale.com/api/latest/continuous-aggregates/create_materialized_view/

CREATE MATERIALIZED VIEW ltss_summary_hourly( entity_id, bucket, state_avg, state_max, state_min, state_count)
WITH (timescaledb.continuous) AS
 SELECT ltss.entity_id,
    time_bucket('1h'::interval, ltss."time") AS bucket,
    avg(ltss.state_numeric) AS state_avg,
    max(ltss.state_numeric) AS state_max,
    min(ltss.state_numeric) AS state_min,
    count(ltss.state_numeric) AS state_count
   FROM ltss
  GROUP BY ltss.entity_id, (time_bucket('1h'::interval, ltss."time"));

and

 CREATE MATERIALIZED VIEW ltss_summary_daily( entity_id, bucket, state_avg, state_max, state_min, state_count)
WITH (timescaledb.continuous) AS
 SELECT ltss.entity_id,
    time_bucket('1d'::interval, ltss."time") AS bucket,
    avg(ltss.state_numeric) AS state_avg,
    max(ltss.state_numeric) AS state_max,
    min(ltss.state_numeric) AS state_min,
    count(ltss.state_numeric) AS state_count
   FROM ltss
  GROUP BY ltss.entity_id, (time_bucket('1d'::interval, ltss."time"));
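One thing to keep in mind (see the TimescaleDB documentation linked above): a continuous aggregate only stays up to date if you also add a refresh policy. A sketch for the hourly view, with example intervals:

SELECT add_continuous_aggregate_policy('ltss_summary_hourly',
    start_offset      => INTERVAL '3 hours',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');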

wuast94 commented 7 months ago

would be nice to have an example dashboard to import 😊

luckylinux commented 2 months ago

@Expaso

I just discovered this very interesting project :+1:.

I was about to convert my SQLite database to PostgreSQL, but then I thought... maybe I should just convert it to TimescaleDB-HA instead.

Just to clarify though ...

With this integration, are there 2 databases being used, or is it only one TimescaleDB-HA database for everything?

Thanks in advance for the clarification :+1:.

Expaso commented 2 months ago

Hi @luckylinux ,

Basically, this addon adds a PostgreSQL server to your setup, including the TimescaleDB extensions. Now, HA uses SQLite by default, which is fine. It uses that database for the recorder component and statistics; all the graphs and historical data you see in HA come from this database.

Now, the recorder component does not store your data indefinitely. It has a set horizon of a few days in most cases, after which old data is discarded.

The first logical step is to move this recorder database over from SQLite to this addon's PostgreSQL instance. Running SQLite AND Postgres is just a waste of resources. But... moving it to this addon also opens up more possibilities to query around your data. It makes dashboarding and monitoring with 3rd-party tools like Grafana or Prometheus easier.

Now, since one of the goals of this addon is to also provide a means of storing historical data, we need more than 3 days' worth of data, right? That's where the second database comes in. This database is fed by the LTSS component. This component is basically a clone of the recorder component, but it is TimescaleDB-aware and does not remove old data.

Using Timescale's hypertables, you can store vast amounts of data without much pain. For instance, I have never had to delete ANY data since I created this component a few years ago, and I happily query around. The database is 170GB or so. This is WITH Timescale compression, but that is transparent.
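If you are curious how big yours is, TimescaleDB can tell you (a sketch, assuming TimescaleDB 2.x):

-- Total on-disk size of the ltss hypertable, compression included.
SELECT pg_size_pretty(hypertable_size('ltss'));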

So having both databases there is just the logical thing to do. They store somewhat overlapping data for a few days, but use it for different purposes. Your HA database is for operational HA things, and the TimescaleDB database is for statistical analysis.

luckylinux commented 2 months ago

@Expaso

Thank you for your reply.

But since it's an "addon", I guess it's not really applicable to my case, where I am running Home Assistant myself under Podman (Docker). I'm NOT using Hassio ...

So basically you are telling me that I should create/have 2 database servers/containers in the end:

On top of that, I also have a homeassistant-server container of course.

Does this sound correct?

Expaso commented 2 months ago

@luckylinux , almost..

and of course, on top of that, the homeassistant-server container.

luckylinux commented 2 months ago

@Expaso

Is there some difference between actually running the container standalone (e.g. docker pull husselhans/hassos-addon-timescaledb-amd64:latest) and what I'm doing (e.g. docker pull timescale/timescaledb-ha:pg16-all), besides the fact that I will have 2 containers running (which IMHO could be useful for maintenance)?

I mean... maybe I don't understand at all, but it seems to me like the timescaledb container is pretty standard/stock. The main part of the project is to explain the configuration of Home Assistant in order to take advantage of it. And if I understand correctly, that is the purpose of this issue thread.

No offense, I'm just trying to understand :100: .

Expaso commented 2 months ago

The purpose of this thread is to provide information about my timescaledb-addon and how to use it with home-assistant.

First, I wouldn't pull latest; please use a version tag. This makes sure that you have a chance to read the release notes before I launch a new major version and possibly make breaking changes. I hate ruining your day ;)

These are the major differences between the image you mention and this addon-image:

But please choose as you like. It's open-source, right?

luckylinux commented 2 months ago

@Expaso

That's quite clear, thanks :+1: .

So, if I understand correctly, your "addon" is not so much an "addon" (at least not for me, if I run it as a container), but rather a "stock" timescaledb-ha (or timescaledb?) with some helper scripts and additional programs/toolkits installed.

Just to clarify: it's basically not something Home Assistant-specific, but rather a "normal" (generic) TimescaleDB with extra packages. Correct? Maybe I was confused by the name of the image after all ...

I am fine running your container, I'd just like to understand it better.

Right now, you could argue I don't even have a backup strategy for other containers' PostgreSQL databases (except ZFS snapshots, that is).

So of course I am open to options ;).

I wonder then why postgres and timescaledb-ha do not include the toolkit and pgAgent by default ...

EDIT: alright, this is more where I should be looking for the container details: https://github.com/Expaso/hassos-addon-timescaledb/releases rather than this GitHub repository itself.

Expaso commented 2 months ago

Hahaha yes, that is indeed the place to be!

My addon IS a real Home Assistant addon with all the bells and whistles, but you can also run it standalone, simply because some people want to run it in plain Docker or Kubernetes or whatever. It's a feature I added in the latest release, because some people run Home Assistant on a Raspberry Pi but want to run this addon on beefier hardware. It is a fully capable database server, as a matter of fact.

luckylinux commented 2 months ago

I finally managed to get my migration script kind of working now ...

Some very tricky-to-diagnose DNS name resolution behaviour in Podman made me lose a few days :upside_down_face:. I might need to clean up the repos quite a bit ...

https://github.com/luckylinux/migrate-sql-database#general

Now I need to decide which Database to use :laughing:.

42Network commented 1 month ago

Hi. I believe there are two small errors in your materialized views: state_min is missing in the list of fields, resulting in an obscure error: ERROR: column "state_count" specified more than once

With state_min added, the error is gone.

CREATE MATERIALIZED VIEW ltss_summary_hourly( entity_id, bucket, state_avg, state_max, state_min, state_count)
WITH (timescaledb.continuous) AS
 SELECT ltss.entity_id,
    time_bucket('1h'::interval, ltss."time") AS bucket,
    avg(ltss.state_numeric) AS state_avg,
    max(ltss.state_numeric) AS state_max,
    min(ltss.state_numeric) AS state_min,
    count(ltss.state_numeric) AS state_count
   FROM ltss
  GROUP BY ltss.entity_id, (time_bucket('1h'::interval, ltss."time"));

CREATE MATERIALIZED VIEW ltss_summary_daily( entity_id, bucket, state_avg, state_max, state_min, state_count)
WITH (timescaledb.continuous) AS
 SELECT ltss.entity_id,
    time_bucket('1d'::interval, ltss."time") AS bucket,
    avg(ltss.state_numeric) AS state_avg,
    max(ltss.state_numeric) AS state_max,
    min(ltss.state_numeric) AS state_min,
    count(ltss.state_numeric) AS state_count
   FROM ltss
  GROUP BY ltss.entity_id, (time_bucket('1d'::interval, ltss."time"));

Expaso commented 1 month ago

@42Network You are totally right! Thank you for your fix!! I have edited the original post!