Icinga / icinga-notifications

Icinga Notifications: new and improved notifications and incident management for Icinga (work in progress, not ready for production yet)
GNU General Public License v2.0
9 stars 0 forks source link

Incremental Configuration Updates #191

Closed oxzi closed 2 months ago

oxzi commented 5 months ago

Previously, the entire configuration stored in the database was synchronized every second. With the growth of configurations in live environments on the horizon, this would simply not scale well. This brings us to incremental updates.

By introducing two new columns - "changed_at" as a Unix millisecond timestamp and "deleted" as a boolean - for all tables referenced in the ConfigSet structure, SQL queries can be modified to retrieve only those rows with a more recent timestamp. The "deleted" column became necessary to detect disappearances, since the synchronization now only takes newer items into account. Some additional fields needed to be added to the ConfigSet to track relationships.

Even though the codebase served well at the time, there was some code that did almost the same thing as other code, just in different ways. So a huge refactoring was done. This resulted in an internal generic function that handles all synchronization with custom callbacks.

The web counterpart is being developed in https://github.com/Icinga/icinga-notifications-web/pull/187.

Closes #5.

oxzi commented 5 months ago

After talking with @sukhwinder33445 about this change as he implements the web counterpart, I have removed the incr_cfg_delete function and its trigger usage, rewriting a DELETE to an UPDATE setting deleted = 'y' on the main configuration tables.

For historical reasons and if it might become relevant again, I have attached the diff below.

incr_cfg_delete diff ```diff diff --git a/schema/pgsql/schema.sql b/schema/pgsql/schema.sql index a5abfea..08f5406 100644 --- a/schema/pgsql/schema.sql +++ b/schema/pgsql/schema.sql @@ -88,6 +88,23 @@ CREATE FUNCTION incr_cfg_bump_changed_at_relation_2nd() -- join_tbl text, foreig END; $$; +-- incr_cfg_delete is a BEFORE TRIGGER for DELETE, which sets changed_at to the current timestamp and raises the deleted +-- column instead of actually deleting the row. +CREATE FUNCTION incr_cfg_delete() + RETURNS trigger + LANGUAGE plpgsql + AS $$ + BEGIN + -- Cannot change OLD's attributes, as this would require returning OLD which implies continuing with deletion. + EXECUTE format(' + UPDATE %s + SET changed_at = EXTRACT(EPOCH FROM NOW()) * 1000, deleted = ''y'' + WHERE id = %s', + TG_TABLE_NAME, OLD.id); + RETURN NULL; + END; + $$; + CREATE TABLE available_channel_type ( type text NOT NULL, name text NOT NULL, @@ -119,6 +136,11 @@ CREATE TRIGGER trg_channel_incr_cfg_update FOR EACH ROW EXECUTE FUNCTION incr_cfg_bump_changed_at(); +CREATE TRIGGER trg_channel_incr_cfg_delete + BEFORE DELETE ON channel + FOR EACH ROW + EXECUTE FUNCTION incr_cfg_delete(); + CREATE TABLE contact ( id bigserial, full_name text NOT NULL, @@ -140,6 +162,11 @@ CREATE TRIGGER trg_contact_incr_cfg_update FOR EACH ROW EXECUTE FUNCTION incr_cfg_bump_changed_at(); +CREATE TRIGGER trg_contact_incr_cfg_delete + BEFORE DELETE ON contact + FOR EACH ROW + EXECUTE FUNCTION incr_cfg_delete(); + CREATE TABLE contact_address ( id bigserial, contact_id bigint NOT NULL REFERENCES contact(id), @@ -160,6 +187,11 @@ CREATE TRIGGER trg_contact_address_incr_cfg_update FOR EACH ROW EXECUTE FUNCTION incr_cfg_bump_changed_at(); +CREATE TRIGGER trg_contact_address_incr_cfg_delete + BEFORE DELETE ON contact_address + FOR EACH ROW + EXECUTE FUNCTION incr_cfg_delete(); + CREATE TABLE contactgroup ( id bigserial, name text NOT NULL, @@ -178,6 +210,11 @@ CREATE TRIGGER trg_contactgroup_incr_cfg_update FOR EACH ROW EXECUTE FUNCTION incr_cfg_bump_changed_at(); +CREATE TRIGGER trg_contactgroup_incr_cfg_delete + BEFORE DELETE ON contactgroup + FOR EACH ROW + EXECUTE FUNCTION incr_cfg_delete(); + -- Changes to contactgroup_member should be notified by an updated contactgroup.changed_at. CREATE TABLE contactgroup_member ( contactgroup_id bigint NOT NULL REFERENCES contactgroup(id), @@ -208,6 +245,11 @@ CREATE TRIGGER trg_schedule_incr_cfg_update FOR EACH ROW EXECUTE FUNCTION incr_cfg_bump_changed_at(); +CREATE TRIGGER trg_schedule_incr_cfg_delete + BEFORE DELETE ON schedule + FOR EACH ROW + EXECUTE FUNCTION incr_cfg_delete(); + CREATE TABLE timeperiod ( id bigserial, owned_by_schedule_id bigint REFERENCES schedule(id), -- nullable for future standalone timeperiods @@ -225,6 +267,11 @@ CREATE TRIGGER trg_timeperiod_incr_cfg_update FOR EACH ROW EXECUTE FUNCTION incr_cfg_bump_changed_at(); +CREATE TRIGGER trg_timeperiod_incr_cfg_delete + BEFORE DELETE ON timeperiod + FOR EACH ROW + EXECUTE FUNCTION incr_cfg_delete(); + -- Changes to timeperiod_entry should be notified by an updated timeperiod.changed_at. CREATE TABLE timeperiod_entry ( id bigserial, @@ -315,6 +362,11 @@ CREATE TRIGGER trg_source_incr_cfg_update FOR EACH ROW EXECUTE FUNCTION incr_cfg_bump_changed_at(); +CREATE TRIGGER trg_source_incr_cfg_delete + BEFORE DELETE ON source + FOR EACH ROW + EXECUTE FUNCTION incr_cfg_delete(); + CREATE TABLE object ( id bytea NOT NULL, -- SHA256 of identifying tags and the source.id source_id bigint NOT NULL REFERENCES source(id), @@ -377,6 +429,11 @@ CREATE TRIGGER trg_rule_incr_cfg_update FOR EACH ROW EXECUTE FUNCTION incr_cfg_bump_changed_at(); +CREATE TRIGGER trg_rule_incr_cfg_delete + BEFORE DELETE ON rule + FOR EACH ROW + EXECUTE FUNCTION incr_cfg_delete(); + -- Changes to rule_escalation should be notified by an updated rule.changed_at. CREATE TABLE rule_escalation ( id bigserial, diff --git a/schema/pgsql/upgrades/025.sql b/schema/pgsql/upgrades/025.sql index ee5f595..3c36de4 100644 --- a/schema/pgsql/upgrades/025.sql +++ b/schema/pgsql/upgrades/025.sql @@ -52,6 +52,21 @@ CREATE FUNCTION incr_cfg_bump_changed_at_relation_2nd() -- join_tbl text, foreig END; $$; +CREATE FUNCTION incr_cfg_delete() + RETURNS trigger + LANGUAGE plpgsql + AS $$ + BEGIN + -- Cannot change OLD's attributes, as this would require returning OLD which implies continuing with deletion. + EXECUTE format(' + UPDATE %s + SET changed_at = EXTRACT(EPOCH FROM NOW()) * 1000, deleted = ''y'' + WHERE id = %s', + TG_TABLE_NAME, OLD.id); + RETURN NULL; + END; + $$; + ALTER TABLE channel ADD COLUMN changed_at bigint NOT NULL DEFAULT EXTRACT(EPOCH FROM NOW()) * 1000, ADD COLUMN deleted boolenum NOT NULL DEFAULT 'n'; @@ -98,21 +113,41 @@ CREATE TRIGGER trg_channel_incr_cfg_update FOR EACH ROW EXECUTE FUNCTION incr_cfg_bump_changed_at(); +CREATE TRIGGER trg_channel_incr_cfg_delete + BEFORE DELETE ON channel + FOR EACH ROW + EXECUTE FUNCTION incr_cfg_delete(); + CREATE TRIGGER trg_contact_incr_cfg_update BEFORE INSERT OR UPDATE ON contact FOR EACH ROW EXECUTE FUNCTION incr_cfg_bump_changed_at(); +CREATE TRIGGER trg_contact_incr_cfg_delete + BEFORE DELETE ON contact + FOR EACH ROW + EXECUTE FUNCTION incr_cfg_delete(); + CREATE TRIGGER trg_contact_address_incr_cfg_update BEFORE INSERT OR UPDATE ON contact_address FOR EACH ROW EXECUTE FUNCTION incr_cfg_bump_changed_at(); +CREATE TRIGGER trg_contact_address_incr_cfg_delete + BEFORE DELETE ON contact_address + FOR EACH ROW + EXECUTE FUNCTION incr_cfg_delete(); + CREATE TRIGGER trg_contactgroup_incr_cfg_update BEFORE INSERT OR UPDATE ON contactgroup FOR EACH ROW EXECUTE FUNCTION incr_cfg_bump_changed_at(); +CREATE TRIGGER trg_contactgroup_incr_cfg_delete + BEFORE DELETE ON contactgroup + FOR EACH ROW + EXECUTE FUNCTION incr_cfg_delete(); + CREATE TRIGGER trg_contactgroup_member_changed_at_relation AFTER INSERT OR UPDATE OR DELETE ON contactgroup_member FOR EACH ROW @@ -123,11 +158,21 @@ CREATE TRIGGER trg_schedule_incr_cfg_update FOR EACH ROW EXECUTE FUNCTION incr_cfg_bump_changed_at(); +CREATE TRIGGER trg_schedule_incr_cfg_delete + BEFORE DELETE ON schedule + FOR EACH ROW + EXECUTE FUNCTION incr_cfg_delete(); + CREATE TRIGGER trg_timeperiod_incr_cfg_update BEFORE INSERT OR UPDATE ON timeperiod FOR EACH ROW EXECUTE FUNCTION incr_cfg_bump_changed_at(); +CREATE TRIGGER trg_timeperiod_incr_cfg_delete + BEFORE DELETE ON timeperiod + FOR EACH ROW + EXECUTE FUNCTION incr_cfg_delete(); + CREATE TRIGGER trg_timeperiod_entry_changed_at_relation AFTER INSERT OR UPDATE OR DELETE ON timeperiod_entry FOR EACH ROW @@ -143,11 +188,21 @@ CREATE TRIGGER trg_source_incr_cfg_update FOR EACH ROW EXECUTE FUNCTION incr_cfg_bump_changed_at(); +CREATE TRIGGER trg_source_incr_cfg_delete + BEFORE DELETE ON source + FOR EACH ROW + EXECUTE FUNCTION incr_cfg_delete(); + CREATE TRIGGER trg_rule_incr_cfg_update BEFORE INSERT OR UPDATE ON rule FOR EACH ROW EXECUTE FUNCTION incr_cfg_bump_changed_at(); +CREATE TRIGGER trg_rule_incr_cfg_delete + BEFORE DELETE ON rule + FOR EACH ROW + EXECUTE FUNCTION incr_cfg_delete(); + CREATE TRIGGER trg_rule_escalation_changed_at_relation AFTER INSERT OR UPDATE OR DELETE ON rule_escalation FOR EACH ROW ```
julianbrost commented 5 months ago

My thinking when suggesting to not have changed_at/deleted in every config table but only in the main ones was a bit that should also be the natural thing to do in Web. For example, if you add an recipient somewhere in an escalation, you do this on the edit event rule page and you click save there, so there it would only be necessary to update rule.changed_at when the form is saved, all the changes to other tables would be covered implicitly by that.

Now this turned in to quite a bit of complex trigger mechanics instead. I would definitely consider adding these two columns to all tables instead of requiring three variants of the trigger functions. Is everything you implemented in PostgreSQL triggers also possible with MySQL and MariaDB?

as well as raising the deleted column instead of actually performing the DELETE command.

~Does raising have any special meaning in this context? Or is it simply supposed to say SET deleted = 'y'?~

~Also, I'm not yet sure how good the idea is to hook DELETE in a way that it doesn't actually delete, i.e. might this be too surprising so that this could result in unexpected behavior? The only downside I see with executing an UPDATE ... SET deleted = 'y' instead would be that with the current current distinction into main/child config tables, you'd have to make a difference between those. If we'd decide to add the columns to every table, this would no longer be the case. (But on the other hand, there would then be a difference between adding a new group membership and reviving a previously deleted one.)~

Edit: this part became irrelevant by itself, I wrote this comment before seeing the previous comment.

sukhwinder33445 commented 5 months ago

I would say that if for example the rule is updated via form (rule.name ="UpdatedName"), the changed_at column should be updated automatically. Maridb/Mysql can do this with ON UPDATE. If an INSERT/UPDATE/DELETE takes place in the related (child) tables, the web should (can) update the changed_at column of the main(parent) table manually.

julianbrost commented 5 months ago

I would say that if for example the rule is updated via form (rule.name ="UpdatedName"), the changed_at column should be updated automatically.

If an INSERT/UPDATE/DELETE takes place in the related (child) tables, the web should (can) update the changed_at column of the main(parent) table manually.

Why do you want to make a distinction here? Would that make anything easier for web?

Maridb/Mysql can do this with ON UPDATE.

Does this even work here? Reads like this is a magic feature just for TIMESTAMP and DATETIME which we don't use.

sukhwinder33445 commented 5 months ago

Why do you want to make a distinction here? Would that make anything easier for web?

Yes. For example, if the rule name is changed, the form returns new values that we can write directly to $db->update('rule', $formValues);. But now $formValues also needs the column changed_at. Adding this column is no big deal, but if db can do it itself, why should we do it manually. This will also prevent future errors if someone forgets to add this column to the $formValues.

Does this even work here? Reads like this is a magic feature just for TIMESTAMP and DATETIME which we don't use.

But changed_at is handled as timestamp.

Update: I thought it would be simpler than the changes on the web, but it's not, and it's getting more complex with the triggers. And the data type is not a real timestamp. I can understand your reasoning now.

Web should do all that manually, and the triggers can be removed.

oxzi commented 4 months ago

While the triggers allowed me to offload a lot of complexity to the database, it now went back to the notification daemon. I hope that I have found most of the obvious bugs that can occur, especially during partial updates. This version now works "trigger-free" and has the two columns in all tables, including relationship tables. Please give it a try and report back.

oxzi commented 4 months ago

Small note, mostly for future me: I have just rebased against the current main, nothing more. I am going to address @yhabteab's comments later. Furthermore, I talked with @julianbrost about the ConfigSet, which could contain every table/relation instead of groups of relations, easing fetching logic and allowing more unification.

oxzi commented 4 months ago

TODO: Fetch each table by its own. This "flattening" should ease the synchronization and might even allow more code reusage.

sukhwinder33445 commented 3 months ago

Are the sql files missing by mistake, or have you decided for a new implementation? I assume that all tables mentioned here and the referenced(child) tables get the deleted and changed_at flag (as in the recently removed sql file). The implementation in the web is based on this assumption.

oxzi commented 3 months ago

@sukhwinder33445: Sorry, the schema and an update are now part of the work in progress commit.

oxzi commented 3 months ago

You have defined json tags for all ChangedAt and Deleted fields, but seem to ignore them completely in all update function hooks. Is that intentional? If so, then the json tags are probably useless.

I have completed only half of the changes so far! But what about this question?

@yhabteab: I have already removed the json tags as they were, as you mentioned, useless. My initial idea was to refer to them within the JSON dump, but as most other fields are missing such a tag, I removed them here as well.

yhabteab commented 3 months ago

You have defined json tags for all ChangedAt and Deleted fields, but seem to ignore them completely in all update function hooks. Is that intentional? If so, then the json tags are probably useless.

I have completed only half of the changes so far! But what about this question?

@yhabteab: I have already removed the json tags as they were, as you mentioned, useless. My initial idea was to refer to them within the JSON dump, but as most other fields are missing such a tag, I removed them here as well.

I mean, okay, you removed the json tag completely and my comment didn't even propose to remove the tags at all, but rather if you think these fields should also be included in the config dump, you need to update them regularly like the other fields, otherwise just exclude them via json:"-". Currently, they should still appear in the config dump with false-positive values, since both fields are exported.

oxzi commented 2 months ago

I have tried to work on all of your comments and closed those discussions I though I have properly addressed. Please feel free to reopen those, if you think otherwise.

Furthermore, I made sure that the changed_at value is also always being updated, as @yhabteab pointed out.

Please take another look and sorry for the broken diff as I needed to rebase the code :/

oxzi commented 2 months ago

Next to addressing @julianbrost's comments, I have introduced two additional changes:

  1. The configuration verification will only be executed if any changes were actually performed and not for no-ops. Therefore I introduced a small utils.Flag helper-type.
  2. The type of Rotation.Priority was changed from int32 to sql.NullInt32 as the column is defined as NULLable in the schema and, as it happened, somehow became NULL in my testing setup, resulting in SQL type errors and a broken synchronization.
oxzi commented 2 months ago

I have changed the detection of initially deleted elements to cover this special case of the latest change being a deletion. Regarding the log entry flood of created elements, I reconsidered the current log levels and have degraded everything except errors and the summary after changes to DEBUG.

julianbrost commented 2 months ago

I was only able to get the error again by manually editing some deleted rotation members in the database.

2024-07-10T11:00:42.886+0200    ERROR   runtime-updates Creating configuration element failed   {"table": "rotation_member", "id": 30, "new": {"id": 30, "rotation_id": 16, "contact_group_id": 1}, "error": "creation callback error, schedule rotation member refers unknown rotation 16"}

Did editing involve setting deleted = 'n' again on that row? If yes, that's not something that's really intended to be done (or if you do, you'd at least have to undelete all other dependencies as well). Otherwise, that might be a bug as you shouldn't be able to end up in the "Creating configuration element" case with a deleted = 'y' row.

oxzi commented 2 months ago

I was only able to get the error again by manually editing some deleted rotation members in the database.

2024-07-10T11:00:42.886+0200    ERROR   runtime-updates Creating configuration element failed   {"table": "rotation_member", "id": 30, "new": {"id": 30, "rotation_id": 16, "contact_group_id": 1}, "error": "creation callback error, schedule rotation member refers unknown rotation 16"}

Did editing involve setting deleted = 'n' again on that row? If yes, that's not something that's really intended to be done (or if you do, you'd at least have to undelete all other dependencies as well). Otherwise, that might be a bug as you shouldn't be able to end up in the "Creating configuration element" case with a deleted = 'y' row.

The actual error from the "creation callback [states, that the] schedule rotation member refers unknown rotation 16". The creation callback cannot[^1] be called upon an deleted = 'y' element. Thus, I would guess that you have manually revived a Rotation Member belonging to a still deleted Rotation? Otherwise, something would be wrong.

[^1]: Should not. I really hope so, at least.

yhabteab commented 2 months ago

Thus, I would guess that you have manually revived a Rotation Member belonging to a still deleted Rotation?

That's exactly what I mean when I wrote "by manually editing". Btw. just updated my local branch for the web part and now it is allowing me to delete event rules and noticed this noise.

2024-07-10T15:34:15.325+0200    WARN    incident        Incident refers unknown rule    {"object": "dummy-7!onesecond-0", "incident": "#125", "rule_id": 2}

If you don't drop that faulty rule from incident#Rules, it will always complain about that, and I'm not sure how to handle that properly. Sorry, I didn't notice that previously!

oxzi commented 2 months ago

Btw. just updated my local branch for the web part and now it is allowing me to delete event rules and noticed this noise.

2024-07-10T15:34:15.325+0200    WARN    incident        Incident refers unknown rule    {"object": "dummy-7!onesecond-0", "incident": "#125", "rule_id": 2}

If you don't drop that faulty rule from incident#Rules, it will always complain about that, and I'm not sure how to handle that properly. Sorry, I didn't notice that previously!

Good catch that I was able to reproduce.

There are a few relationship relations between the incident relation - not directly addressed by this PR as it is no configuration - and relations being touched by this PR.

Furthermore, there is a link between the source relation and object, with object being referred by incident and event.

On the other side, there is also available_channel_type being referred by channel, with the latter having the changed_at and deleted columns.

I am currently taking a look how to reflect deletions from the configuration tables there.

oxzi commented 2 months ago

Following new fancy :sparkles: changes were introduced: