Graceful handling of data from Redis that fails to be parsed

julianbrost commented 1 year ago

At the moment, Icinga DB completely relies on Icinga 2 that there is no data present in Redis it can't process. If this assumption is violated, Icinga DB simply crashes at the moment. There already were a few instances where such bugs were reported, for example:

Not writing data that the receiver does not understand at all is a good idea, but in order to make Icinga DB more robust, we should implement more graceful/fine-grained error handling in Icinga DB as well.

Task

Go through all code that reads from Redis and the errors that can occur, especially where data is parsed. Check if there is a more graceful way of handling the error than just bailing out. For example, for the object sync, an individual object that fails to be parsed could just be skipped. Summarize your findings in a comment to this issue and propose changes where and how things could be handled better.

Al2Klimov commented 1 year ago

propose changes where and how things could be handled better.

a health check bit in the DB telling the user to look into logs
a new sublogger (grep-able 👍) triggering the latter, probably even named health, for the following
(overdue: nothing to do here)
config+runtime: can't decode? – ignore it. (I guess. Object( update)s would disappear.)
history: here we can't do the above (ex. for state+notification) as an end event upserts its start event. But here we have History(Table)Meta we can fall back to. This gives us the object identification we can blacklist in memory for future successfully decoded events. However, ignored stuff is kept in Redis!

OK?

Al2Klimov commented 1 year ago

Almost forgot: to be honest, I'd prefer not to do anything at all here. As Eric said, our main problem is string/uint discrepancy which we fixed in 2.14/2.13.8, not individual values being provocatively out of range. E.g. if you've set you check timeout to 2^64, well, please set it back to an actual value. Or, if you've got an actual problem, please report it.

A crash is a good motivator for the latter and a reasonable one if(!) 2.14/2.13.8 is enough which it is IMAO. In contrast, a health check is easy to ignore.

julianbrost commented 1 year ago

a health check bit in the DB telling the user to look into logs

a new sublogger (grep-able +1) triggering the latter, probably even named health, for the following

These two points are for #608.

(overdue: nothing to do here)

Why? I mean this probably only takes very few inputs so likely not our biggest concern.

config+runtime: can't decode? – ignore it. (I guess. Object( update)s would disappear.)

For the initial sync (i.e. the one computing the delta first), would this leave the object as-is in the database or delete it (that might happen if you treat it like it wouldn't exist in Redis)?

history: here we can't do the above (ex. for state+notification) as an end event upserts its start event. But here we have History(Table)Meta we can fall back to. This gives us the object identification we can blacklist in memory for future successfully decoded events. However, ignored stuff is kept in Redis!

And if even that part wouldn't be decodable, it would stop the sync for the entire type? Or discard it and go on as there probably isn't hope in fixing it either?

Almost forgot: to be honest, I'd prefer not to do anything at all here. As Eric said, our main problem is string/uint discrepancy which we fixed in 2.14/2.13.8, not individual values being provocatively out of range. E.g. if you've set you check timeout to 2^64, well, please set it back to an actual value. Or, if you've got an actual problem, please report it.

I stick to "if it crashes, it's a bug", we can argue about the priority. It just doesn't look good if you can (accidentally or on purpose) input a value and this knocks out significant parts of your monitoring.

A crash is a good motivator for the latter and a reasonable one if(!) 2.14/2.13.8 is enough which it is IMAO. In contrast, a health check is easy to ignore.

What you call a good motivator potentially stops people from working with Icinga Web. If anything, that should be an argument for not having a "hide warning entirely" button.

Al2Klimov commented 1 year ago

(overdue: nothing to do here)

Why? I mean this probably only takes very few inputs so likely not our biggest concern.

It even is not a concern at all. It doesn't suffer from such problems at all. By design. So there's nothing to do here.

config+runtime: can't decode? – ignore it. (I guess. Object( update)s would disappear.)

For the initial sync (i.e. the one computing the delta first), would this leave the object as-is in the database or delete it (that might happen if you treat it like it wouldn't exist in Redis)?

I'd do the latter.

more motivating to fix the error
less confusing than it's there, but not up to date

history: here we can't do the above (ex. for state+notification) as an end event upserts its start event. But here we have History(Table)Meta we can fall back to. This gives us the object identification we can blacklist in memory for future successfully decoded events. However, ignored stuff is kept in Redis!

And if even that part wouldn't be decodable, it would stop the sync for the entire type? Or discard it and go on as there probably isn't hope in fixing it either?

Oh. 😅 I'd stop the sync for the entire type. This -if occurs- is a serious problem.

If anything, that should be an argument for not having a "hide warning entirely" button.

@nilmerg Can we do this?

Icinga / icingadb

Graceful handling of data from Redis that fails to be parsed #607

Task