Stjubit / TA-alert_forwarder

Splunk Technical Add-on that adds an Alert Action which forwards Alerts to a Splunk HTTP Event Collector
GNU General Public License v3.0

retry behaviour on >5xx series errors? #2

Closed. awx-vsyr closed this issue 2 years ago

awx-vsyr commented 2 years ago

hello,

(very nice addon :) ) Quick question: what is the default behaviour on 5xx errors?
From a glance at https://github.com/Stjubit/TA-alert_forwarder/blob/master/TA-alert_forwarder/bin/ta_alert_forwarder/modalert_forward_alert_to_splunk_hec_helper.py#L46 it looks to be 'don't retry'. Is that correct, or is retrying perhaps implied? (Seems unlikely given it's the raw requests lib.)

Would you be willing to add something like this using backoff? Say, retry 3 times: https://stackoverflow.com/questions/70602830/how-to-retry-python-requests-get-without-using-sessions

or perhaps using a retry adapter: https://majornetwork.net/2022/04/handling-retries-in-python-requests/

:)

Stjubit commented 2 years ago

Hey 👋

Thanks for reporting this enhancement request and the positive feedback!

It totally makes sense to retry requests to the HEC, and I already implemented it using the following strategy:
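
A minimal sketch of what such a retry setup can look like with requests plus urllib3's Retry class (purely illustrative; the function name, defaults, endpoint, and token below are assumptions and not necessarily what the actual commit does):

```python
# Illustrative retry setup: retry HEC POSTs on 5xx responses with exponential backoff.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_hec_session(max_retries=3, backoff_factor=1.0):
    """Return a requests.Session that retries transient HEC failures."""
    retry = Retry(
        total=max_retries,
        backoff_factor=backoff_factor,
        status_forcelist=[500, 502, 503, 504],  # only retry 5xx responses
        allowed_methods=["POST"],               # urllib3 >= 1.26 name (method_whitelist before that)
        raise_on_status=False,
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))
    return session

# Hypothetical usage against a HEC endpoint:
# session = build_hec_session()
# resp = session.post(
#     "https://hec.example.com:8088/services/collector/event",
#     headers={"Authorization": "Splunk <hec-token>"},
#     json={"event": {"search_name": "...", "result": {}}},
#     timeout=30,
# )
# resp.raise_for_status()
```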

There might still be issues with the HEC, like an invalid token, so I'd highly recommend creating an alert for that if forwarded alerts are production-critical. That's why I didn't implement a retry strategy in the first place. Very simple example:

(index="_internal" OR index="cim_modactions") sourcetype="modular_alerts:forward_alert_to_splunk_hec" source="/opt/splunk/var/log/splunk/forward_alert_to_splunk_hec_modalert.log" NOT INFO NOT DEBUG NOT WARN

I fixed this issue in this commit. I use repository mirroring from a private GitLab repo to this public GitHub repo, so I'm sorry that there's no pull request for this. 😄

This will be available in the next release!

~ Julian

awx-vsyr commented 2 years ago

thanks Julian, much appreciated :) Is it worth making this a config setting as well, in case people's environments differ? Also, how exactly does the backoff factor work in the context of a CIM mod alert? Does Splunk cap the maximum execution time, is that adjustable in the add-on manifest, and would the default setting of 5 retries with the default timeout of 30s (and the backoff increasing the interval between tries) exceed the 'time to live' before Splunk decides to kill the custom alert Python process on the SH?
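
As a rough way to reason about that timing question, here is a back-of-the-envelope sketch assuming urllib3-style exponential backoff (the exact sleep formula and cap vary slightly between urllib3 versions) and that every attempt burns the full per-request timeout; the numbers are the ones from the question above, not the add-on's confirmed defaults:

```python
# Rough worst-case runtime for a retried HEC POST.
# Assumption: the sleep before retry n is roughly backoff_factor * 2**(n - 1),
# capped at a backoff maximum (urllib3 commonly caps this at 120s).
def worst_case_seconds(retries=5, timeout=30, backoff_factor=1.0, backoff_max=120):
    request_time = (retries + 1) * timeout  # initial attempt plus each retry hitting the timeout
    sleeps = sum(
        min(backoff_factor * 2 ** (n - 1), backoff_max)
        for n in range(1, retries + 1)
    )
    return request_time + sleeps

print(worst_case_seconds())  # -> 211.0 seconds, i.e. about 3.5 minutes in the worst case
```

Whatever that works out to would then have to fit under the execution cap Splunk applies to the alert action (alert_actions.conf has a maxtime setting for this, if I remember correctly).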

> There might still be issues with the HEC, like an invalid token, so I'd highly recommend creating an alert for that if forwarded alerts are production-critical.

ty, yep, it's always worth keeping an eye on CIM mod action failures in case of unexpected outages or Splunk infra issues, aside from tokens going missing.

awx-vsyr commented 2 years ago

by the way, in terms of a definite monitoring suggestion:

Unable to forward alert to HEC!

that would only be logged on the final failed try, correct?

although >(index="_internal" OR index="cim_modactions") sourcetype="modular_alerts:forward_alert_to_splunk_hec" source="/opt/splunk/var/log/splunk/forward_alert_to_splunk_hec_modalert.log" NOT INFO NOT DEBUG NOT WARN is more generic, since it doesn't look for a specific error (or alternatively something like ERR OR WARN OR EXCEPT*, but wildcards bad :) ), though it's a small log (well, with the additional constraints of sourcetype and source).

awx-vsyr commented 2 years ago

qq 2: what is the process to get from the releases on the GitHub page to the Splunkbase version, if that's alright to ask (for Cloud customers)? Although I suppose one could also upload it as a private app?

awx-vsyr commented 1 year ago

@Stjubit hello Julian, could you please confirm the status of the Splunk Cloud app store release?